  1. GCC

  2. readline(-devel)

  3. MariaDB(-devel)

  4. Perl(-devel)

  5. lua(-devel)

  6. cURL (curl & libcurl(-devel))

  7. JSON (json-c(-devel))

  8. munge(-devel)

Code Block
[root@tmgt1 ~]# yum -y install \
  gcc \
  readline readline-devel \
  mariadb mariadb-devel \
  perl perl-devel \
  lua lua-devel \
  curl libcurl libcurl-devel \
  json-c json-c-devel \
  munge munge-devel munge-libs
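
After the install, it can be worth confirming that nothing was skipped. The following is a minimal sketch, assuming an RPM-based system where `rpm -q` exits non-zero for a package that is not installed; the function name `missing_packages` is hypothetical and not part of any tool.

```shell
# Print every package from the argument list that rpm does not report as installed.
# Assumption: an RPM-based system (rpm -q exits non-zero for a missing package).
missing_packages() {
  local pkg
  for pkg in "$@"; do
    rpm -q "$pkg" >/dev/null 2>&1 || echo "$pkg"
  done
}
```

For example, `missing_packages gcc readline-devel mariadb-devel munge-devel` should print nothing when all four are present.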


PMIx is used to exchange information about the communications and launching platforms of parallel applications (e.g., mpirun, srun). The PMIx implementation is a launcher that prefers to communicate in conjunction with the job scheduler rather than through the older RSH/SSH methods. At high node counts, application start-up time can also be reduced significantly compared to the older ring start-up or even the Hydra implementation. The version presently installed was built as an RPM and either inserted into the images or installed via a repo. The version in EPEL is too old.

Code Block
powerman $ rpmbuild ...


This is presently installed from an RPM, which is a fine way to go, but instructions for a source build are below. Instructions for building the RPM will be added later. The version currently installed is NOT from EPEL; that version was too old.

Code Block
powerman $ wget

powerman $ tar xf ucx-1.6.1.tar.gz

powerman $ cd ucx-1.6.1

powerman $ ./configure \
  --prefix=/apps/s/ucx/1.6.1 \
  --enable-devel-headers \
  --with-verbs \
  --with-rdmacm \
  --with-rc \
  --with-ud \
  --with-dc

powerman $ make -j8

powerman $ make install

RPM Build

Code Block


Compile HDF5 from source and put it in a systems directory not to be confused with the user-accessible HDF5 installations that may have additional dependencies like Intel compilers or an MPI implementation.

Code Block
powerman $ cd /software/slurm

powerman $ tar xf hdf5-1.10.5.tar.bz2

powerman $ cd hdf5-1.10.5

powerman $ ./configure --prefix=/apps/s/hdf5/1.10.5

powerman $ make -j4

powerman $ make install

If you keep track of ABI compatibility (you're a sysadmin, you should), then you may want to make the link to the "latest" release of this in the parent of the installed directory as shown below.

Code Block
powerman $ cd /apps/s/hdf5

powerman $ ln -s 1.10.5 latest


Use the ultra-stable version of hwloc and install it in a global location. Users can link against it if needed; otherwise, a separate hwloc installation can be provided for user applications. This installation exists specifically for the cgroup support used by Slurm.

Code Block
powerman $ cd /software/slurm

powerman $ tar xf hwloc-1.11.13.tar.bz2

powerman $ cd hwloc-1.11.13

powerman $ ./configure --prefix=/apps/s/hwloc/1.11.13

powerman $ make -j4

powerman $ make install

If you keep track of ABI compatibility (you're a sysadmin, you should), then you may want to make the link to the "latest" release of this in the parent of the installed directory as shown below.

Code Block
powerman $ cd /apps/s/hwloc

powerman $ ln -s 1.11.13 latest
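
When a newer release is installed later, the existing "latest" link has to be replaced rather than created. The following is a minimal sketch of one way to do that for either HDF5 or hwloc; the function name `set_latest` is hypothetical, and it assumes the versioned directory already exists under the parent.

```shell
# Point <parent>/latest at <version>, replacing any existing link.
# Assumption: <parent>/<version> is an installed prefix such as /apps/s/hwloc/1.11.13.
set_latest() {
  local parent="$1" version="$2"
  if [ ! -d "$parent/$version" ]; then
    echo "no such install: $parent/$version" >&2
    return 1
  fi
  # -n keeps ln from descending into the directory the old link points to.
  ln -sfn "$version" "$parent/latest"
}
```

For example, `set_latest /apps/s/hwloc 1.11.13` leaves a relative link, the same as the manual `ln -s` above.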


Assuming the downloaded tarball is 'slurm-19.05.5.tar.bz2', follow the instructions below. Also, reference the symbolic links for the HDF5 and hwloc libraries.

Code Block
powerman $ cd /software/slurm

powerman $ tar xf slurm-19.05.5.tar.bz2

powerman $ cd slurm-19.05.5

powerman $ ./configure \
  --prefix=/apps/s/slurm/19.05.5 \
  --with-hdf5=/apps/s/hdf5/latest/bin/h5cc

powerman $ make -j8

powerman $ make install

Additional Features / Utilities

Code Block
powerman $ cd contribs
powerman $ make

powerman $ for i in lua openlava perlapi seff sjobexit torque; do
  cd "$i"
  make install
  cd -
done

PAM libraries are special: if you want to use them, they go in a specific node-local location, generally /usr/lib64/security/. Consult your PAM manual pages to understand the implications of this, and make sure you understand PAM before installing them.

Code Block
powerman $ for i in pam pam_slurm_adopt; do
  cd "$i"
  echo "$i"
  cd -
done


Final Installation Process

Code Block
powerman $ unlink /apps/s/slurm/19.05

powerman $ ln -s /apps/s/slurm/19.05.5 /apps/s/slurm/19.05

powerman $ ln -s /apps/s/slurm/etc /apps/s/slurm/19.05/etc

powerman $ ln -s /apps/s/slurm/var /apps/s/slurm/19.05/var


The munge daemon is used to validate messages between Slurm daemons to make sure that users are who they claim to be. Realistically, this only needs to be configured once, at the first installation of Slurm; the same munge key is then reused on subsequent updates unless you're security paranoid. To generate a decent munge key, use dd with either the /dev/random or /dev/urandom generator.

Code Block
root@tmgt1:~# dd if=/dev/random bs=1 count=1024 >/etc/munge/munge.key

root@tmgt1:~# chown munge:munge /etc/munge/munge.key

root@tmgt1:~# chmod 600 /etc/munge/munge.key

You then need to place the munge key and daemon on every server from which you intend to run Slurm commands.
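
Since the same key ends up on many nodes, a quick sanity check of each copy can catch a bad transfer or wrong permissions. The following is a minimal sketch; the function name `check_munge_key` is hypothetical, and it assumes GNU `stat` and a 1024-byte key as written by the dd command above.

```shell
# Check that a munge key has the size and mode produced by the steps above.
# Assumptions: GNU stat (stat -c), a 1024-byte key as written by the dd command.
check_munge_key() {
  local key="$1" size mode
  [ -f "$key" ] || { echo "missing: $key" >&2; return 1; }
  size="$(wc -c < "$key")"
  mode="$(stat -c '%a' "$key")"
  [ "$size" -eq 1024 ] || { echo "unexpected size: $size bytes" >&2; return 1; }
  [ "$mode" = "600" ] || { echo "unexpected mode: $mode" >&2; return 1; }
}
```

Run it as `check_munge_key /etc/munge/munge.key` on each node. Once the daemon is running, `munge -n | unmunge` gives an end-to-end check of credential encoding and decoding on that node.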


Then you'll need to create the Slurm database and set up the user and password that slurmdbd uses to communicate with the MariaDB server. In this case the communication should happen over localhost, so configure the credentials for a localhost connection.
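
The database and credentials can be set up with a couple of SQL statements. The following is a minimal sketch that only prints the SQL so it can be reviewed first. The function name `slurm_db_sql` is hypothetical; the database name `slurm_acct_db` and the user name `slurm` are assumptions and must match StorageLoc and StorageUser in your slurmdbd.conf.

```shell
# Emit the SQL needed for slurmdbd to reach MariaDB over localhost.
# Assumptions: database name slurm_acct_db, database user "slurm";
# the password is a placeholder supplied by the caller and must not contain quotes.
slurm_db_sql() {
  local user="$1" pass="$2" db="${3:-slurm_acct_db}"
  cat <<EOF
CREATE DATABASE IF NOT EXISTS ${db};
GRANT ALL ON ${db}.* TO '${user}'@'localhost' IDENTIFIED BY '${pass}';
EOF
}
```

After reviewing the output, it could be applied with something like `slurm_db_sql slurm 'changeme' | mysql -u root -p`.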


It's very important to Slurm that upgrades happen in a certain order, so that service continues uninterrupted and clients can still talk to servers over backward-compatible communication schemes. Specifically, the ordering is as follows: the database daemon (slurmdbd) first, then the controller (slurmctld), then the compute node daemons (slurmd).


The Slurm database is a critical component of the ARCC infrastructure, and keeping accounts and investors in line relies extensively on this database being active. Therefore, it's quite important to perform a backup of the database before attempting an upgrade. Use the normal MySQL/MariaDB backup capability to accomplish this. Also be aware that ARCC does not prune the database so far, but that may become an issue later if more high throughput computing is introduced.
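
The backup itself can be done with `mysqldump`. The following is a minimal sketch; the function name `backup_slurm_db` is hypothetical, the database name `slurm_acct_db` is an assumption, and it presumes credentials are already available to `mysqldump` (for example through ~/.my.cnf).

```shell
# Dump one database to a timestamped file and print the file name.
# Assumptions: mysqldump is on PATH with credentials available (e.g. ~/.my.cnf),
# and the accounting database is named slurm_acct_db unless given as $2.
backup_slurm_db() {
  local outdir="$1" db="${2:-slurm_acct_db}"
  local out="$outdir/${db}-$(date +%Y%m%d-%H%M%S).sql"
  # --single-transaction takes a consistent snapshot of InnoDB tables.
  if mysqldump --single-transaction "$db" > "$out"; then
    echo "$out"
  else
    rm -f "$out"
    return 1
  fi
}
```

Run it with slurmdbd stopped, for example `backup_slurm_db /root/backups`, and keep the resulting file until the upgrade has been verified.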

Code Block
[root@tmgt1]# ssh tdb1

[root@tdb1]# systemctl stop slurmdbd.service

[root@tdb1]# ## PERFORM DB BACKUP ##

[root@tdb1]# install -m 0644 /software/slurm/slurm-19.05.5/etc/slurmdbd.service /etc/systemd/system/slurmdbd.service

[root@tdb1]# systemctl daemon-reload

[root@tdb1]# su - slurm

bash-4.2$ cd /tmp

bash-4.2$ /apps/s/slurm/19.05.5/sbin/slurmdbd -D -vvv

Wait for slurmdbd to finish making the necessary database changes and resume normal operations. Once the changes are done, press Ctrl-C to interrupt the process and continue:

Code Block
bash-4.2$ ^C

bash-4.2$ exit

[root@tdb1]# systemctl start slurmdbd.service

[root@tdb1]# exit


Make sure the database has been properly updated before upgrading the controller.

Code Block
[root@tmgt1]# systemctl stop slurmctld.service

[root@tmgt1]# install -m 0644 /software/slurm/slurm-19.05.5/etc/slurmctld.service /etc/systemd/system/slurmctld.service

[root@tmgt1]# systemctl daemon-reload

[root@tmgt1]# systemctl start slurmctld.service


If you don't want to reboot nodes because it would take too long or for some other reason, they can be updated live. This isn't a perfect method, but it should work.

Code Block
root@tmgt1:~# pscp -f 20 /software/slurm/latest/etc/slurmd.service moran,teton:/etc/systemd/system/slurmd.service

root@tmgt1:~# psh -f 20 moran,teton systemctl stop slurmd.service

root@tmgt1:~# psh -f 20 moran,teton systemctl daemon-reload

root@tmgt1:~# psh -f 20 moran,teton systemctl start slurmd.service

root@tmgt1:~# psh -f 20 moran,teton "systemctl is-active slurmd.service" | xcoll


The RHEL image that is presently booted is t2018.08. The part of Slurm relevant to the compute node is the systemd service file. The compute.postinstall script copies this file into the appropriate location, provided the symbolic link in the /software/slurm directory is correct (i.e., latest -> slurm-19.05.5). Then, as the root user, use the standard xCAT tools to generate the image (genimage) and to pack and compress it (packimage).

Code Block
powerman $ unlink /software/slurm/latest

powerman $ ln -s /software/slurm/slurm-19.05.5 /software/slurm/latest

powerman $ # Need to switch to root user

root@tmgt1:~# genimage t2019.08

root@tmgt1:~# packimage -c pigz -m tar t2019.08


Check command versions ...

Code Block
$ sinfo --version

$ squeue --version

$ sbatch --version

$ scontrol --version
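
The same check can be run in a loop to flag any command that still reports the old release. The following is a minimal sketch; the function name `check_slurm_versions` is hypothetical, and it assumes each command prints a single line of the form `slurm <version>`.

```shell
# Compare the --version output of each listed command against an expected release.
# Assumption: each command prints exactly "slurm <version>" for --version.
check_slurm_versions() {
  local expected="$1"; shift
  local cmd got rc=0
  for cmd in "$@"; do
    got="$("$cmd" --version 2>/dev/null)"
    if [ "$got" != "slurm $expected" ]; then
      echo "MISMATCH: $cmd reports '${got}'"
      rc=1
    fi
  done
  return "$rc"
}
```

For example, `check_slurm_versions 19.05.5 sinfo squeue sbatch scontrol` prints nothing and returns success when all four agree.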

Controller & Database Checks
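
One possible check is a pair of read-only queries, one against slurmctld and one against slurmdbd. The following is a minimal sketch, not the site's established procedure; the function name `check_controller_and_db` is hypothetical, and relying on the exit status of these commands is an assumption, so read the printed output as well.

```shell
# Run two read-only queries: one against slurmctld, one against slurmdbd.
# Assumptions: the Slurm client commands are on PATH, and a failed query
# is reflected in the command's exit status.
check_controller_and_db() {
  local rc=0
  # scontrol ping prints the state of the controller(s).
  scontrol ping || rc=1
  # sacctmgr talks to slurmdbd; listing clusters needs a reachable daemon.
  sacctmgr --noheader --parsable2 show cluster || rc=1
  return "$rc"
}
```

Run `check_controller_and_db` from the management node after both daemons have been restarted.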


Intel MPI attempts to use the PMI library when used with Slurm, which is good. New versions also support PMIx, which provides interfaces that are backwards compatible with the PMI1 and PMI2 interfaces.

Code Block
$ salloc --nodes=2 --account=arcc --time=20


$ srun ...

You shouldn't get any PKVS failures if everything is working properly.
