The Slurm Workload Manager is a powerful and flexible workload manager used to schedule jobs on high performance computing (HPC) clusters. Slurm can be used to schedule jobs, control resource access, provide fairshare, implement preemption, and keep records. All compute activity should be performed from within a Slurm resource allocation (i.e., a job). This page details the prerequisites for using Slurm, the software and extensions needed to run it, configuration information, and directions for updating Slurm.
ARCC utilizes Slurm on Teton, Mount Moran, and Loren.
Training
Slurm Cheat Sheet (PDF to be uploaded)
Compiling and Installing
Prerequisites
Red Hat / EPEL provided dependencies (use yum with appropriate repositories configured):
GCC
readline(-devel)
MariaDB(-devel)
Perl(-devel)
lua(-devel)
cURL(curl & libcurl(-devel))
JSON (json-c(-devel))
munge(-devel)
[root@tmgt1 ~]# yum -y install \
    gcc \
    readline readline-devel \
    mariadb mariadb-devel \
    perl perl-devel \
    lua lua-devel \
    curl libcurl libcurl-devel \
    json-c json-c-devel \
    munge munge-devel munge-libs
Mellanox provided dependencies (Use mlnxofedinstall script)
libibmad(-devel)
libibumad(-devel)
ARCC Supplied RPMs
PMIx
UCX (Slurm is not currently configured to use it)
PMIx
PMIx is used to exchange information about the communications and launching platforms of parallel applications (i.e., mpirun, srun, etc.). The PMIx implementation prefers to do its communication in conjunction with the job scheduler rather than using older RSH/SSH methods. The time to start applications can also be reduced significantly at high node counts compared to the older ring start-up or even the Hydra implementation. The version presently installed was built as an RPM and inserted into the images or installed via repo. The version in EPEL is too old.
powerman $ rpmbuild ...
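The rpmbuild invocation above is intentionally abbreviated and site-specific. One possible approach, not verified here, is to build directly from the upstream release tarball, which bundles an RPM spec file (the version and download URL are assumptions):
powerman $ wget https://github.com/pmix/pmix/releases/download/v3.1.5/pmix-3.1.5.tar.bz2
powerman $ rpmbuild -ta pmix-3.1.5.tar.bz2   # -ta builds RPMs from a tarball containing a spec file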
UCX
UCX is presently installed as an RPM, which is a fine way to go as well, but instructions for compiling from source are below. Instructions for building the RPM will be added as well. The version that is currently installed is NOT from EPEL; that version was too old.
powerman $ wget https://github.com/openucx/ucx/releases/download/v1.6.1/ucx-1.6.1.tar.gz
powerman $ tar xf ucx-1.6.1.tar.gz
powerman $ cd ucx-1.6.1
powerman $ ./configure \
    --prefix=/apps/s/ucx/1.6.1 \
    --enable-devel-headers \
    --with-verbs \
    --with-rdmacm \
    --with-rc \
    --with-ud \
    --with-dc \
    --with-dm
powerman $ make -j8
powerman $ make install
RPM Build
TODO
HDF5
Compile HDF5 from source and put it in a systems directory. This is not to be confused with the user-accessible HDF5 installations, which may have additional dependencies such as the Intel compilers or an MPI implementation.
powerman $ cd /software/slurm
powerman $ tar xf hdf5-1.10.5.tar.bz2
powerman $ cd hdf5-1.10.5
powerman $ ./configure --prefix=/apps/s/hdf5/1.10.5
powerman $ make -j4
powerman $ make install
If you keep track of ABI compatibility (you're a sysadmin, you should), you may want to create a symbolic link to the "latest" release in the parent of the installed directory, as shown below.
powerman $ cd /apps/s/hdf5
powerman $ ln -s 1.10.5 latest
hwloc
Use the ultra-stable version of hwloc and install it in a global location. It can be used by users if needed, but otherwise a separate installation can be provided for the user-facing hwloc library when necessary. This installation specifically addresses cgroup support within the system and is used by Slurm.
powerman $ cd /software/slurm
powerman $ tar xf hwloc-1.11.13.tar.bz2
powerman $ cd hwloc-1.11.13
powerman $ ./configure --prefix=/apps/s/hwloc/1.11.13
powerman $ make -j4
powerman $ make install
If you keep track of ABI compatibility (you're a sysadmin, you should), you may want to create a symbolic link to the "latest" release in the parent of the installed directory, as shown below.
powerman $ cd /apps/s/hwloc
powerman $ ln -s 1.11.13 latest
Slurm
Download the latest version that's available. Slurm is on a 9-month release cycle, but specific versions often receive bug fixes, CVE patches, and occasionally hotfixes that need to be applied.
Assuming the downloaded version is 'slurm-19.05.5.tar.bz2', the instructions are below. Note that the configure step references the "latest" symbolic links for the HDF5 and hwloc libraries.
powerman $ cd /software/slurm
powerman $ tar xf slurm-19.05.5.tar.bz2
powerman $ cd slurm-19.05.5
powerman $ ./configure \
    --prefix=/apps/s/slurm/19.05.5 \
    --with-hdf5=/apps/s/hdf5/latest/bin/h5cc \
    --with-hwloc=/apps/s/hwloc/latest
powerman $ make -j8
powerman $ make install
Additional Features / Utilities
powerman $ cd contribs
powerman $ make
powerman $ for i in lua openlava perlapi seff sjobexit torque; do
    cd $i
    make install
    cd -
done
PAM libraries are special: if you want to use them, they go in a specific node-local location, generally /usr/lib64/security/. Consult the PAM manual pages to understand the implications of this, and make sure you understand PAM before enabling anything.
powerman $ for i in pam pam_slurm_adopt; do
    cd $i
    echo $i
    cd -
done
NOTE: Honestly, the only one worth going after, if you find users abusing nodes, is the pam_slurm_adopt module, which adopts user processes into their job's cgroup and denies users access to a node when they do not have a job running on it. Additionally, remember that configuring PAM is more than just installing libraries; the PAM stack (/etc/pam.d/...) will need to be modified appropriately. An example will be posted at a later date.
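As a rough illustration only (not ARCC's actual configuration), pam_slurm_adopt is typically enabled by adding an account entry to the SSH PAM stack; the exact placement and options depend on the existing stack and should be reviewed before use:
# /etc/pam.d/sshd (illustrative snippet; placement and options are assumptions)
account    required     pam_slurm_adopt.so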
Final Installation Process
powerman $ unlink /apps/s/slurm/19.05
powerman $ ln -s /apps/s/slurm/19.05.5 /apps/s/slurm/19.05
powerman $ ln -s /apps/s/slurm/etc /apps/s/slurm/19.05/etc
powerman $ ln -s /apps/s/slurm/var /apps/s/slurm/19.05/var
First Installation & Configuration
Munge
The munge daemon is used to authenticate messages between Slurm daemons and verify that users are who they claim to be. Realistically, this only needs to be configured once at the first installation of Slurm, and the same munge key can be reused on subsequent updates unless you're security paranoid. To generate a decent munge key, use dd with either the /dev/random or /dev/urandom generator.
root@tmgt1:~# dd if=/dev/random bs=1 count=1024 >/etc/munge/munge.key
root@tmgt1:~# chown munge:munge /etc/munge/munge.key
root@tmgt1:~# chmod 600 /etc/munge/munge.key
You then need to place the munge key on, and run the munge daemon on, all servers where you intend to run Slurm commands (a distribution sketch follows the list):
tmgt1
tdb1
tlog1
tlog2
ALL compute nodes
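A sketch of distributing the key with the same xCAT pscp/psh tools used elsewhere on this page; the node ranges mirror the list above and may need adjusting:
root@tmgt1:~# pscp -f 20 /etc/munge/munge.key tdb1,tlog1,tlog2,moran,teton:/etc/munge/munge.key
root@tmgt1:~# psh -f 20 tdb1,tlog1,tlog2,moran,teton "chown munge:munge /etc/munge/munge.key && chmod 600 /etc/munge/munge.key"
root@tmgt1:~# psh -f 20 tdb1,tlog1,tlog2,moran,teton "systemctl restart munge.service"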
MariaDB
To use proper accounting with fairshare and associations, Slurm requires a database, i.e., a MySQL-compatible server such as MariaDB. MariaDB should therefore be configured on the tdb1 node to take advantage of the SSD storage in that system. Configure the basics of a normal MariaDB installation (root account, basic setup, etc.).
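A minimal sketch of that basic setup, assuming the stock RHEL/CentOS mariadb-server package and interactive hardening:
[root@tdb1]# yum -y install mariadb-server
[root@tdb1]# systemctl enable --now mariadb.service
[root@tdb1]# mysql_secure_installation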
Then add the slurm database and create the user and password that slurmdbd will use to communicate with the MariaDB server. In this case the communication should happen over localhost, so configure the credentials accordingly.
See the Slurm documentation for configuring this along with the slurmdbd.conf file.
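For illustration, the database and credentials might be created along these lines; the database name slurm_acct_db matches the slurmdbd default, and the password is a placeholder:
[root@tdb1]# mysql -u root -p
MariaDB [(none)]> create database slurm_acct_db;
MariaDB [(none)]> create user 'slurm'@'localhost' identified by 'CHANGE_ME';
MariaDB [(none)]> grant all on slurm_acct_db.* to 'slurm'@'localhost';
MariaDB [(none)]> flush privileges;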
Performing Upgrades
It's very important that Slurm upgrades happen in a certain order to make sure continuous service is provided; the communication scheme is backward compatible, with older clients able to talk to newer servers. Specifically, the ordering is as follows:
Slurm database
Slurm controller
Slurm compute nodes
Updating the Slurm Database
The Slurm database is a critical component in the ARCC infrastructure, and keeping accounts and investorship in line relies extensively on this database being active. Therefore, it is quite important to back up the database before attempting an upgrade. Use the normal MySQL/MariaDB backup capability to accomplish this. Also be aware that ARCC does not prune the database so far, but that may become an issue later if more high-throughput computing is introduced.
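One way to take that backup, corresponding to the "## PERFORM DB BACKUP ##" step in the sequence below (a sketch assuming the default slurm_acct_db database name and a scratch location for the dump):
[root@tdb1]# mysqldump --single-transaction slurm_acct_db > /root/slurm_acct_db-$(date +%F).sql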
[root@tmgt1]# ssh tdb1
[root@tdb1]# systemctl stop slurmdbd.service
[root@tdb1]# ## PERFORM DB BACKUP ##
[root@tdb1]# install -m 0644 /software/slurm/slurm-19.05.5/etc/slurmdbd.service /etc/systemd/system/slurmdbd.service
[root@tdb1]# systemctl daemon-reload
[root@tdb1]# su - slurm
bash-4.2$ cd /tmp
bash-4.2$ /apps/s/slurm/19.05.5/sbin/slurmdbd -D -vvv
Wait for slurmdbd to make all of the necessary database changes and resume normal operation. Once the changes are done, press Ctrl-C to interrupt the process, exit back to root, and restart the service:
bash-4.2$ ^C
bash-4.2$ exit
[root@tdb1]# systemctl start slurmdbd.service
[root@tdb1]# exit
Updating the Slurm Controller
Make sure the database has been properly updated prior to upgrading the controller.
[root@tmgt1]# systemctl stop slurmctld.service
[root@tmgt1]# install -m 0644 /software/slurm/slurm-19.05.5/etc/slurmctld.service /etc/systemd/system/slurmctld.service
[root@tmgt1]# systemctl daemon-reload
[root@tmgt1]# systemctl start slurmctld.service
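A quick sanity check after the restart might look like the following:
[root@tmgt1]# systemctl is-active slurmctld.service
[root@tmgt1]# scontrol ping
[root@tmgt1]# scontrol version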
Updating the Slurm Compute Nodes
Running Nodes
If you don't want to reboot nodes because it would take too long, or for some other reason, they can be updated live. This isn't a perfect method, but it should work.
root@tmgt1:~# pscp -f 20 /software/slurm/latest/etc/slurmd.service moran,teton:/etc/systemd/system/slurmd.service
root@tmgt1:~# psh -f 20 moran,teton systemctl stop slurmd.service
root@tmgt1:~# psh -f 20 moran,teton systemctl daemon-reload
root@tmgt1:~# psh -f 20 moran,teton systemctl start slurmd.service
root@tmgt1:~# psh -f 20 moran,teton "systemctl is-active slurmd.service" | xcoll
Compute Node Image
The RHEL image that is presently booted is t2019.08. The part of Slurm relevant to the compute node is the systemd service file. The compute.postinstall script copies this file into the appropriate location provided the symbolic link in the /software/slurm directory is correct (i.e., latest -> slurm-19.05.5). Then use the standard xCAT tools as the root user to generate the image (genimage) and to pack and compress the image (packimage).
powerman $ unlink /software/slurm/latest
powerman $ ln -s /software/slurm/slurm-19.05.5 /software/slurm/latest
powerman $ # Need to switch to root user
root@tmgt1:~# genimage t2019.08
root@tmgt1:~# packimage -c pigz -m tar t2019.08
Once this has been done, you can reboot one of the compute nodes for validation. However, make sure to upgrade the slurmdbd and slurmctld prior to doing these steps.
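For example, to reboot a single node onto the new image with xCAT and confirm the new slurmd version (the node name m001 is a placeholder):
root@tmgt1:~# nodeset m001 osimage=t2019.08
root@tmgt1:~# rpower m001 reset
root@tmgt1:~# # once the node is back up
root@tmgt1:~# psh m001 "slurmd -V"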
Slurm Validation
Version Checks
Check command versions ...
$ sinfo --version
$ squeue --version
$ sbatch --version
$ scontrol --version
Controller & Database Checks
Make sure scontrol, sacctmgr, sacct, sreport, and the job_submit.lua plugin work appropriately...
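A few quick checks along these lines (the time ranges and format fields are just examples, and the test job reuses the arcc account from the PMI check below):
$ scontrol show config | head
$ sacctmgr show cluster
$ sacctmgr show associations format=cluster,account,user | head
$ sacct --allusers --starttime=$(date -d '1 day ago' +%F) | head
$ sreport cluster utilization start=$(date -d '7 days ago' +%F)
$ sbatch --account=arcc --time=1 --wrap="hostname"   # exercises job_submit.lua on submission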
PMI Checks
Make sure Intel MPI launches appropriately...
Intel MPI attempts to use the PMI library when running under Slurm, which is good. New versions also support PMIx, which provides interfaces that are _backwards_ compatible with the PMI1 and PMI2 interfaces.
$ salloc --nodes=2 --account=arcc --time=20
$ echo $I_MPI_PMI_LIBRARY
$ srun ...
You shouldn't get any PMI KVS failures if everything is working properly.
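As a rough sketch, inside the allocation the PMI library can be pointed at the new Slurm installation and a test binary launched; the library path and the hello_mpi binary are placeholders, not verified paths:
$ export I_MPI_PMI_LIBRARY=/apps/s/slurm/19.05/lib/libpmi.so   # assumed path under the install prefix above
$ srun --mpi=pmi2 ./hello_mpi                                  # hello_mpi is a placeholder MPI test program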
Special Nodes
DGX Systems
Slurm used to be installed locally on the DGX system, since it was the only Ubuntu system and was treated as an _appliance_. However, once a second DGX system was procured, the installation was moved to an alternative GPFS directory, and a special symbolic link is maintained so the Ubuntu systems pick up the proper build versus the RHEL/CentOS systems.
The same installation process as above applies, but you only need to focus on the slurmd installation piece. See the _/software/slurm/jj.slurm.dgx_ file for directions. If you run into permission issues, use the appropriate mkdir, chown, and chgrp commands to give ownership of the installation directory to the powerman user.
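For instance (the host name and path are placeholders, and a powerman group of the same name is assumed):
root@dgx01:~# mkdir -p /software/slurm/dgx            # placeholder path for the DGX-side install
root@dgx01:~# chown -R powerman:powerman /software/slurm/dgx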
Users will need to add the libswitch-perl package via apt to get the torque wrappers working properly.
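For example (the host name is a placeholder):
root@dgx01:~# apt-get update
root@dgx01:~# apt-get install -y libswitch-perl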