...

  1. GCC

  2. readline(-devel)

  3. MariaDB(-devel)

  4. Perl(-devel)

  5. lua(-devel)

  6. cURL (curl & libcurl(-devel))

  7. JSON (json-c(-devel))

  8. munge(-devel)

Code Block
languagebash
[root@tmgt1 ~]# yum -y install \
  gcc \
  readline readline-devel \
  mariadb mariadb-devel \
  perl perl-devel \
  lua lua-devel \
  curl libcurl libcurl-devel \
  json-c json-c-devel \
  munge munge-devel munge-libs

...

PMIx is used to exchange information about the communications and launch of parallel applications (i.e., mpirun, srun, etc.). The PMIx implementation is a launcher that prefers to communicate in conjunction with the job scheduler rather than using older RSH/SSH methods. At high node counts, application start-up time can also be reduced significantly compared to the older ring start-up or even the Hydra implementation. The version presently installed was built as an RPM and inserted into the images or installed via a repo; the version in EPEL is too old.

Code Block
languagebash
powerman $ rpmbuild ...
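
Once Slurm has been built against this PMIx (see the build steps further down), a quick sanity check is to list the MPI plugin types Slurm registered; this is a verification sketch, not part of the install:

Code Block
languagebash
powerman $ srun --mpi=list    # pmix should appear among the available MPI plugin types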

UCX

UCX is presently installed as an RPM, which is also a fine way to go, but instructions for compiling from source are below. Instructions for building the RPM will be added as well. The version currently installed is NOT from EPEL; that version was too old.

Code Block
languagebash
powerman $ wget https://github.com/openucx/ucx/releases/download/v1.6.1/ucx-1.6.1.tar.gz

powerman $ tar xf ucx-1.6.1.tar.gz

powerman $ cd ucx-1.6.1

powerman $ ./configure \
  --prefix=/apps/s/ucx/1.6.1 \
  --enable-devel-headers \
  --with-verbs \
  --with-rdmacm \
  --with-rc \
  --with-ud \
  --with-dc \
  --with-dm

powerman $ make -j8

powerman $ make install
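
To sanity-check the installation, UCX ships the ucx_info utility; the flags below are standard UCX options:

Code Block
languagebash
powerman $ /apps/s/ucx/1.6.1/bin/ucx_info -v    # print the UCX version and build configuration

powerman $ /apps/s/ucx/1.6.1/bin/ucx_info -d    # list the transports and devices UCX detected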

RPM Build

Code Block
languagebash
TODO

HDF5

Compile HDF5 from source and install it into a systems directory. This is not to be confused with the user-accessible HDF5 installations, which may have additional dependencies such as Intel compilers or an MPI implementation.

Code Block
languagebash
powerman $ cd /software/slurm

powerman $ tar xf hdf5-1.10.5.tar.bz2

powerman $ cd hdf5-1.10.5

powerman $ ./configure --prefix=/apps/s/hdf5/1.10.5

powerman $ make -j4

powerman $ make install
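
Since the Slurm configure step later points at the h5cc wrapper, it's worth a quick check that the wrapper works; -showconfig is a standard h5cc flag:

Code Block
languagebash
powerman $ /apps/s/hdf5/1.10.5/bin/h5cc -showconfig | head    # summarize the HDF5 build configuration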

If you keep track of ABI compatibility (you're a sysadmin, you should), you may want to create a symbolic link to the "latest" release in the parent of the installation directory, as shown below.

Code Block
languagebash
powerman $ cd /apps/s/hdf5

powerman $ ln -s 1.10.5 latest

...

Use the ultra-stable version of hwloc and install it in a global location. Users can use this installation if needed, but a separate user-facing hwloc installation can also be maintained. This installation specifically supports the cgroup handling within the system and is used by Slurm.

Code Block
languagebash
powerman $ cd /software/slurm

powerman $ tar xf hwloc-1.11.13.tar.bz2

powerman $ cd hwloc-1.11.13

powerman $ ./configure --prefix=/apps/s/hwloc/1.11.13

powerman $ make -j4

powerman $ make install
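
hwloc installs the lstopo utilities, which give a quick check that the build can read the machine topology:

Code Block
languagebash
powerman $ /apps/s/hwloc/1.11.13/bin/lstopo --version    # confirm the installed version

powerman $ /apps/s/hwloc/1.11.13/bin/lstopo-no-graphics    # print the node topology as text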

If you keep track of ABI compatibility (you're a sysadmin, you should), you may want to create a symbolic link to the "latest" release in the parent of the installation directory, as shown below.

Code Block
languagebash
powerman $ cd /apps/s/hwloc

powerman $ ln -s 1.11.13 latest

...

Assuming the downloaded tarball is 'slurm-19.05.5.tar.bz2', follow the instructions below. Note that the configure step references the "latest" symbolic links for the HDF5 and hwloc libraries created above.

Code Block
languagebash
powerman $ cd /software/slurm

powerman $ tar xf slurm-19.05.5.tar.bz2

powerman $ cd slurm-19.05.5

powerman $ ./configure \
  --prefix=/apps/s/slurm/19.05.5 \
  --with-hdf5=/apps/s/hdf5/latest/bin/h5cc \
  --with-hwloc=/apps/s/hwloc/latest

powerman $ make -j8

powerman $ make install
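
To confirm the build linked against the intended HDF5 and hwloc installations rather than any system copies, inspect the shared-library dependencies; acct_gather_profile_hdf5.so is the standard Slurm HDF5 profiling plugin:

Code Block
languagebash
powerman $ ldd /apps/s/slurm/19.05.5/sbin/slurmd | grep hwloc

powerman $ ldd /apps/s/slurm/19.05.5/lib/slurm/acct_gather_profile_hdf5.so | grep hdf5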

Additional Features / Utilities

Code Block
languagebash
powerman $ cd contribs
powerman $ make

powerman $ for i in lua openlava perlapi seff sjobexit torque;
do
  cd $i
  make install
  cd -
done

PAM libraries are special: if you want to use them, they must go in a specific node-local location, generally /usr/lib64/security/. Consult your PAM manual pages to understand the implications of this before deploying them.

Code Block
languagebash
powerman $ for i in pam pam_slurm_adopt;
do
  cd $i
  make    # build only; the module is deployed node-locally rather than via make install
  cd -
done
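
A deployment sketch follows; the .libs path is the usual libtool build location and is an assumption (adjust to your tree), and the pam.d stanza is the standard pam_slurm_adopt usage:

Code Block
languagebash
root@tmgt1:~# pscp -f 20 /software/slurm/slurm-19.05.5/contribs/pam_slurm_adopt/.libs/pam_slurm_adopt.so moran,teton:/usr/lib64/security/

root@tmgt1:~# # then reference it from the sshd stack on the nodes, e.g. in /etc/pam.d/sshd:
root@tmgt1:~# #   account    required     pam_slurm_adopt.so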

...

Final Installation Process

Code Block
languagebash
powerman $ unlink /apps/s/slurm/19.05

powerman $ ln -s /apps/s/slurm/19.05.5 /apps/s/slurm/19.05

powerman $ ln -s /apps/s/slurm/etc /apps/s/slurm/19.05/etc

powerman $ ln -s /apps/s/slurm/var /apps/s/slurm/19.05/var
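
A quick check that the link chain resolves as intended:

Code Block
languagebash
powerman $ ls -l /apps/s/slurm/19.05 /apps/s/slurm/19.05/etc /apps/s/slurm/19.05/var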

...

The munge daemon is used to validate messages between Slurm daemons to ensure that users are who they claim to be. Realistically, this only needs to be configured once, at the first installation of Slurm; the same munge key is then reused on subsequent updates unless your security posture demands rotation. To generate a decent munge key, use dd with either the /dev/random or /dev/urandom generator.

Code Block
languagebash
root@tmgt1:~# dd if=/dev/random bs=1 count=1024 >/etc/munge/munge.key

root@tmgt1:~# chown munge:munge /etc/munge/munge.key

root@tmgt1:~# chmod 600 /etc/munge/munge.key

You then need to place the munge key, and run the munge daemon, on every server where you intend to run Slurm commands.
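
As a sketch using the xCAT parallel tools and the node groups that appear later in this page (moran, teton; adjust to your site):

Code Block
languagebash
root@tmgt1:~# pscp -f 20 /etc/munge/munge.key moran,teton:/etc/munge/munge.key

root@tmgt1:~# psh -f 20 moran,teton "chown munge:munge /etc/munge/munge.key; chmod 600 /etc/munge/munge.key"

root@tmgt1:~# psh -f 20 moran,teton systemctl restart munge.service

root@tmgt1:~# munge -n | ssh tdb1 unmunge    # classic end-to-end credential check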

...

Then you'll need to create the slurm database and set up the appropriate user and password for slurmdbd to communicate with the MariaDB server. In this case the communication should happen over localhost, so configure the credentials accordingly.
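
A minimal sketch of the database and grant, assuming the conventional database name slurm_acct_db and user slurm (pick a real password and match it in slurmdbd.conf):

Code Block
languagebash
[root@tdb1 ~]# mysql -u root -p

MariaDB [(none)]> create database slurm_acct_db;

MariaDB [(none)]> grant all on slurm_acct_db.* to 'slurm'@'localhost' identified by 'CHANGEME';

MariaDB [(none)]> flush privileges;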

...

It's very important to Slurm that upgrades happen in a specific order, so that continuous service is provided by the backward-compatible communication scheme in which older clients talk to newer servers. Specifically, the ordering is as follows:

...

The Slurm database is a critical component in the ARCC infrastructure, and keeping accounts and investorships in line relies extensively on this database being active. Therefore, it's quite important to perform a backup of the database before attempting an upgrade; use the normal MySQL/MariaDB backup capability to accomplish this. Also be aware that ARCC does not prune the database so far, but that may become an issue later if more high-throughput computing is introduced.
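
The "## PERFORM DB BACKUP ##" step in the block below can be satisfied with a plain mysqldump; the database name slurm_acct_db is an assumption (adjust to your site):

Code Block
languagebash
[root@tdb1 ~]# mysqldump -u root -p slurm_acct_db > /root/slurm_acct_db-$(date +%F).sql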

Code Block
languagebash
[root@tmgt1]# ssh tdb1

[root@tdb1]# systemctl stop slurmdbd.service

[root@tdb1]# ## PERFORM DB BACKUP ##

[root@tdb1]# install -m 0644 /software/slurm/slurm-19.05.5/etc/slurmdbd.service /etc/systemd/system/slurmdbd.service

[root@tdb1]# systemctl daemon-reload

[root@tdb1]# su - slurm

bash-4.2$ cd /tmp

bash-4.2$ /apps/s/slurm/19.05.5/sbin/slurmdbd -D -vvv

Wait for slurmdbd to finish making the necessary database changes and resume normal operation. Once the changes are done, press Ctrl-C to interrupt the process:

Code Block
languagebash
bash-4.2$ ^C

[root@tdb1]# systemctl start slurmdbd.service

[root@tdb1]# exit

...

Make sure the database has been properly updated prior to upgrading the controller.
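
A quick way to confirm slurmdbd is serving requests before touching the controller:

Code Block
languagebash
[root@tmgt1 ~]# sacctmgr show cluster    # should return the cluster record(s) without error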

Code Block
languagebash
[root@tmgt1 ~]# systemctl stop slurmctld.service

[root@tmgt1 ~]# install -m 0644 /software/slurm/slurm-19.05.5/etc/slurmctld.service /etc/systemd/system/slurmctld.service

[root@tmgt1 ~]# systemctl daemon-reload

[root@tmgt1 ~]# systemctl start slurmctld.service

...

If you don't want to reboot nodes because it would take too long, or for some other reason, they can be updated live. This isn't a perfect method, but it should work.

Code Block
languagebash
root@tmgt1:~# pscp -f 20 /software/slurm/latest/etc/slurmd.service moran,teton:/etc/systemd/system/slurmd.service

root@tmgt1:~# psh -f 20 moran,teton systemctl stop slurmd.service

root@tmgt1:~# psh -f 20 moran,teton systemctl daemon-reload

root@tmgt1:~# psh -f 20 moran,teton systemctl start slurmd.service

root@tmgt1:~# psh -f 20 moran,teton "systemctl is-active slurmd.service" | xcoll

...

The image presently booted on the RHEL nodes is t2019.08. The part of Slurm relevant to the compute node is the systemd service file. The compute.postinstall script copies this file into the appropriate location, provided the symbolic link in the /software/slurm directory is correct (i.e., latest -> slurm-19.05.5). Then use the standard xCAT tools as the root user to generate the image (genimage) and to pack and compress the image (packimage).

Code Block
languagebash
powerman $ unlink /software/slurm/latest

powerman $ ln -s /software/slurm/slurm-19.05.5 /software/slurm/latest

powerman $ # Need to switch to root user

root@tmgt1:~# genimage t2019.08

root@tmgt1:~# packimage -c pigz -m tar t2019.08

...

Check the command versions to confirm the clients report the new release:

Code Block
languagebash
$ sinfo --version

$ squeue --version

$ sbatch --version

$ scontrol --version

Controller & Database Checks

...

Intel MPI attempts to use the PMI library when running with Slurm, which is good. Newer versions also support PMIx, which provides interfaces that are backwards compatible with the PMI1 and PMI2 interfaces.

Code Block
languagebash
$ salloc --nodes=2 --account=arcc --time=20

$ echo $I_MPI_PMI_LIBRARY

$ srun ...

You shouldn't get any PKVS failures if everything is working properly.
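
If $I_MPI_PMI_LIBRARY comes back empty, a common site configuration is to point Intel MPI at Slurm's PMI library; the path below assumes the install prefix used throughout this page and is a sketch:

Code Block
languagebash
$ export I_MPI_PMI_LIBRARY=/apps/s/slurm/19.05/lib/libpmi.so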

...