...
The code uses the `torchrun` functionality, which works with Slurm to set up and run across multiple nodes. So that you don't have to hard-code values, Slurm-related environment variables are used within the `torchrun` command: the `--nnodes` option is set to `"$SLURM_JOB_NUM_NODES"`, and the `--nproc_per_node` option is set to `"$SLURM_GPUS_ON_NODE"`. Notice that each node only runs one task, but this task has multiple GPUs.
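As a minimal sketch (not the actual script), the relevant part of the submission script might look like the following. The node and GPU counts, the rendezvous options, and the script name `train.py` are illustrative assumptions, and `MASTER_ADDR`/`MASTER_PORT` are assumed to be set earlier in the script (`MASTER_PORT` is discussed below):

```bash
#SBATCH --nodes=2               # illustrative node count
#SBATCH --ntasks-per-node=1     # one task per node; torchrun starts one process per GPU
#SBATCH --gpus-per-node=4       # illustrative GPU count

# Launch one torchrun per node, using Slurm variables instead of hard-coded values
srun torchrun \
    --nnodes="$SLURM_JOB_NUM_NODES" \
    --nproc_per_node="$SLURM_GPUS_ON_NODE" \
    --rdzv_id="$SLURM_JOB_ID" \
    --rdzv_backend=c10d \
    --rdzv_endpoint="$MASTER_ADDR:$MASTER_PORT" \
    train.py
```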
Randomly assign the `MASTER_PORT` environment variable within the submission script to allow multiple jobs to potentially run on the same compute node. Recording the various `torch` and `cuda` versions and capabilities confirms that you have the intended environment and helps with triaging issues.
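A sketch of how both of these could look in the submission script; the port range and the particular version checks printed here are assumptions, not the actual script:

```bash
# Derive the master address from the first allocated node, and pick a
# pseudo-random port so that jobs sharing a compute node are unlikely to
# collide (the port range is an arbitrary illustrative choice)
export MASTER_ADDR=$(scontrol show hostnames "$SLURM_JOB_NODELIST" | head -n 1)
export MASTER_PORT=$(( 10000 + RANDOM % 20000 ))

# Record the torch/CUDA versions and GPU visibility for later triage
python - <<'EOF'
import torch
print("torch version:", torch.__version__)
print("cuda version:", torch.version.cuda)
print("cuda available:", torch.cuda.is_available())
print("device count:", torch.cuda.device_count())
EOF
```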
Within the submission script we use `srun` to start background tasks, one for each node, that use `nvidia-smi` to monitor the utilization of the GPU and its memory - this uses only a few of the available options. The `srun` command uses the `--overlap` option to allow these sub-tasks to use the same resources allocated to the job, and thus be able to detect the allocated GPUs; without this option, the `srun` would block the remaining elements within the submission script. The `-l 1` option loops the command every second. This also demonstrates an alternative to running a long interactive desktop via OnDemand, where users typically just watch the output of this command. These background tasks will create an individual file for each node used.
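A sketch of what such a monitoring step might look like; the `nvidia-smi` query fields and the per-node output file name are illustrative choices rather than the actual script:

```bash
# One background monitoring step per node; --overlap lets these steps share the
# job's existing allocation (and therefore see the allocated GPUs)
srun --overlap --ntasks="$SLURM_JOB_NUM_NODES" --ntasks-per-node=1 \
    bash -c 'nvidia-smi --query-gpu=timestamp,utilization.gpu,memory.used --format=csv -l 1 > "gpu_usage_$(hostname).csv"' &
```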
Within the submission script, export the `NCCL_DEBUG=INFO` environment variable to log NCCL-related functionality - useful if you're seeing related issues. Explicitly setting `OMP_NUM_THREADS=1` removes related warnings.
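In the submission script these are just two exports, for example:

```bash
# Log NCCL activity - useful when debugging multi-node communication issues
export NCCL_DEBUG=INFO
# Silence the OpenMP thread-count warning from the launched processes
export OMP_NUM_THREADS=1
```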
For details and an understanding of what the actual Python code is doing, read the link above.
...