...
Under the hood, the code uses the MASTER_PORT environment variable. The submission script sets this to a random number so that multiple jobs can run at the same time. If they all use the same value and run on the same compute node, this can cause errors.
Recording the various torch and cuda versions and capabilities confirms that you have the intended environment and helps with triaging issues.
Within the submission script, the nvidia-smi tool is run as a background process to monitor the utilization of the GPU and its memory. This uses only a few of the available options.
This uses the -l 1 option to loop the command every second.
It also demonstrates an alternative to running a long interactive desktop session via OnDemand, where users typically just watch the output of this command.
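The background monitoring described above could look roughly like the following. This is a minimal sketch rather than the actual submission script: the log file name gpu_usage.log, the chosen query fields, and the sleep placeholder standing in for the training command are all assumptions made for illustration.

```shell
#!/bin/bash
# Sketch: log GPU utilization in the background while the main job runs.
# The log file name and the sleep placeholder are illustrative assumptions.

LOG_FILE="gpu_usage.log"
MONITOR_PID=""

if command -v nvidia-smi >/dev/null 2>&1; then
    # -l 1 repeats the query every second until the process is stopped.
    nvidia-smi --query-gpu=timestamp,utilization.gpu,memory.used,memory.total \
               --format=csv -l 1 > "$LOG_FILE" 2>&1 &
    MONITOR_PID=$!
else
    echo "nvidia-smi not found; skipping GPU monitoring" > "$LOG_FILE"
fi

# Placeholder for the real training command (hypothetical).
sleep 2

# Stop the monitor once the main work has finished.
if [ -n "$MONITOR_PID" ]; then
    kill "$MONITOR_PID" 2>/dev/null || true
    wait "$MONITOR_PID" 2>/dev/null || true
fi

echo "GPU monitoring log written to $LOG_FILE"
```

Writing to a log file means the utilization history is still available after the job ends, without keeping an interactive session open to watch it.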
Within the submission script, export the NCCL_DEBUG=INFO environment variable to log NCCL-related functionality, which is useful if you’re seeing related issues.
Explicitly setting OMP_NUM_THREADS=1 removes related warnings.
For details on what the Python code is actually doing, watch the related video from the link above.
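The environment settings discussed above could be sketched as follows. This is an illustration under stated assumptions, not the actual submission script: the port range 20000-29999 is an arbitrary choice, and the version report is just one possible way to record the torch and CUDA details from the shell.

```shell
#!/bin/bash
# Sketch of the environment settings discussed above.
# The port range is an assumption, not taken from the real script.

# Pick a random port in 20000-29999 so that concurrent jobs sharing a
# compute node are unlikely to choose the same one.
export MASTER_PORT=$(( 20000 + RANDOM % 10000 ))

# Ask NCCL to log its setup and communication details.
export NCCL_DEBUG=INFO

# Use one OpenMP thread per process.
export OMP_NUM_THREADS=1

echo "MASTER_PORT=$MASTER_PORT NCCL_DEBUG=$NCCL_DEBUG OMP_NUM_THREADS=$OMP_NUM_THREADS"

# Record torch and CUDA details, but only if PyTorch can be imported here.
if python3 -c "import torch" >/dev/null 2>&1; then
    python3 -c "
import torch
print('torch', torch.__version__)
print('cuda (build)', torch.version.cuda)
print('cuda available', torch.cuda.is_available())
if torch.cuda.is_available():
    print('device', torch.cuda.get_device_name(0))
    print('capability', torch.cuda.get_device_capability(0))
"
else
    echo "PyTorch not importable here; skipping version report"
fi
```

A random port only makes a collision unlikely, not impossible, so a port-in-use error on a busy node may simply call for resubmitting the job.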