  • Under the hood, the code uses the MASTER_PORT environment variable. The submission script sets it to a random number so that multiple jobs can run at the same time. If they all use the same value and run on the same compute node, this can cause errors.

  • Recording the various torch and cuda versions and capabilities confirms that you have the intended environment and helps with triaging issues.

  • Within the submission script, the nvidia-smi tool is run as a background process to monitor the utilization of the GPU and its memory. This uses only a few of the available options.

    • This uses the -l 1 option to loop the command every second.

    • It also demonstrates an alternative to running a long interactive desktop session via OnDemand, where users typically just watch the output of this command.

  • Within the submission script, export the NCCL_DEBUG=INFO environment variable to log NCCL-related activity. This is useful if you’re seeing NCCL-related issues.

  • Explicitly setting OMP_NUM_THREADS=1 removes related warnings.

  • For details on what the Python code is actually doing, view the related video from the link above.
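
The environment settings and GPU monitoring described in the notes above can be sketched as the following fragment. This is a minimal illustration, not the actual submission script: the scheduler directives and the training command are omitted, and the 20000-29999 port range and the gpu_usage.csv file name are assumptions made for the example.

```shell
#!/bin/bash
# Illustrative fragment of a submission script; scheduler directives omitted.

# Pick a random rendezvous port so jobs sharing a node do not collide.
# The 20000-29999 range is an assumption, not taken from the original script.
export MASTER_PORT=$(( 20000 + RANDOM % 10000 ))

# Log NCCL activity and avoid the OMP_NUM_THREADS warnings.
export NCCL_DEBUG=INFO
export OMP_NUM_THREADS=1
echo "MASTER_PORT=${MASTER_PORT} NCCL_DEBUG=${NCCL_DEBUG} OMP_NUM_THREADS=${OMP_NUM_THREADS}"

# Sample GPU utilization and memory once per second (-l 1) in the background.
# gpu_usage.csv is a hypothetical log file name.
MONITOR_PID=""
if command -v nvidia-smi >/dev/null 2>&1; then
    nvidia-smi --query-gpu=timestamp,utilization.gpu,memory.used,memory.total \
        --format=csv -l 1 > gpu_usage.csv 2>&1 &
    MONITOR_PID=$!
fi

# ... launch the training command here ...

# Stop the monitor once training has finished.
if [ -n "$MONITOR_PID" ]; then
    kill "$MONITOR_PID" 2>/dev/null || true
fi
```

Writing the samples to a file means GPU usage can be reviewed after the job ends, without keeping an interactive session open just to watch nvidia-smi.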