Slurm submission script

#!/bin/bash

#SBATCH --job-name=pyt-multi-gpu
#SBATCH --account=<project-name>
#SBATCH --time=5:00
#SBATCH --nodes=1
#SBATCH --cpus-per-task=2
#SBATCH --mail-type=ALL
#SBATCH --mail-user=<email-address>
#SBATCH --output=pyt_multi_gpu_%A.out
#SBATCH --partition=<gpu-partition>
#SBATCH --gres=gpu:<num-of-gpus>

export OMP_NUM_THREADS=1
# Uncomment for NCCL related logging.
# export NCCL_DEBUG=INFO

echo "SLURM_JOB_ID:" $SLURM_JOB_ID
echo "SLURM_JOB_NODELIST:" $SLURM_JOB_NODELIST
echo "SLURM_GPUS:" $SLURM_GPUS
echo "SLURM_GPUS_ON_NODE:" $SLURM_GPUS_ON_NODE
echo "SLURM_JOB_GPUS:" $SLURM_JOB_GPUS
echo "CUDA_VISIBLE_DEVICES:" $CUDA_VISIBLE_DEVICES

# Environment variable used within the Python code.
# Derives a per-job port from the last four digits of the job ID so that
# different jobs can run on the same compute node without a port clash.
export MASTER_PORT=$(expr 10000 + $(echo -n $SLURM_JOB_ID | tail -c 4))
echo "MASTER_PORT="$MASTER_PORT

# List GPU devices allocated.
nvidia-smi -L

module purge
module load miniconda3/<version>
conda activate <path-to-conda-environment>

# Monitor GPU usage
# Run this process in the background.
nvidia-smi \
--query-gpu=timestamp,count,gpu_name,gpu_uuid,utilization.gpu,utilization.memory,memory.total,memory.reserved,memory.used,memory.free,temperature.gpu,temperature.memory \
--format=csv -l 1 \
> gpu_usage.csv &
echo "Writing nvidia-smi to: gpu_usage.csv"

python multi_gpu.py 50 10

echo "Done."

Code/Script Comments:

Under the hood, the code uses the MASTER_PORT environment variable. The submission script sets it to a job-specific value (10000 plus the last four digits of the Slurm job ID) so that multiple jobs can run at the same time. If they all used the same value and ran on the same compute node, the port clash could cause errors.
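As a rough illustration of the port handling described above, the sketch below mirrors the script's shell arithmetic in Python. The function names `port_from_job_id` and `resolve_master_port` and the fallback port 29500 are assumptions made for this example; they are not taken from multi_gpu.py.

```python
import os


def port_from_job_id(job_id: str, base: int = 10000) -> int:
    """Mirror the script's shell arithmetic: base + last four digits of the job ID."""
    return base + int(job_id[-4:])


def resolve_master_port(default: int = 29500) -> int:
    """Pick a port: exported MASTER_PORT first, then SLURM_JOB_ID, then a default.

    The default value is an assumption for this sketch.
    """
    if "MASTER_PORT" in os.environ:
        return int(os.environ["MASTER_PORT"])
    job_id = os.environ.get("SLURM_JOB_ID", "")
    if job_id.isdigit():
        return port_from_job_id(job_id)
    return default


if __name__ == "__main__":
    print("MASTER_PORT would be:", resolve_master_port())
```

Because only the last four digits are used, the port always falls between 10000 and 19999, and two jobs whose IDs end in the same four digits would still collide if they shared a node.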
Recording the various torch and CUDA versions and capabilities confirms that you have the intended environment and helps with triaging issues.
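One possible way to record this information is sketched below. The helper name `environment_report` and the layout of the returned dictionary are hypothetical; the actual multi_gpu.py may log these details differently.

```python
def environment_report() -> dict:
    """Collect torch and CUDA details; returns {"torch": None} if torch is missing."""
    try:
        import torch
    except ImportError:
        return {"torch": None}

    report = {
        "torch": torch.__version__,
        # CUDA version torch was built against (None for CPU-only builds).
        "cuda_build": torch.version.cuda,
        "cuda_available": torch.cuda.is_available(),
        "devices": [],
    }
    if report["cuda_available"]:
        for index in range(torch.cuda.device_count()):
            report["devices"].append(
                {
                    "name": torch.cuda.get_device_name(index),
                    # (major, minor) compute capability of the device.
                    "capability": torch.cuda.get_device_capability(index),
                }
            )
    return report


if __name__ == "__main__":
    for key, value in environment_report().items():
        print(f"{key}: {value}")
```

Printing this at the start of a job puts the environment details in the same output file as any later error messages.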
Within the submission script, the nvidia-smi tool runs as a background process to monitor the utilization of the GPU and its memory. The query uses only a few of the available options.
- The -l 1 option loops the command every second.
- It also demonstrates an alternative to running a long interactive desktop session via OnDemand, where users typically just watch the output of this command.
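Once the job has finished, the CSV written by nvidia-smi can be summarised offline. The sketch below is one way to find the peak GPU utilization per device. The function name `peak_gpu_utilization` and the `SAMPLE` rows are made up for illustration, and the header names are assumed to start with `uuid` and `utilization.gpu`, so check them against your own file.

```python
import csv
import io


def peak_gpu_utilization(csv_text: str) -> dict:
    """Return the highest utilization.gpu reading seen for each GPU UUID."""
    # nvidia-smi separates fields with ", " so skip the space after each comma.
    reader = csv.reader(io.StringIO(csv_text), skipinitialspace=True)
    header = next(reader)
    # Match on name fragments so unit suffixes such as " [%]" do not matter.
    uuid_col = next(i for i, name in enumerate(header) if "uuid" in name)
    util_col = next(
        i for i, name in enumerate(header) if name.startswith("utilization.gpu")
    )
    peaks = {}
    for row in reader:
        if len(row) <= max(uuid_col, util_col):
            continue  # skip blank or truncated lines
        try:
            value = float(row[util_col].split()[0])  # "35 %" -> 35.0
        except (ValueError, IndexError):
            continue  # skip readings that are not numeric
        peaks[row[uuid_col]] = max(value, peaks.get(row[uuid_col], 0.0))
    return peaks


# Hypothetical sample rows, reduced to three columns for brevity.
SAMPLE = """timestamp, uuid, utilization.gpu [%]
2024/01/01 12:00:00.000, GPU-aaaa, 35 %
2024/01/01 12:00:00.000, GPU-bbbb, 10 %
2024/01/01 12:00:01.000, GPU-aaaa, 90 %
"""

if __name__ == "__main__":
    print(peak_gpu_utilization(SAMPLE))
```

For a real run, read the file contents first, for example `peak_gpu_utilization(open("gpu_usage.csv").read())`.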
Within the submission script, uncomment export NCCL_DEBUG=INFO to log NCCL-related functionality, which is useful if you're seeing related issues.
Explicitly setting OMP_NUM_THREADS=1 removes related warnings.
For details of what the Python code does, see the related video from the link above.
