
The first step in using GPUs is to understand which Beartooth hardware partition and which type of GPU you want to use.
Updated: 20230206
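
To see which partitions advertise GPUs, and how many of each type, a quick Slurm query along the following lines can help (a sketch, not from the original page; the partition names and GRES strings are whatever sinfo reports on the cluster):

Code Block
# List partitions with their generic resources (GRES), node counts, and node lists; keep GPU rows and the header.
sinfo -o "%P %G %D %N" | grep -iE "gpu|partition"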

Combinations:

Example:
Code Block
#SBATCH --gres=gpu:1

Partition: (not defined)

Comments:
Code Block
sbatch: error: Batch job submission failed: Requested node configuration is not available.

Example:
Code Block
#SBATCH --gres=gpu:2

Partition: (not defined)

Comments:
Code Block
sbatch: error: Batch job submission failed: Requested node configuration is not available.

Example:
Code Block
#SBATCH --gres=gpu:p100:1

Partition: (not defined)

Comments:
Code Block
sbatch: error: Batch job submission failed: Requested node configuration is not available.

Example:
Code Block
#SBATCH --gres=gpu:1
#SBATCH --partition=teton-gpu

Partition: teton-gpu

Comments: Allocates a single p100 on the teton-gpu partition.

Example:
Code Block
#SBATCH --gres=gpu:3
#SBATCH --partition=teton-gpu

Partition: teton-gpu

Comments:
Code Block
sbatch: error: Batch job submission failed: Requested node configuration is not available

There are no nodes in the teton-gpu partition that have three GPUs; each node has only two GPU devices.

Example:
Code Block
#SBATCH --gres=gpu:2
#SBATCH --partition=beartooth-gpu

Partition: beartooth-gpu

Comments: Allocates two a30 devices on the beartooth-gpu partition.
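
To put a working row from the table above into a complete submission script, a minimal sketch might look like the following. The job name and the final srun command are illustrative placeholders, not part of the original examples; substitute your own account and program.

Code Block
#!/bin/bash
#SBATCH --account=<account>          # your project/account name
#SBATCH --job-name=gpu-test          # placeholder job name
#SBATCH --time=01:00:00
#SBATCH --nodes=1
#SBATCH --cpus-per-task=1
#SBATCH --partition=teton-gpu        # partition providing p100 GPUs (see the table above)
#SBATCH --gres=gpu:1                 # request a single GPU device

# Confirm the job is GPU enabled (see the Notes below).
echo "CUDA_VISIBLE_DEVICES:" $CUDA_VISIBLE_DEVICES
nvidia-smi -L

# Placeholder workload: replace with your own GPU program.
srun ./my_gpu_program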

Notes:

  • To request a specific type of GPU device you need to explicitly define the partition that the device can be found on.

  • To check that your submission is GPU enabled, and which type of GPU has been allocated, use the $CUDA_VISIBLE_DEVICES environment variable and nvidia-smi -L within your batch script:

    • Note: This environment variable is only set when submitting scripts using sbatch; it is not set within an interactive session started with salloc.

Code Block
echo "CUDA_VISIBLE_DEVICES:" $CUDA_VISIBLE_DEVICES
nvidia-smi -L

If you request more than one GPU, you'll get one line of output per allocated device. For example, requesting four GPUs on the dgx partition (described in the DGX notes below):

Code Block
#SBATCH --partition=dgx
#SBATCH --gres=gpu:4
#SBATCH --nodelist=tdgx01

Partition: dgx

Output:

Code Block
GPU 0: Tesla V100-SXM2-32GB (UUID: GPU-b9ac5945-6494-eedd-795b-6eec42ab3e8c)
GPU 1: Tesla V100-SXM2-32GB (UUID: GPU-3143b8f5-a348-cce9-4ad4-91c01618d7fd)
GPU 2: Tesla V100-SXM2-32GB (UUID: GPU-5f01803f-6231-4241-41c9-8ca05dadf881)
GPU 3: Tesla V100-SXM2-32GB (UUID: GPU-f1646ebd-75a2-53a9-b9df-8b7fc51fc26c)

If no value appears for the environment variable, then the job is not GPU enabled.
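
As a small defensive addition (not part of the original page), a batch script can stop early when that variable is empty, instead of silently running without a GPU:

Code Block
# Abort the job if the allocation is not GPU enabled (plain bash sketch).
if [ -z "$CUDA_VISIBLE_DEVICES" ]; then
    echo "No GPU allocated; check the --partition and --gres options." >&2
    exit 1
fi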

  • DGX nodes: These are on a special partition that requires access to be granted directly to user accounts or project accounts, and will likely require additional approval. There are two specific nodes:

    • mdgx01 using V100-SXM2-16GB

    • tdgx01 using V100-SXM2-32GB

    • To request these, first set the partition to dgx. Then, to explicitly request a specific node, use the nodelist option; without this option you could be allocated either node. The example above uses four GPUs on node tdgx01.

To ask for a specific type of GPU on a specific node, you'll need to define --nodelist with the name of the node:

Code Block
#SBATCH --partition=dgx
#SBATCH --gres=gpu:v100:4
#SBATCH --nodelist=mdgx01

Partition: dgx

Output:

Code Block
GPU 0: Tesla V100-SXM2-16GB (UUID: GPU-64dc6369-4c36-824d-182c-8e8f9c33f587)
GPU 1: Tesla V100-SXM2-16GB (UUID: GPU-d4adb339-0dba-47db-e766-96b9cbc302b4)
GPU 2: Tesla V100-SXM2-16GB (UUID: GPU-f7637d82-f0c0-15e6-da23-21216b9b8f33)
GPU 3: Tesla V100-SXM2-16GB (UUID: GPU-5afe7813-6b87-b667-e0a3-8e04662357e8)

If you just want the V100s, and are not concerned whether it's the 16GB or 32GB model, then you do not need to define nodelist.
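
As a sketch of that last point (the GPU count here is illustrative, not from the original page), requesting V100s by type alone lets Slurm place the job on either DGX node:

Code Block
#SBATCH --partition=dgx
#SBATCH --gres=gpu:v100:2      # any V100 devices; either mdgx01 (16GB) or tdgx01 (32GB) may be allocated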


As another example, requesting two GPUs on the beartooth-gpu partition (the a30 example in the table above) gives output of the form:

Code Block
CUDA_VISIBLE_DEVICES: 0,1
GPU 0: NVIDIA A30 (UUID: GPU-b9614d02-bcc7-e75c-4c9c-ba3515f8c082)
GPU 1: NVIDIA A30 (UUID: GPU-4b8746a4-4f7f-93dc-0cd5-a8b166100bbd)

Additionally, bash jobs on the DGX nodes need to be submitted using sbatch_dgx.

Interactive Jobs:

You can request GPUs via an interactive job:

Code Block
salloc --account=<account> --time=01:00:00 -N 1 -c 1 --partition=teton-gpu --gres=gpu:1
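
Since CUDA_VISIBLE_DEVICES is not set through salloc (see the Notes above), one way to confirm that the interactive allocation really includes a GPU is to launch nvidia-smi as a job step inside it. This is a sketch, not part of the original page:

Code Block
# Run inside the salloc session; the step executes on the allocated GPU node.
srun --gres=gpu:1 nvidia-smi -L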

Summary
