Introduction

The Slurm page introduces the basics of creating a batch script that is submitted from the command line with the sbatch command to request a job on the cluster. This page extends that introduction and goes into a little more detail on the use of the following Slurm options:

  1. mem: used to request memory per node.

  2. mem-per-cpu: used to request memory per CPU core.

  3. gres: used to request GPUs.

A complete list of options can be found on the Slurm: sbatch manual page or by typing man sbatch from the command line when logged onto teton.

Aims: The aims of this page are to extend the user's knowledge of how to request memory and GPUs appropriately within batch scripts.

Note:

Please share with ARCC your experiences of various configurations for whatever applications you use, so we can share them with the wider UW (and beyond) research community.

Prerequisites: You are familiar with creating a basic batch script and submitting it with sbatch, as covered on the Slurm page.

Memory Allocation

Previously we've talked about nodes having a maximum number of cores that can be allocated; they also have a maximum amount of memory that can be requested and allocated. Looking at the RAM (GB) column on the Beartooth Hardware Summary page, you can see that the RAM available across partitions varies from 64GB up to 1024GB.
NOTE: Just because a node has 1024GB, please do not try to grab it all for your job. Remember: the more memory you request, the fewer nodes can satisfy the request and the longer your job is likely to wait in the queue.


Using the mem option you can request the memory required on a node.
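For context, here is a minimal sketch of a complete batch script using mem (the account name, walltime, job name, and program are placeholders you will need to change):

#!/bin/bash
#SBATCH --account=<your-account>
#SBATCH --time=01:00:00
#SBATCH --job-name=mem_example
#SBATCH --nodes=1
#SBATCH --mem=8G                 # request 8G of memory on the node

# Replace with the actual work you want to run.
srun ./my_program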

Options

Allocation

Comments

#SBATCH --nodes=1
#SBATCH --mem=8G

Allocated one node requiring 8G of memory.

Remember 1G = 1024M, so you have 8192M.

#SBATCH --nodes=1
#SBATCH --mem=8

Allocated one node requiring 8M of memory.

Megabytes is the default unit, so this requests only 8M.

#SBATCH --nodes=3
#SBATCH --mem=32G

Allocated three nodes, each requiring 32G

Each node is allocated the same amount of memory.

#SBATCH --nodes=1
#SBATCH --mem=132G
#SBATCH --partition=teton
sbatch: error: Memory specification can not be satisfied
sbatch: error: Batch job submission failed: Requested node configuration is not available

You cannot request more memory than is actually available on the node.
Teton nodes have a maximum of 128G available, and you must actually request less than that.

#SBATCH --nodes=1
#SBATCH --mem=125G
#SBATCH --partition=teton

One node allocated, requiring 125G.

On teton nodes this is the maximum that can be requested.

#SBATCH --nodes=1
#SBATCH --mem=126G
#SBATCH --partition=teton
sbatch: error: Memory specification can not be satisfied
sbatch: error: Batch job submission failed: Requested node configuration is not available

You're trying to request too much memory.
If you really need this amount of memory either select an appropriate partition, or remove the partition option and see what you get.

#SBATCH --nodes=1
#SBATCH --mem=120.1G
#SBATCH --partition=teton
sbatch: error: invalid memory constraint 120.1G

You must specify a whole number; decimal values are not accepted.

Using the seff jobid command you can check the amount of memory that was allocated versus actually used. Running the command will display something like the following on the command line:

Job ID: jobid
Cluster: teton
User/Group: userid/groupid
State: COMPLETED (exit code 0)
Cores: 1
CPU Utilized: 00:00:01
CPU Efficiency: 50.00% of 00:00:02 core-walltime
Job Wall-clock time: 00:00:02
Memory Utilized: 3.89 MB
Memory Efficiency: 48.58% of 8.00 MB
...
Memory Efficiency: 0.03% of 8.00 GB


Using the mem-per-cpu option you can request that each CPU (core) allocated to your job has this amount of memory available to it.

Remember that you need to check the overall total amount of memory you're trying to allocate on a node: calculate the total number of cores you're requesting on a node (ntasks-per-node * cpus-per-task) and then multiply that by mem-per-cpu, as sketched below.
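A quick worked sketch of that calculation (the values here are purely illustrative):

#SBATCH --ntasks-per-node=2
#SBATCH --cpus-per-task=4
#SBATCH --mem-per-cpu=8G
# Total cores on the node:  2 * 4 = 8
# Total memory on the node: 8 * 8G = 64G (this total must be less than the node's maximum)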

In the following examples, I am using the default ntasks-per-node value of 1:

Options

Total memory

Comments

#SBATCH --nodes=1
#SBATCH --cpus-per-task=8
#SBATCH --mem-per-cpu=8G
#SBATCH --partition=moran

8 * 8G = 64G

Some moran nodes have 128G available. Job submitted.

#SBATCH --cpus-per-task=8
#SBATCH --mem-per-cpu=12G
#SBATCH --partition=moran

8 * 12G = 96G

Job submitted.

#SBATCH --cpus-per-task=8
#SBATCH --mem-per-cpu=16G
#SBATCH --partition=moran

8 * 16G = 128G

sbatch: error: Batch job submission failed: Requested node configuration is not available

What happened here? We requested 128G, and don't the moran nodes have 128G? Yes, but your total memory allocation has to be less than what the node allows.

Options

Total memory

Comments

#SBATCH --cpus-per-task=16
#SBATCH --mem-per-cpu=8G
#SBATCH --partition=teton

16 * 8G = 128G

sbatch: error: Batch job submission failed: Requested node configuration is not available

Same problem as before: teton nodes have a maximum of 128G.

#SBATCH --cpus-per-task=32
#SBATCH --mem-per-cpu=3G
#SBATCH --partition=teton

32 * 3G = 96G

Job Submitted

#SBATCH --cpus-per-task=32
#SBATCH --mem-per-cpu=3.5G
#SBATCH --partition=teton

32 * 3.5G = 112G

sbatch: error: invalid memory constraint 3.5G

What happened here? Can't I request three and a half gigs? You can, but values have to be integers; you can't use a decimal number.
What you can do is convert from G into M. But remember that 1G does not equal 1000M, it equals 1024M. So 3.5G equals 3.5 * 1024 = 3584M.

Options

Total memory

Comments

#SBATCH --cpus-per-task=32
#SBATCH --mem-per-cpu=3584M
#SBATCH --partition=teton

32 * 3584M = 112G

Job Submitted

#SBATCH --cpus-per-task=32
#SBATCH --mem-per-cpu=4000M
#SBATCH --partition=teton

Less than 128G: 32 * 4000M = 128000M, just under 131072M (128G).

Job Submitted

#SBATCH --cpus-per-task=32
#SBATCH --mem-per-cpu=4096M
#SBATCH --partition=teton

Equals 128G: 32 * 4096M = 131072M.

sbatch: error: Batch job submission failed: Requested node configuration is not available
#SBATCH --ntasks-per-node=4
#SBATCH --cpus-per-task=8
#SBATCH --mem-per-cpu=4000M
#SBATCH --partition=teton

Less than 128G: 4 * 8 * 4000M = 128000M.

Job Submitted

#SBATCH --ntasks-per-node=4
#SBATCH --cpus-per-task=8
#SBATCH --mem-per-cpu=4096M
#SBATCH --partition=teton

Equals 128G: 4 * 8 * 4096M = 131072M.

sbatch: error: Batch job submission failed: Requested node configuration is not available

Note: Yes, this does contradict the mem option examples, where the request had to be no more than 125GB. This is due to the nuances of the Slurm allocation implementation.

Some Considerations

Shouldn't I always just request the best nodes? Consider the following: teton nodes have 32 cores and a maximum of 128G, so if you wanted exclusive use of a node and all of its cores, the most you could request for each core is about 4G. In comparison, the moran nodes with 128G have only 16 cores, and can therefore offer a higher maximum of about 8G per core. You could request two moran nodes (32 cores in total) with each core having close to 8G, rather than a single teton node with each core limited to 4G. This is a slightly contrived example, but hopefully it gets you thinking: the popular, newer nodes are not always the best option, and your job might actually be allocated resources more quickly rather than sitting in the queue. A sketch of the two-node alternative is shown below.
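A minimal sketch of that two-node moran alternative (note that, as the tables above show, each node's total must stay below its 128G maximum, so the per-core value here is rounded down slightly from 8G):

#SBATCH --nodes=2
#SBATCH --ntasks-per-node=16      # all 16 cores on each moran node
#SBATCH --mem-per-cpu=7500M       # just under 8G per core, keeping each node's total (120000M) below 128G
#SBATCH --partition=moran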

Out-of-Memory Errors: Although you can allocate 'appropriate' resources, there is nothing stopping the actual application (behind the scenes, so to speak) from trying to allocate and use more. In some cases the application will try to use more memory than is available on the node, causing an out-of-memory error. Check the job .out/.err files for a message of the form:

slurmstepd: error: Detected 2 oom-kill event(s) in step 3280189.0 cgroup. Some of your processes may have been killed by the cgroup out-of-memory handler.
srun: error: m121: task 32: Out Of Memory
srun: Terminating job step 3280189.0

For commercial applications there's nothing we can directly do, and even for open source software trying to track down the memory leak can be very time consuming.

Can we predict if this is going to happen? At this moment in time, no. But we can suggest that you check completed jobs with seff, and if a job is killed out-of-memory, resubmit it with a larger memory request or on a partition with higher-memory nodes.
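For example (a hypothetical sketch; the original value and how much to increase it will depend entirely on your application), if a job requesting 8G was killed out-of-memory, you might resubmit with:

#SBATCH --nodes=1
#SBATCH --mem=16G                # doubled from the original 8G request; tune this to your application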

Requesting GPUs

The first step in using GPUs is to understand which partition and what type of GPU hardware you want to use; see the Beartooth Hardware page.
Updated: 20230206

Combinations:

Example

Partition

Comments

#SBATCH --gres=gpu:1

blank

sbatch: error: Batch job submission failed: Requested node configuration is not available.
#SBATCH --gres=gpu:2

blank

sbatch: error: Batch job submission failed: Requested node configuration is not available.
#SBATCH --gres=gpu:p100:1

blank

sbatch: error: Batch job submission failed: Requested node configuration is not available.

With no partition defined, the job goes to the default partition, which has no GPU devices, so none of these three requests can be satisfied.
#SBATCH --gres=gpu:1
#SBATCH --partition=teton-gpu

teton-gpu

Allocates a single p100 on the teton-gpu partition.

#SBATCH --gres=gpu:3
#SBATCH --partition=teton-gpu

teton-gpu

sbatch: error: Batch job submission failed: Requested node configuration is not available

There are no nodes in the teton-gpu partition that have 3 GPUs: each node has only two GPU devices available.

#SBATCH --gres=gpu:2
#SBATCH --partition=beartooth-gpu

beartooth-gpu

Allocates two a30 devices on the beartooth-gpu partition.

#SBATCH --partition=dgx
#SBATCH --gres=gpu:4
#SBATCH --nodelist=tdgx01

dgx

GPU 0: Tesla V100-SXM2-32GB (UUID: GPU-b9ac5945-6494-eedd-795b-6eec42ab3e8c)
GPU 1: Tesla V100-SXM2-32GB (UUID: GPU-3143b8f5-a348-cce9-4ad4-91c01618d7fd)
GPU 2: Tesla V100-SXM2-32GB (UUID: GPU-5f01803f-6231-4241-41c9-8ca05dadf881)
GPU 3: Tesla V100-SXM2-32GB (UUID: GPU-f1646ebd-75a2-53a9-b9df-8b7fc51fc26c)
#SBATCH --partition=dgx
#SBATCH --gres=gpu:v100:4
#SBATCH --nodelist=mdgx01

dgx

GPU 0: Tesla V100-SXM2-16GB (UUID: GPU-64dc6369-4c36-824d-182c-8e8f9c33f587)
GPU 1: Tesla V100-SXM2-16GB (UUID: GPU-d4adb339-0dba-47db-e766-96b9cbc302b4)
GPU 2: Tesla V100-SXM2-16GB (UUID: GPU-f7637d82-f0c0-15e6-da23-21216b9b8f33)
GPU 3: Tesla V100-SXM2-16GB (UUID: GPU-5afe7813-6b87-b667-e0a3-8e04662357e8)

If you just want V100s, and are not concerned whether they are the 16GB or 32GB versions, then you do not need to define nodelist.
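For example, the following is the same dgx request as above, just without pinning a particular node:

#SBATCH --partition=dgx
#SBATCH --gres=gpu:v100:4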

Notes: Within your batch script you can check which GPU devices have been allocated to your job using:

echo "CUDA_VISIBLE_DEVICES:" $CUDA_VISIBLE_DEVICES
nvidia-smi -L

If you request more than one GPU, you'll get something of the form:

CUDA_VISIBLE_DEVICES: 0,1
GPU 0: NVIDIA A30 (UUID: GPU-b9614d02-bcc7-e75c-4c9c-ba3515f8c082)
GPU 1: NVIDIA A30 (UUID: GPU-4b8746a4-4f7f-93dc-0cd5-a8b166100bbd)

If no value appears for the environment variable, then no GPU has been allocated to your job.
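Putting these pieces together, a minimal GPU batch script might look like the following (the account name and walltime are placeholders you will need to change):

#!/bin/bash
#SBATCH --account=<your-account>
#SBATCH --time=01:00:00
#SBATCH --nodes=1
#SBATCH --cpus-per-task=1
#SBATCH --partition=teton-gpu
#SBATCH --gres=gpu:1                 # a single p100 on the teton-gpu partition

# Confirm the GPU allocation (same checks as shown above).
echo "CUDA_VISIBLE_DEVICES:" $CUDA_VISIBLE_DEVICES
nvidia-smi -L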

Interactive Jobs:

You can request GPUs via an interactive job:

salloc --account=<account> --time=01:00:00 -N 1 -c 1 --partition=teton-gpu --gres=gpu:1
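Once the session starts, you can verify the allocation with the same commands shown in the notes above (depending on configuration you may need to run them via srun on the allocated node):

echo "CUDA_VISIBLE_DEVICES:" $CUDA_VISIBLE_DEVICES
nvidia-smi -L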