Slurm: Workflows and Best Practices
Goal: Discuss what workflows can look like, being a good cluster citizen, and some best practices.
- 1 Default Resources
- 2 If you don’t ask, you don’t get: GPU Example
- 3 Modules and using salloc and sbatch
- 4 Modules and using salloc and sbatch: Best Practice
- 5 Track Your Job IDs
- 6 What does a general workflow look like?
- 7 What does it mean for an application to be parallel?
- 8 What does it mean for an application to be GPU enabled?
- 9 How can I be a good cluster citizen?
- 10 Being a good Cluster Citizen: Requesting Resources
- 11 Submitting Useful Tickets via the Portal
Default Resources
When you perform an salloc / sbatch you will be provided with a default resource allocation if you do not explicitly request something. This will be:
one node.
one task per node.
one core per task.
no GPU.
default memory (this can be different depending on the partition).
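As a quick check, you can inspect from inside an allocation what Slurm actually granted. A minimal sketch (the account and time values are placeholders; field names can vary slightly between Slurm versions):
# Request an allocation with no explicit resources, so the defaults apply.
salloc -A <project-name> -t 10:00
# From inside the allocation, show what Slurm actually granted.
scontrol show job $SLURM_JOB_ID | grep -E 'NumNodes|NumCPUs|NumTasks|TRES'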
If you don’t ask, you don’t get: GPU Example
Let’s look at an example where we want to use a GPU device on a particular partition.
#!/bin/bash
#SBATCH --account=<project-name>
#SBATCH --time=1:00
#SBATCH --reservation=<reservation-name>
#SBATCH --partition=mb-l40s
#SBATCH --gres=gpu:1
echo "SLURM_JOB_ID:" $SLURM_JOB_ID
echo "SLURM_GPUS_ON_NODE:" $SLURM_GPUS_ON_NODE
echo "SLURM_JOB_GPUS:" $SLURM_JOB_GPUS
echo "CUDA_VISIBLE_DEVICES:" $CUDA_VISIBLE_DEVICES
nvidia-smi -L
# Output:
SLURM_JOB_ID: 13517905
SLURM_GPUS_ON_NODE: 1
SLURM_JOB_GPUS: 0
CUDA_VISIBLE_DEVICES: 0
GPU 0: NVIDIA L40S (UUID: GPU-29a5b03e-e8f0-972b-6ae8-be4b3afe4ee0)
The SLURM_JOB_GPUS and CUDA_VISIBLE_DEVICES values represent the index (or indices) of the allocated GPU device(s); a value of 0 does not mean zero GPUs were allocated.
For example, if we used --gres=gpu:2 we would see something of the form:
SLURM_GPUS_ON_NODE: 2
SLURM_JOB_GPUS: 0,1
CUDA_VISIBLE_DEVICES: 0,1
GPU 0: NVIDIA L40S (UUID: GPU-4b274738-2abf-c818-ff97-d7548c769276)
GPU 1: NVIDIA L40S (UUID: GPU-dfab908b-ccd9-27ab-5856-26a46cf6f89e)
If you don’t ask, you don’t get: No GPU device requested
What happens if you don’t explicitly ask for a GPU device?
# Comment out the gres option.
##SBATCH --gres=gpu:1
# Output:
SLURM_JOB_ID: 13517906
SLURM_GPUS_ON_NODE:
SLURM_JOB_GPUS:
CUDA_VISIBLE_DEVICES:
No devices found.
Just because a partition/compute node has a particular resource does not mean your job gets it: you still need to explicitly request it.
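The same applies to interactive sessions. A minimal sketch (the project, partition, and time values are placeholders to adapt):
# Explicitly request one GPU for an interactive session.
salloc -A <project-name> -t 1:00:00 -p mb-l40s --gres=gpu:1
# Confirm the device is visible inside the allocation.
nvidia-smi -L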
Modules and using salloc and sbatch
Typically, modules loaded and environment variables set on the login nodes will be inherited when you create an interactive salloc session and/or call sbatch.
[]$ module purge
[]$ module load gcc/13.2.0 r/4.4.0
[]$ ml
Currently Loaded Modules:
1) slurm/latest (S) 42) libxau/1.0.8
...
41) xproto/7.0.31
[]$ salloc -A arcc -t 10:00
salloc: Granted job allocation 1243593
salloc: Nodes mbcpu-025 are ready for job
[@mbcpu-025 ~]$ ml
Currently Loaded Modules:
1) slurm/latest (S) 15) libxml2/2.10.3 29) perl/5.38.0 43) libxdmcp/1.1.4 57) curl/8.4.0 71) openjdk/11.0.20.1_1
...
14) xz/5.4.1 28) gdbm/1.23 42) libxau/1.0.8 56) nghttp2/1.57.0 70) openblas/0.3.24
Modules and using salloc and sbatch: Best Practice
Although modules and environment variables are typically inherited, relying on this is not good practice, since we have observed cases where not everything is inherited.
Also, when ARCC is asked to assist, we typically have no idea (and users often forget) how an environment has been set up on a login node.
Best Practice: After performing an salloc, or within the script you submit with sbatch, perform a module purge and then module load only what you explicitly know you need to use (including versions).
When you request ARCC help, this is then documented within your submitted scripts, helping us exactly replicate what you’ve done.
This is good reproducibility practice.
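For example, a minimal sketch of this pattern within a submission script (the account, time, and module versions are placeholders; load whatever your workflow actually needs):
#!/bin/bash
#SBATCH --account=<project-name>
#SBATCH --time=1:00:00

# Start from a clean environment rather than relying on what was inherited.
module purge
# Explicitly load only what is needed, with versions, so the run is reproducible.
module load gcc/13.2.0 r/4.4.0
# Record the loaded modules in the job's output for later reference.
module list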
Track Your Job IDs
Typically, for any questions submitted to ARCC that refer to submitted jobs (and even interactive sessions), we will ask for the Job ID(s).
Remember: There are a number of ways that YOU can track your Job IDs.
Use squeue to find jobs currently running.
Use sacct to find jobs that have completed.
Add the #SBATCH --mail-type / --mail-user options to your submission script. You then have a record within your Inbox.
Use the Slurm $SLURM_JOB_ID environment variable to write the ID into your output files (examples below).
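For example, a minimal sketch (the sacct start time and output fields shown are just one reasonable choice):
# Jobs currently queued or running for your user.
squeue -u $USER
# Jobs that have already completed (since the given date).
sacct -u $USER --starttime=2024-01-01 --format=JobID,JobName,State,Elapsed
# Inside a submission script: record the Job ID in the job's output file.
echo "SLURM_JOB_ID: $SLURM_JOB_ID"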
Including Job IDs will assist ARCC tremendously and ultimately help us resolve your issues more quickly.
What does a general workflow look like?
Getting Started:
Understand your application / programming language.
What are its capabilities and functionality?
Read the documentation, find examples, online forums – community.
Develop/Try/Test:
Typically use an interactive session (salloc) where you’re typing/trying/testing (see the sketch after this list).
Are modules available? If not, submit an HPC Software Consultation request to start the discussion.
Develop code/scripts.
Understand how the command-line works – what commands/scripts to call with options.
Understand if parallelization is available – can you optimize your code/application?
Test against a subset of data. Something that runs quickly – maybe a couple of minutes or hours.
Do the results look correct?
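As a sketch of this develop/try/test stage (the time, core count, and the script and data names are illustrative placeholders only):
# Short interactive session for developing and testing against a subset of data.
salloc -A <project-name> -t 2:00:00 -c 4
module purge
module load gcc/13.2.0 r/4.4.0
# Run a quick test case and check the results before scaling up (names are hypothetical).
Rscript my_test_analysis.R small_subset.csv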
What does a general workflow look like? Continued.
Production:
Put it all together within a bash Slurm script:
Request appropriate resources using #SBATCH options.
Request an appropriate wall time – hours, days…
Load modules: module load …
Run your scripts/command-line.
Finally, submit your job to the cluster (sbatch) using a complete set of data (a full example script is sketched below).
Use: sbatch <script-name.sh>
Monitor the progress of your job(s).
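Putting the pieces together, a minimal production script might look like the sketch below (the account, partition, resource amounts, email address, module versions, and the final command are all placeholders to adapt):
#!/bin/bash
#SBATCH --account=<project-name>
#SBATCH --partition=<partition-name>
#SBATCH --time=1-00:00:00
#SBATCH --nodes=1
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=4
#SBATCH --mem=16G
#SBATCH --mail-type=END,FAIL
#SBATCH --mail-user=<your-email>

# Record the Job ID in the output for later reference.
echo "SLURM_JOB_ID:" $SLURM_JOB_ID

# Clean environment, then load exactly what is needed (versions included).
module purge
module load gcc/13.2.0 r/4.4.0

# Run the full analysis over the complete set of data (script and file names are hypothetical).
Rscript my_full_analysis.R full_dataset.csv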
What does it mean for an application to be parallel?
Read the documentation and look at the command’s help: Does it mention:
Threads - multiple CPUs/cores: Single node, single task, multiple cores.
Example: Chime
OpenMP: Single task, multiple cores. Set the relevant environment variable (e.g. OMP_NUM_THREADS).
OpenMP is an application programming interface (API) that supports multi-platform shared-memory multiprocessing programming in C, C++, and Fortran.
Example: ImageMagick
MPI: Message Passing Interface: Multiple nodes, multiple tasks (how these models map onto Slurm requests is sketched after this list).
OpenMPI: ARCC Wiki: OpenMPI and oneAPI Compiling.
Hybrid: MPI / OpenMP and/or threads.
Examples: DFTB and Quantum Espresso
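A rough sketch of how these parallel models map onto Slurm resource requests, shown as two separate script fragments (the program names are hypothetical and the counts purely illustrative):
# OpenMP / threaded fragment: single node, single task, multiple cores.
#SBATCH --nodes=1
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=8
# Tell OpenMP how many threads to use, based on the Slurm allocation.
export OMP_NUM_THREADS=$SLURM_CPUS_PER_TASK
./my_openmp_program

# MPI fragment: multiple tasks, potentially across multiple nodes.
#SBATCH --nodes=2
#SBATCH --ntasks-per-node=4
# Launch one MPI rank per allocated task with srun.
srun ./my_mpi_program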
What does it mean for an application to be GPU enabled?
Read the documentation and look at the command’s help: Does it mention:
GPU / NVIDIA / CUDA?
Examples:
Applications: AlphaFold and GPU Blast
Via conda-based environments built with GPU libraries and converted to Jupyter kernels (a sketch follows this list):
Examples: TensorFlow and PyTorch
Jupyter Kernels: PyTorch 1.13.1
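A quick way to confirm a GPU is actually visible to your code, assuming a conda environment with PyTorch installed (the module and environment names are hypothetical; adapt them to your site):
#!/bin/bash
#SBATCH --account=<project-name>
#SBATCH --time=10:00
#SBATCH --gres=gpu:1

module purge
module load miniconda3                # hypothetical module name; use your site's conda module
conda activate my-pytorch-env         # hypothetical conda environment with PyTorch installed

# Confirm the allocated device is visible to both the driver and the framework.
nvidia-smi -L
python -c "import torch; print('CUDA available:', torch.cuda.is_available())"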
How can I be a good cluster citizen?
Don’t run intensive applications on the login nodes.
Understand your software/application.
The cluster is a shared resource with multi-tenancy: other users’ jobs may run on the same node, and jobs should not affect each other.
Don’t ask for everything. Don’t use:
--mem=0 (which requests all of a node’s memory).
the --exclusive flag.
Only ask for a GPU if you know it’ll be used.
Use /lscratch for I/O intensive tasks rather than accessing /gscratch over the network (see the sketch after this list). You will need to copy files back before the job ends.
Track usage and job performance:
seff <jobid>
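A minimal sketch of staging work through /lscratch, as mentioned above (the directory layout, file names, and commands are hypothetical; check ARCC documentation for exact paths and policy):
# Work in node-local scratch for I/O intensive steps, then copy results back.
WORKDIR=/lscratch/$SLURM_JOB_ID        # hypothetical per-job directory
mkdir -p $WORKDIR
cp /gscratch/<project-name>/input_data/* $WORKDIR/
cd $WORKDIR
./run_analysis input_data              # hypothetical command
# Copy results back to network storage before the job ends; /lscratch is not persistent.
cp -r results /gscratch/<project-name>/results/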
Being a good Cluster Citizen: Requesting Resources
Good Cluster Citizen:
Only request what you need.
Unless you know your application can utilize multiple nodes/tasks/cores, request a single node/task/core (the default).
If your application cannot utilize multiple nodes/tasks/cores, requesting them will not make your code magically run faster.
If your application is not GPU enabled, having a GPU will not make your code magically run faster.
Within your application/code check that resources are actually being detected and utilized.
Look at job efficiency and performance (example below):
seff <jobid>
This is emailed out if you have Slurm email notifications turned on.
Slurm cheatsheet
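For example, checking how well a completed job used its allocation (the job ID is taken from the earlier example; the sacct output fields are just one reasonable selection):
# Summary of CPU and memory efficiency for a completed job.
seff 13517905
# More detail on what the job actually requested and used.
sacct -j 13517905 --format=JobID,Elapsed,AllocCPUS,MaxRSS,State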
Submitting Useful Tickets via the Portal
When you submit a ticket asking for help we need details. Just saying “my job doesn’t work” doesn’t help us at all.
For any question, please consider adding, where possible:
A clear description of what the problem is. But remember we are not domain experts within your field, so consider the terminology you use.
What service were you using? OnDemand, an interactive session, a submitted job – how did you run it, and what resources did you request?
Job Ids and details of where scripts and log/output files can be found (e.g. folder locations).
Enough details so we can follow what you’ve done and see how you’ve tried to run something.
Enough so we can replicate the issue – again, the idea of reproducibility.
This includes steps such as module loads (with versions), any environment variables you’ve set, and the conda environments you’ve activated.
What documentation/web pages are you basing things on and following?
Consider what you have recently changed. Can you point to versions that previously worked? You might actually be able to resolve this yourself.
Do not assume we know what you’ve done and how you’re doing it. A lot of time can be wasted when we try something that is different to how you’ve actually performed it.
If you don’t know, then please just say.
Help us, to help you!