Introduction: This workshop introduces job management with the Slurm workload manager, demonstrating how to create interactive jobs and how to submit jobs that follow a basic workflow to the cluster queue. After the workshop, participants will understand:
How to create a script that defines their workflow (e.g. loading modules).
How to start interactive sessions to work within, as well as how to submit and track jobs on the cluster.
Participants will require introductory-level experience with Linux, as well as the ability to use a text editor from the command line.
Course Goals:
What is Slurm?
How to start an interactive session, and perform job submission.
How to select appropriate resource allocations.
How to monitor your jobs.
What does a general workflow look like?
Best practices in using HPC.
How to be a good cluster citizen?
01: Slurm
Topics:
Slurm:
Interactive sessions.
Job submission.
Resource selection.
Monitoring.
Workload Managers:
Allocates access to appropriate compute nodes specific to your requests.
Framework for starting, executing, monitoring, and even canceling your jobs.
Queue management and job state notification.
ARCC: Slurm:
Exercises:
Core hour usage: chu_user, chu_account
Interactive Session: salloc
You’re there doing the work.
Suitable for developing and testing over a few hours.
[]$ salloc --help                            # Lots of options.
                                             # Notice short and long form options.
[]$ salloc -A <project-name> -t <wall-time>
# Format for --time: acceptable time formats include "minutes", "minutes:seconds",
# "hours:minutes:seconds", "days-hours", "days-hours:minutes"
# and "days-hours:minutes:seconds".
Interactive Session: salloc: workshop
You’ll only use the reservation for this (and/or other) workshop. Once you have an account you typically do not need it.
But there are use cases where we can create a specific reservation for you.
[]$ salloc -A arccanetrain -t 1:00 --reservation=<reservation-name>
Interactive Session: salloc: What’s happening?
[]$ salloc -A arccanetrain -t 1:00 --reservation=<reservation-name>
salloc: Granted job allocation 13526337
salloc: Nodes m233 are ready for job
# Make a note of the job id.
# Notice the server/node name has changed.
[arcc-t05@m233 intro_to_hpc]$ squeue -u arcc-t05
   JOBID PARTITION     NAME     USER ST  TIME NODES NODELIST(REASON)
13526337     moran interact arcc-t05  R  0:19     1 m233
# For an interactive session: Name = interact
# You have the command-line interactively available to you.
[]$ ...
[]$ squeue -u arcc-t05
   JOBID PARTITION     NAME     USER ST  TIME NODES NODELIST(REASON)
13526337     moran interact arcc-t05  R  1:03     1 m233
# Session will automatically time out:
[]$ salloc: Job 13526337 has exceeded its time limit and its allocation has been revoked.
slurmstepd: error: *** STEP 13526337.interactive ON m233 CANCELLED AT 2024-03-22T09:36:53 DUE TO TIME LIMIT ***
exit
srun: Job step aborted: Waiting up to 32 seconds for job step to finish.
Interactive Session: salloc: Finished Early?
[]$ salloc -A arccanetrain -t 1:00 --reservation=<reservation-name>
salloc: Granted job allocation 13526338
salloc: Nodes m233 are ready for job
[arcc-t05@m233 ...]$ # Do stuff...
[]$ exit
exit
salloc: Relinquishing job allocation 13526338
Submit Jobs: sbatch
You submit a job to the queue and walk away.
Monitor its progress/state using command-line and/or email notifications.
Once complete, come back and analyze results.
Submit Jobs: sbatch: Template:
#!/bin/bash                                  # Shebang indicating this is a bash script.
#SBATCH --account=arccanetrain               # Use #SBATCH to define Slurm related values.
#SBATCH --time=10:00                         # Must define an account and wall-time.
#SBATCH --reservation=<reservation-name>
echo "SLURM_JOB_ID:" $SLURM_JOB_ID           # Can access Slurm related environment variables.
start=$(date +'%D %T')                       # Can call bash commands.
echo "Start:" $start
module load gcc/12.2.0 python/3.10.6         # Load the modules you require for your environment.
python python01.py                           # Call your scripts/commands.
sleep 1m
end=$(date +'%D %T')
echo "End:" $end
Submit Jobs: sbatch: What’s happening?
[]$ sbatch run.sh
Submitted batch job 13526340
[]$ squeue -u arcc-t05
   JOBID PARTITION     NAME     USER ST  TIME NODES NODELIST(REASON)
13526340     moran   run.sh arcc-t05  R  0:05     1 m233
[]$ ls
python01.py run.sh slurm-13526340.out
# You can view this file while the job is still running.
[]$ cat slurm-13526340.out
SLURM_JOB_ID: 13526340
Start: 03/22/24 09:38:36
Python version: 3.10.6 (main, Oct 17 2022, 16:47:32) [GCC 12.2.0]
Version info: sys.version_info(major=3, minor=10, micro=6, releaselevel='final', serial=0)
[]$ squeue -u arcc-t05
   JOBID PARTITION     NAME     USER ST  TIME NODES NODELIST(REASON)
13526340     moran   run.sh arcc-t05  R  0:17     1 m233
Submit Jobs: sbatch: What’s happening?
[]$ squeue -u arcc-t05
   JOBID PARTITION     NAME     USER ST  TIME NODES NODELIST(REASON)
13526340     moran   run.sh arcc-t05  R  0:29     1 m233
[]$ squeue -u arcc-t05
   JOBID PARTITION     NAME     USER ST  TIME NODES NODELIST(REASON)
# squeue only shows pending and running jobs.
# If a job is no longer in the queue then it has finished.
# Finished can mean success, failure, timeout... It’s just no longer running.
[]$ cat slurm-13526340.out
SLURM_JOB_ID: 13526340
Start: 03/22/24 09:38:36
Python version: 3.10.6 (main, Oct 17 2022, 16:47:32) [GCC 12.2.0]
Version info: sys.version_info(major=3, minor=10, micro=6, releaselevel='final', serial=0)
End: 03/22/24 09:39:36
Submit Jobs: sbatch: Cancel?
[]$ sbatch run.sh
Submitted batch job 13526341
[]$ squeue -u arcc-t05
   JOBID PARTITION     NAME     USER ST  TIME NODES NODELIST(REASON)
13526341     moran   run.sh arcc-t05  R  0:03     1 m233
[]$ scancel 13526341
[]$ squeue -u arcc-t05
   JOBID PARTITION     NAME     USER ST  TIME NODES NODELIST(REASON)
[]$ cat slurm-13526341.out
SLURM_JOB_ID: 13526341
Start: 03/22/24 09:40:09
Python version: 3.10.6 (main, Oct 17 2022, 16:47:32) [GCC 12.2.0]
Version info: sys.version_info(major=3, minor=10, micro=6, releaselevel='final', serial=0)
slurmstepd: error: *** JOB 13526341 ON m233 CANCELLED AT 2024-03-22T09:40:17 ***
Submit Jobs: sacct: What happened?
[]$ sacct -u arcc-t05 -X
JobID           JobName  Partition    Account  AllocCPUS      State ExitCode
------------ ---------- ---------- ---------- ---------- ---------- --------
13526337     interacti+      moran arccanetr+          1    TIMEOUT      0:0
13526338     interacti+      moran arccanetr+          1  COMPLETED      0:0
13526340         run.sh      moran arccanetr+          1  COMPLETED      0:0
13526341         run.sh      moran arccanetr+          1 CANCELLED+      0:0
# Lots more information:
[]$ sacct --help
[]$ sacct -u arcc-t05 --format="JobID,Partition,nnodes,NodeList,NCPUS,ReqMem,State,Start,Elapsed" -X
JobID        Partition   NNodes        NodeList      NCPUS     ReqMem      State               Start    Elapsed
------------ ---------- -------- --------------- ---------- ---------- ---------- ------------------- ----------
13526337          moran        1            m233          1      1000M    TIMEOUT 2024-03-22T09:35:25   00:01:28
13526338          moran        1            m233          1      1000M  COMPLETED 2024-03-22T09:37:41   00:00:06
13526340          moran        1            m233          1      1000M  COMPLETED 2024-03-22T09:38:35   00:01:01
13526341          moran        1            m233          1      1000M CANCELLED+ 2024-03-22T09:40:08   00:00:09
Submit Jobs: sbatch: Options:
[]$ sbatch --help
#SBATCH --account=arccanetrain       # Required: account/time.
#SBATCH --time=72:00:00
#SBATCH --job-name=workshop          # Job name: helps to identify it when using squeue.
#SBATCH --nodes=1                    # Options will typically have defaults.
#SBATCH --tasks-per-node=1           # Request resources in accordance with how you want
#SBATCH --cpus-per-task=1            # to parallelize your job, the type of hardware partition
#SBATCH --partition=teton-gpu        # and whether you require a GPU.
#SBATCH --gres=gpu:1
#SBATCH --mem=100G                   # Request specific memory needs.
#SBATCH --mem-per-cpu=10G
#SBATCH --mail-type=ALL              # Get email notifications of the state of the job.
#SBATCH --mail-user=<email-address>
#SBATCH --output=<prefix>_%A.out     # Define a named output file postfixed with the job id.
If you don’t ask, you don’t get: GPU Example:
#!/bin/bash
#SBATCH --account=arccanetrain
#SBATCH --time=1:00
#SBATCH --reservation=HPC_workshop
#SBATCH --partition=teton-gpu
#SBATCH --gres=gpu:1
echo "SLURM_JOB_ID:" $SLURM_JOB_ID
echo "SLURM_GPUS_ON_NODE:" $SLURM_GPUS_ON_NODE
echo "SLURM_JOB_GPUS:" $SLURM_JOB_GPUS
echo "CUDA_VISIBLE_DEVICES:" $CUDA_VISIBLE_DEVICES
nvidia-smi -L

# Output:
SLURM_JOB_ID: 13517905
SLURM_GPUS_ON_NODE: 1
SLURM_JOB_GPUS: 0
CUDA_VISIBLE_DEVICES: 0
GPU 0: Tesla P100-PCIE-16GB (UUID: GPU-c1859587-9722-77f3-1b3a-63e9d4fe9d4f)
If you don’t ask, you don’t get: No GPU device requested:
# Comment out the gres option.
##SBATCH --gres=gpu:1

# Output:
SLURM_JOB_ID: 13517906
SLURM_GPUS_ON_NODE:
SLURM_JOB_GPUS:
CUDA_VISIBLE_DEVICES:
No devices found.
Just because a partition/compute node has something, you still need to explicitly request it.
Common Questions:
How do I know what number of nodes, cores, memory etc to ask for my jobs?
How do I find out whether a cluster/partition supports these resources?
How do I find out whether these resources are available on the cluster?
How long will I have to wait in the queue before my job starts? How busy is the cluster?
How do I monitor the progress of my job?
Common Questions: Suggestions:
How do I know what number of nodes, cores, memory etc to ask for my jobs?
Understand your software and application.
Read the docs – look at the help for commands/options.
Can it run multiple threads – use multiple cores (OpenMP) / nodes (MPI)?
Can it use a GPU? NVIDIA CUDA.
Are there suggestions on data and memory requirements?
How do I find out whether a cluster/partition supports these resources?
How do I find out whether these resources are available on the cluster?
Consult the wiki: Beartooth Hardware Summary Table
How long will I have to wait in the queue before my job starts?
How busy is the cluster?
Current cluster utilization: the sinfo and arccjobs commands, and the SouthPass status page.
How do I monitor the progress of my job?
Slurm commands:
squeue
Common Issues:
Not defining the account and time options.
The account is the name of the project you are associated with. It is not your username.
Requesting combinations of resources that cannot be satisfied: Beartooth Hardware Summary Table.
For example, you cannot request 40 cores on a teton node (max of 32).
Requesting too much memory, or too many GPU devices, with respect to a partition.
My job is pending? Why?
Because the resources are currently not available.
Have you unnecessarily defined a specific partition (restricted yourself) that is busy?
We only have a small number of GPUs.
This is a shared resource - sometimes you just have to be patient…
Check current cluster utilization.
Preemption: Users of an investment get priority on their hardware.
We have the non-investor partition.
02: Workflows and Best Practices
Topics:
What does a general workflow look like?
Best practices in using HPC.
How to be a good cluster citizen?
What does a general workflow look like?
Getting Started:
Understand your application / programming language.
What are its capabilities and functionality?
Read the documentation, find examples, online forums – community.
Develop/Try/Test:
Typically use an interactive session (salloc) where you’re typing/trying/testing.
Are modules available? If not, submit a New Software Request to get the software installed.
Develop code/scripts.
Understand how the command-line works – what commands/scripts to call with options.
Understand if parallelization is available – can you optimize your code/application?
Test against a subset of data. Something that runs quickly – maybe a couple of minutes/hours.
Do the results look correct?
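The "test against a subset" step above can be sketched as follows; the file names and sizes are hypothetical placeholders, with `seq` standing in for a real large dataset:

```shell
#!/bin/sh
# Sketch: carve out a small, quick-running subset before a full production run.
# File names are hypothetical placeholders for illustration.
seq 1 10000 > full_input.txt                 # stand-in for a large dataset
head -n 100 full_input.txt > test_input.txt  # small subset for a quick test
wc -l < test_input.txt                       # confirm the subset size
```

Running your workflow against test_input.txt first lets you check the results look correct in minutes rather than hours.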
What does a general workflow look like?
Production:
Put it all together within a bash Slurm script:
Request appropriate resources using #SBATCH.
Request appropriate wall time – hours, days…
Load modules: module load …
Run scripts/command-line.
Finally, submit your job to the cluster (sbatch) using a complete set of data.
Use: sbatch <script-name.sh>
Monitor job(s) progress.
What does it mean for an application to be parallel?
Read the documentation and look at the command’s help: Does it mention:
Threads - multiple cpus/cores: Single node, single task, multiple cores.
Example: Chime
OpenMP: Single task, multiple cores. Set environment variable.
An application programming interface (API) that supports multi-platform shared-memory multiprocessing programming in C, C++, and Fortran.
Example: ImageMagick
MPI: Message Passing Interface: Multiple nodes, multiple tasks
OpenMPI: ARCC Wiki: OpenMPI and oneAPI Compiling,
Hybrid: MPI / OpenMP and/or threads.
Examples: DFTB and Quantum Espresso
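For OpenMP codes, "set environment variable" usually means matching OMP_NUM_THREADS to the cores Slurm allocated. A minimal sketch (the --cpus-per-task value is illustrative):

```shell
#!/bin/bash
# Sketch: tie the OpenMP thread count to the Slurm allocation.
# In a job script you would request the cores with e.g.:
#SBATCH --cpus-per-task=4
# SLURM_CPUS_PER_TASK is set by Slurm inside a job; outside a job it is
# unset, so default to 1.
export OMP_NUM_THREADS=${SLURM_CPUS_PER_TASK:-1}
echo "OMP_NUM_THREADS=$OMP_NUM_THREADS"
```

This way the application uses exactly the cores you requested, rather than whatever it detects on the node.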
What does it mean for an application to be GPU enabled?
Read the documentation and look at the command’s help: Does it mention:
GPU / Nvidia / Cuda?
Examples:
Applications: AlphaFold and GPU Blast
Via conda based environments built with GPU libraries - and converted to Jupyter kernels:
Examples: TensorFlow and PyTorch
Jupyter Kernels: PyTorch 1.13.1
How can I be a good cluster citizen?
Don’t run intensive applications on the login nodes.
Understand your software/application.
Shared resource - multi-tenancy.
Other jobs running on the same node do not affect each other.
Don’t ask for everything. Don’t use mem=0 or the exclusive tag.
Only ask for a GPU if you know it’ll be used.
Use /lscratch for I/O intensive tasks rather than accessing /gscratch over the network.
You will need to copy files back before the job ends.
Track usage and job performance:
seff <jobid>
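The /lscratch stage-in / work / stage-out pattern can be sketched as below. In a real job the scratch directory would be something like /lscratch/$SLURM_JOB_ID (path assumed – check your cluster's documentation); mktemp and a trivial tr command stand in here so the sketch runs anywhere:

```shell
#!/bin/bash
# Sketch of the stage-in / work / stage-out pattern for node-local scratch.
# SCRATCH is a portable stand-in for a directory on /lscratch.
SCRATCH=$(mktemp -d)
echo "input data" > input.dat             # stand-in for a file on /gscratch
cp input.dat "$SCRATCH/"                  # stage in
tr 'a-z' 'A-Z' < "$SCRATCH/input.dat" > "$SCRATCH/output.dat"  # I/O-heavy work
cp "$SCRATCH/output.dat" .                # stage out before the job ends
rm -rf "$SCRATCH"                         # clean up node-local space
cat output.dat
```

The key point is the stage-out copy: node-local scratch is not visible after the job ends, so results must be copied back first.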
Being a good Cluster Citizen: Requesting Resources:
Good Cluster Citizen:
Only request what you need.
Unless you know your application:
can utilize multiple nodes/tasks/cores, request a single node/task/core (default).
can utilize multiple nodes/tasks/cores, requesting them will not make your code magically run faster.
is GPU enabled, having a GPU will not make your code magically run faster.
Within your application/code check that resources are actually being detected and utilized.
Look at the job efficiency: job performance:
seff <jobid>
This is emailed out if you have Slurm email notifications turned on.
Slurm cheatsheet
Job Efficiency:
[]$ seff 13515489
Job ID: 13515489
Cluster: beartooth
User/Group: salexan5/salexan5
State: COMPLETED (exit code 0)
Cores: 1
CPU Utilized: 00:00:05
CPU Efficiency: 27.78% of 00:00:18 core-walltime
Job Wall-clock time: 00:00:18
Memory Utilized: 0.00 MB (estimated maximum)
Memory Efficiency: 0.00% of 8.00 GB (8.00 GB/node)
Note:
Only accurate if the job is successful.
If the job fails with, say, an OOM (Out-Of-Memory) error, the details will be inaccurate.
This is emailed out if you have Slurm email notifications turned on.
03: Wrapping up the Workshop
Next Steps to look at:
Future Workshops:
Using SouthPass
Data Access and Transfers
Look at:
Slurm: Requesting multiple cores/nodes, memory and GPUs.
Software Installation.
Conda: Creating and using environments.
Convert a conda environment to a Jupyter kernel.
Getting data on/off the cluster.
Summary:
Covered:
Slurm: Interactive sessions, job submission, resource selection and monitoring.
What does a general workflow look like?
Best practices in using HPC.
How to be a good cluster citizen?