Goal: An introduction to what Slurm is, and how to start interactive sessions, submit jobs, and monitor them.
...
Workload Managers
Info |
---|
Slurm is the workload manager (job scheduler) used on ARCC's clusters: it allocates compute resources and queues, schedules and monitors the jobs that run on them. |
...
ARCC: Slurm: Wiki Pages
...
Slurm Related Commands
Core hour usage: chu_user, chu_account
...
Info |
---|
A quick read can be found under: Slurm: Getting Started-Jobs and Nodes. ARCC also hosts a number of more detailed and specific wiki pages: |
...
Interactive Session: salloc
Info |
---|
An interactive session puts you on a compute node with a command line: you’re there doing the work. Suitable for developing and testing over a few hours. |
Code Block |
---|
[]$ salloc --help
[]$ man salloc
# Lots of options.
# Notice the short and long form options.

# The bare minimum. This will provide the defaults of one node, one core and 1G of memory.
[]$ salloc -A <project-name> -t <wall-time>
# Format for --time: Acceptable time formats include "minutes", "minutes:seconds", "hours:minutes:seconds",
# "days-hours", "days-hours:minutes" and "days-hours:minutes:seconds". |
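The defaults can also be overridden on the same line. The sketch below is illustrative rather than ARCC-specific: the option names come from the salloc man page, and the values (4 cores, 8G of memory, a 2 hour wall time) are only example numbers.
Code Block |
---|
# Request a single node with 4 cores and 8G of memory for 2 hours (example values only).
[]$ salloc -A <project-name> -t 2:00:00 --nodes=1 --ntasks=1 --cpus-per-task=4 --mem=8G |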
...
Interactive Session: salloc
: workshop
Info |
---|
You’ll only use the reservation for this (and/or other) workshop, alongside the workshop account arccanetrain. Once you have an account you typically do not need it. But there are use cases when we can create a specific reservation for you. Which itself might require a specific partition (see the GPU example below). |
Code Block |
---|
# CPU only compute node.
[]$ salloc -A <project-name> -t 1:00 --reservation=<reservation-name>
# GPU partition/compute node.
[]$ salloc -A <project-name> -t 1:00 --reservation=<reservation-name> --partition=<partition-name> |
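If you are unsure whether a reservation is active, or when it starts and ends, scontrol can display it. A sketch; <reservation-name> is whatever name ARCC provides.
Code Block |
---|
# Show the time window, node list and allowed accounts for a reservation.
[]$ scontrol show reservation <reservation-name>
# With no name given, all current reservations on the cluster are listed.
[]$ scontrol show reservation |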
...
Interactive Session: squeue
: What’s happening?
Info |
---|
Use the squeue command to view the queue of running and pending jobs across the cluster. This list can be 10s/100s/1000s of lines long. Use the -u <username> option to list only your own jobs. |
Code Block |
---|
[]$ salloc -A <project-name> -t 1:00 --reservation=<reservation-name>
salloc: Granted job allocation 13526337
salloc: Nodes m233 are ready for job
# Make a note of the job id.
# Notice the server/node name has changed.
[arcc-t05@m233 intro_to_hpc]$ squeue -u arcc-t05
JOBID PARTITION NAME USER ST TIME NODES NODELIST(REASON)
13526337 moran interact arcc-t05 R 0:19 1 m233
# For an interactive session: Name = interact
# You have the command-line interactively available to you.
[]$
...
[]$ squeue -u arcc-t05
JOBID PARTITION NAME USER ST TIME NODES NODELIST(REASON)
13526337 moran interact arcc-t05 R 1:03 1 m233
# Session will automatically time out
[]$ salloc: Job 13526337 has exceeded its time limit and its allocation has been revoked.
slurmstepd: error: *** STEP 13526337.interactive ON m233 CANCELLED AT 2024-03-22T09:36:53 DUE TO TIME LIMIT ***
exit
srun: Job step aborted: Waiting up to 32 seconds for job step to finish. |
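If you want more detail than the one line squeue prints (requested resources, time limits, the node list and so on), scontrol can show the full job record. A sketch using the job id from the transcript above:
Code Block |
---|
# Print everything Slurm knows about a pending or running job.
[]$ scontrol show job 13526337 |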
...
Interactive Session: salloc
: Finished Early?
Info |
---|
If you finish using an interactive session before its wall time is up, simply type exit. This will stop the interactive session and release its associated resources back to the cluster and make them available for pending jobs. |
Code Block |
---|
[]$ salloc -A <project-name> -t 1:00 --reservation=<reservation-name>
salloc: Granted job allocation 13526338
salloc: Nodes m233 are ready for job
[arcc-t05@m233 ...]$ Do stuff…
[]$ exit
exit
salloc: Relinquishing job allocation 13526338 |
Info |
---|
Closing the session will also release the job. |
...
Exercise: salloc
: Give It A Go
From a login node, create some interactive sessions using: salloc.
Try different wall times:
Short times to experience an automatic timeout.
Longer times so you can call squeue and see your job in the queue.
Notice how the command-line prompt changes.
...
Submit Jobs: sbatch
Info |
---|
You submit a job to the queue and walk away. Monitor its progress/state using the command line and/or email notifications. Once complete, come back and analyze the results. |
...
Submit Jobs: sbatch
: Example
Info |
---|
The following is an example bash submission script that we will use to submit a job to the cluster. It uses a short test Python file, python01.py, defined here: python script. |
Code Block |
---|
#!/bin/bash                                # Shebang indicating this is a bash script.
                                           # Do NOT put a comment after the shebang, this will cause an error.
#SBATCH --account=<project-name>           # Use #SBATCH to define Slurm related values.
#SBATCH --time=10:00                       # Must define an account and wall-time.
#SBATCH --reservation=<reservation-name>

echo "SLURM_JOB_ID:" $SLURM_JOB_ID         # Can access Slurm related Environment variables.
start=$(date +'%D %T')
echo "Start:" $start                       # Can call bash commands.

module purge
module load gcc/14.2.0 python/3.10.6       # Load the modules you require for your environment.

python python01.py                         # Call your scripts/commands.
sleep 1m

end=$(date +'%D %T')
echo "End:" $end |
...
Submit Jobs: squeue
: What’s happening?
Info |
---|
Remember: Use the squeue command with the -u <username> option to monitor the state of your own jobs. |
Code Block |
---|
[]$ sbatch run.sh
Submitted batch job 13526340
[]$ squeue -u <username>
             JOBID PARTITION     NAME       USER ST       TIME  NODES NODELIST(REASON)
          13526340     moran   run.sh <username>  R       0:05      1 m233 |
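Rather than re-running squeue by hand, it can refresh itself. A sketch; the 10 second interval is only an example, and Ctrl+C stops it.
Code Block |
---|
# Re-query your jobs every 10 seconds until interrupted.
[]$ squeue -u <username> -i 10 |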
Submit Jobs: squeue
: What’s happening Continued?
Code Block |
---|
[]$ squeue -u <username>
             JOBID PARTITION     NAME       USER ST       TIME  NODES NODELIST(REASON)
          13526340     moran   run.sh <username>  R       0:17      1 m233
[]$ ls
python01.py  run.sh  slurm-13526340.out
# You can view this file while the job is still running.
[]$ cat slurm-13526340.out
SLURM_JOB_ID: 13526340
Start: 03/22/24 09:38:36
Python version: 3.10.6 (main, Sep  3 2024, 15:13:56) [GCC 14.2.0]
Version info: sys.version_info(major=3, minor=10, micro=6, releaselevel='final', serial=0)
[]$ squeue -u <username>
             JOBID PARTITION     NAME       USER ST       TIME  NODES NODELIST(REASON)
          13526340     moran   run.sh <username>  R       0:29      1 m233
[]$ squeue -u <username>
             JOBID PARTITION     NAME       USER ST       TIME  NODES NODELIST(REASON)
# squeue only shows pending and running jobs.
# If a job is no longer in the queue then it has finished.
# Finished can mean success, failure, timeout... It’s just no longer running.
[]$ cat slurm-13526340.out
SLURM_JOB_ID: 13526340
Start: 03/22/24 09:38:36
Python version: 3.10.6 (main, Sep  3 2024, 15:13:56) [GCC 14.2.0]
Version info: sys.version_info(major=3, minor=10, micro=6, releaselevel='final', serial=0)
End: 03/22/24 09:39:36 |
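Because the slurm-<job-id>.out file is written while the job runs, you can also follow it live instead of repeatedly calling cat. A sketch using the output file from above:
Code Block |
---|
# Stream new lines as they are appended; Ctrl+C stops following (the job itself keeps running).
[]$ tail -f slurm-13526340.out |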
...
More squeue
Information
Info |
---|
For more information see the main Slurm squeue page and use: squeue --help |
Info |
---|
For example, using the --Format option you can choose exactly which columns are displayed: |
Code Block |
---|
[]$ squeue -u <username> --Format="Account,UserName,JobID,SubmitTime,StartTime,TimeLeft"
ACCOUNT             USER                JOBID               SUBMIT_TIME         START_TIME          TIME_LEFT
<project-name>      <username>          1795458             2024-08-14T10:31:07 2024-08-14T10:31:09 6-04:42:51
<project-name>      <username>          1795453             2024-08-14T10:31:06 2024-08-14T10:31:07 6-04:42:49
<project-name>      <username>          1795454             2024-08-14T10:31:06 2024-08-14T10:31:07 6-04:42:49
... |
Info |
---|
There are various other time related columns, for example EndTime, TimeLimit and TimeUsed.
There are lots of other columns that can be defined, including ones related to the resources (nodes, cores, memory) that have been specifically allocated. |
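As a sketch of pulling in some of those time related columns (field names as documented in the squeue man page):
Code Block |
---|
[]$ squeue -u <username> --Format="JobID,Name,TimeLimit,TimeUsed,TimeLeft,EndTime" |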
...
Submission from your Current Working Directory
Info |
---|
Remember from Linux that your current location is your Current Working Directory, abbreviated to CWD. By default Slurm will look for files, and write output, relative to the folder you submitted your script from, i.e. your CWD. In the example above, sbatch run.sh was called from within the intro_to_hpc folder, so that is where the slurm-<job-id>.out file was written. Within the submission script you can define paths (absolute/relative) to other locations. |
Info |
---|
You can submit a script from any of your allowed locations, but you need to manage and describe the paths to scripts, data and output appropriately. |
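If you would rather not depend on where you happen to submit from, sbatch can set the job's working directory explicitly with --chdir. A minimal sketch; the path shown is only an example and would need to be one of your own locations.
Code Block |
---|
#!/bin/bash
#SBATCH --account=<project-name>
#SBATCH --time=10:00
#SBATCH --chdir=/path/to/my_analysis    # Example path: the batch script starts in this folder.
# Relative paths used by the script now resolve from the --chdir location.
python python01.py |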
...
Submit Jobs: scancel
: Cancel?
Info |
---|
If you have submitted a job, and for whatever reason you want/need to stop it early, then use scancel <job-id>. This will stop the job at its current point within the computation, and return any associated resources back to the cluster. |
Code Block |
---|
[]$ sbatch run.sh
Submitted batch job 13526341
[]$ squeue -u <username>
JOBID PARTITION NAME USER ST TIME NODES NODELIST(REASON)
13526341 moran run.sh <username> R 0:03 1 m233
[]$ scancel 13526341
[]$ squeue -u <username>
JOBID PARTITION NAME USER ST TIME NODES NODELIST(REASON)
[]$ cat slurm-13526341.out
SLURM_JOB_ID: 13526341
Start: 03/22/24 09:40:09
Python version: 3.10.6 (main, Sep 3 2024, 15:13:56) [GCC 14.2.0]
Version info: sys.version_info(major=3, minor=10, micro=6, releaselevel='final', serial=0)
slurmstepd: error: *** JOB 13526341 ON m233 CANCELLED AT 2024-03-22T09:40:17 *** |
Note |
---|
If you know your job no longer needs to be running please cancel it to free up resources - be a good cluster citizen. |
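scancel can also target jobs by more than the job id. A sketch of a few variations (see scancel --help for the full list):
Code Block |
---|
# Cancel every job you own (use with care).
[]$ scancel -u <username>
# Cancel jobs by the name set with --job-name.
[]$ scancel --name=<job-name>
# Cancel only your jobs that are still pending.
[]$ scancel -u <username> --state=PENDING |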
...
Submit Jobs: sacct:
What happened?
Info |
---|
Use the sacct command to see what happened to jobs that are no longer listed by squeue. By default this will only list jobs run since midnight of that day. View the man page (man sacct) for options that look further back in time. It too has a --format option for selecting which columns are displayed. |
Code Block |
---|
[]$ sacct -u <username> -X
JobID JobName Partition Account AllocCPUS State ExitCode
------------ ---------- ---------- ---------- ---------- ---------- --------
13526337 interacti+ moran arccanetr+ 1 TIMEOUT 0:0
13526338 interacti+ moran arccanetr+ 1 COMPLETED 0:0
13526340 run.sh moran arccanetr+ 1 COMPLETED 0:0
13526341 run.sh moran arccanetr+ 1 CANCELLED+ 0:0
[]$ sacct -u <username> --format="JobID,Partition,nnodes,NodeList,NCPUS,ReqMem,State,Start,Elapsed" -X
JobID Partition NNodes NodeList NCPUS ReqMem State Start Elapsed
------------ ---------- -------- --------------- ---------- ---------- ---------- ------------------- ----------
13526337 moran 1 m233 1 1000M TIMEOUT 2024-03-22T09:35:25 00:01:28
13526338 moran 1 m233 1 1000M COMPLETED 2024-03-22T09:37:41 00:00:06
13526340 moran 1 m233 1 1000M COMPLETED 2024-03-22T09:38:35 00:01:01
13526341 moran 1 m233 1 1000M CANCELLED+ 2024-03-22T09:40:08 00:00:09 |
Info |
---|
For more information see the main Slurm sacct page and use: sacct --help |
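Once you know a job id you can also query it directly rather than listing everything since midnight. A sketch (field names as in the sacct man page):
Code Block |
---|
# Look up a single finished job by id.
[]$ sacct -j <job-id> --format="JobID,JobName,State,ExitCode,Elapsed,ReqMem" -X |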
...
Submit Jobs: sbatch
: Options
Info |
---|
Here are some of the common options available: |
Code Block |
---|
[]$ sbatch --help
#SBATCH --account=<project-name>      # Required: account/time
#SBATCH --time=72:00:00
#SBATCH --job-name=workshop # Job name: Help to identify when using squeue.
#SBATCH --nodes=1 # Options will typically have defaults.
#SBATCH --tasks-per-node=1 # Request resources in accordance to how you want
#SBATCH --cpus-per-task=1 # to parallelize your job, type of hardware partition
#SBATCH --partition=mb # and if you require a GPU.
#SBATCH --gres=gpu:1
#SBATCH --mem=100G # Request specific memory needs.
#SBATCH --mem-per-cpu=10G
#SBATCH --mail-type=ALL # Get email notifications of the state of the job.
#SBATCH --mail-user=<email-address>
#SBATCH --output=<prefix>_%A.out # Define a named output file postfixed with the job id. |
...
Submit Jobs: sbatch
: Options: Applied to Example
Info |
---|
Let’s take the previous example, and add some of the additional options: |
Code Block |
---|
#!/bin/bash
#SBATCH --account=<project-name>
#SBATCH --time=10:00
#SBATCH --reservation=<reservation-name>
#SBATCH --job-name=pytest
#SBATCH --nodes=1
#SBATCH --cpus-per-task=1
#SBATCH --mail-type=ALL
#SBATCH --mail-user=<email-address>
#SBATCH --output=slurms/pyresults_%A.out

echo "SLURM_JOB_ID:" $SLURM_JOB_ID          # Can access Slurm related Environment variables.
start=$(date +'%D %T')
echo "Start:" $start                        # Can call bash commands.

module purge
module load gcc/14.2.0 python/3.10.6        # Load the modules you require for your environment.

python python01.py                          # Call your scripts/commands.
sleep 1m

end=$(date +'%D %T')
echo "End:" $end |
Info |
---|
Notice: the job now has a name (pytest) to help identify it in the queue, email notifications are turned on, and the output is written to slurms/pyresults_<job-id>.out (the %A is replaced with the job id) rather than the default slurm-<job-id>.out in the submission folder. |
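One thing to watch with the --output setting above: as far as we are aware, Slurm will not create the slurms folder for you, so it needs to exist before the job starts or the output file cannot be written. A sketch of preparing it at submission time (the script name run.sh follows the earlier examples):
Code Block |
---|
# Create the output folder (if needed), then submit as usual.
[]$ mkdir -p slurms
[]$ sbatch run.sh |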
...
Extended Example: What Does the Run look Like?
Info |
---|
With the above settings written into the submission script, submitting and monitoring the job works exactly as before; squeue now shows the job name pytest and the output is written under the slurms folder. |
...
Info |
---|
Because --mail-type=ALL and --mail-user were set, I also received two emails in my inbox: one when the job began and one when it ended. |
...
Exercise: sbatch
: Give It A Go
Using the script examples (adjust where appropriate) try submitting some jobs.
Once submitted (within a different session) monitor the jobs using the squeue command. Track the job ids, and try changing the job name to distinguish jobs when viewing the pending/running queue.
Cancel some of the jobs.
Maybe try increasing the sleep value to be longer than the requested wall time to trigger a timeout.
Once they’ve completed, run sacct to view the finished jobs, and look at their state.
...
| Workshop Home | Next |