Goal: An introduction to Slurm, covering how to start interactive sessions, submit jobs, and monitor them.
Workload Managers
Allocates access to appropriate compute nodes specific to your requests.
Framework for starting, executing, monitoring, and even canceling your jobs.
Queue management and job state notification.
ARCC: Slurm: Wiki Pages
A quick read can be found under: Slurm: Getting Started-Jobs and Nodes
ARCC also hosts a number of more detailed and specific wiki pages:
Interactive Session: salloc
You’re there doing the work.
Suitable for developing and testing over a few hours.
[]$ salloc --help
[]$ man salloc                                # Lots of options.

# The bare minimum.
# This will provide the defaults of one node, one core and 1G of memory.
[]$ salloc -A <project-name> -t <wall-time>
As with other Linux commands, there are typically short and long forms for the options: -A vs --account and -t vs --time.
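For example, the following two commands request the same allocation, the first using the short forms and the second the long forms (the project name is a placeholder):
[]$ salloc -A <project-name> -t 1:00:00
[]$ salloc --account=<project-name> --time=1:00:00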
Format for -t/--time: acceptable time formats include "minutes", "minutes:seconds", "hours:minutes:seconds", "days-hours", "days-hours:minutes" and "days-hours:minutes:seconds".
Interactive Session: salloc: Workshop
You'll only use the reservation for this (and/or other) workshop.
Once you have an account you typically do not need it.
But there are use cases when we can create a specific reservation for you, which might itself require a partition to be defined if you're using a GPU node (more about that later).
# CPU only compute node.
[]$ salloc -A <project-name> -t 1:00 --reservation=<reservation-name>

# GPU partition/compute node.
[]$ salloc -A <project-name> -t 1:00 --reservation=<reservation-name> --partition=<partition-name>
Interactive Session: squeue: What's happening?
Use the squeue command to find a list of jobs currently pending/running.
This list can be 10s/100s/1000s of lines long.
Use the -u option with your <username> to look specifically at your jobs.
[]$ salloc -A <project-name> -t 1:00 --reservation=<reservation-name>
salloc: Granted job allocation 13526337
salloc: Nodes m233 are ready for job

# Make a note of the job id.
# Notice the server/node name has changed.
[arcc-t05@m233 intro_to_hpc]$ squeue -u arcc-t05
     JOBID PARTITION     NAME     USER ST       TIME  NODES NODELIST(REASON)
  13526337     moran interact arcc-t05  R       0:19      1 m233

# For an interactive session: Name = interact
# You have the command-line interactively available to you.
[]$ ...
[]$ squeue -u arcc-t05
     JOBID PARTITION     NAME     USER ST       TIME  NODES NODELIST(REASON)
  13526337     moran interact arcc-t05  R       1:03      1 m233

# Session will automatically time out.
[]$
salloc: Job 13526337 has exceeded its time limit and its allocation has been revoked.
slurmstepd: error: *** STEP 13526337.interactive ON m233 CANCELLED AT 2024-03-22T09:36:53 DUE TO TIME LIMIT ***
exit
srun: Job step aborted: Waiting up to 32 seconds for job step to finish.
Interactive Session: salloc: Finished Early?
If you finish using an salloc job before the wall time you requested, you can call exit from the command line.
This will stop the interactive session, release its associated resources back to the cluster, and make them available for pending jobs.
[]$ salloc -A <project-name> -t 1:00 --reservation=<reservation-name>
salloc: Granted job allocation 13526338
salloc: Nodes m233 are ready for job
[arcc-t05@m233 ...]$
# Do stuff…
[]$ exit
exit
salloc: Relinquishing job allocation 13526338
Closing the session will also release the job.
Submit Jobs: sbatch
You submit a job to the queue and walk away.
Monitor its progress/state using command-line and/or email notifications.
Once complete, come back and analyze results.
Submit Jobs: sbatch: Example
The following is an example bash submission script that we will use to submit a job to the cluster.
It uses a short test python file defined here: python script.
#!/bin/bash
# Shebang indicating this is a bash script.
# Do NOT put a comment after the shebang, this will cause an error.

#SBATCH --account=<project-name>              # Use #SBATCH to define Slurm related values.
#SBATCH --time=10:00                          # Must define an account and wall-time.
#SBATCH --reservation=<reservation-name>

echo "SLURM_JOB_ID:" $SLURM_JOB_ID            # Can access Slurm related environment variables.
start=$(date +'%D %T')                        # Can call bash commands.
echo "Start:" $start

module purge
module load gcc/13.2.0 python/3.10.6          # Load the modules you require for your environment.

python python01.py                            # Call your scripts/commands.
sleep 1m

end=$(date +'%D %T')
echo "End:" $end
As with salloc, a submission script must at a minimum have an #SBATCH --account and an #SBATCH --time defined.
Notice we are using the long forms in the example above.
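Putting just those requirements together, a minimal skeleton (with placeholder values) might look like this:
#!/bin/bash
#SBATCH --account=<project-name>    # Required: the project/account to charge.
#SBATCH --time=10:00                # Required: the requested wall time.

# Your commands go here.
echo "Running as job:" $SLURM_JOB_ID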
Submit Jobs: squeue: What's happening?
Remember: Use the squeue command to find a list of your jobs currently pending/running.
[]$ sbatch run.sh
Submitted batch job 13526340

[]$ squeue -u <username>
     JOBID PARTITION     NAME       USER ST       TIME  NODES NODELIST(REASON)
  13526340     moran   run.sh <username>  R       0:05      1 m233

[]$ ls
python01.py  run.sh  slurm-13526340.out

[]$ cat slurm-13526340.out
SLURM_JOB_ID: 13526340
Start: 03/22/24 09:38:36
Python version: 3.10.6 (main, Oct 17 2022, 16:47:32) [GCC 12.2.0]
Version info: sys.version_info(major=3, minor=10, micro=6, releaselevel='final', serial=0)

[]$ squeue -u <username>
     JOBID PARTITION     NAME       USER ST       TIME  NODES NODELIST(REASON)
  13526340     moran   run.sh <username>  R       0:17      1 m233
By default, an output file of the form slurm-<job-id>.out will be generated.
You can view this file while the job is still running. Only view, do not edit.
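For example, to follow the output as it is being written (using the job id from above), the standard tail command works:
[]$ tail -f slurm-13526340.out       # Follow the output live; Ctrl+C stops tail without affecting the job.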
Submit Jobs: squeue: What's happening? Continued
The squeue command only shows pending and running jobs.
If a job is no longer in the queue then it has finished.
[]$ squeue -u <username>
     JOBID PARTITION     NAME       USER ST       TIME  NODES NODELIST(REASON)
  13526340     moran   run.sh <username>  R       0:29      1 m233

[]$ squeue -u <username>
     JOBID PARTITION     NAME       USER ST       TIME  NODES NODELIST(REASON)

[]$ cat slurm-13526340.out
SLURM_JOB_ID: 13526340
Start: 03/22/24 09:38:36
Python version: 3.10.6 (main, Oct 17 2022, 16:47:32) [GCC 12.2.0]
Version info: sys.version_info(major=3, minor=10, micro=6, releaselevel='final', serial=0)
End: 03/22/24 09:39:36
Finished can mean success, failure, timeout, out-of-memory... It’s just no longer running.
More squeue Information
For more information see the main Slurm squeue page and use:
[]$ squeue --help
[]$ man squeue
For example, using the --Format option you can display additional columns, such as how much time is left from your requested wall time using TimeLeft:
[]$ squeue -u <username> --Format="Account,UserName,JobID,SubmitTime,StartTime,TimeLeft"
ACCOUNT             USER                JOBID     SUBMIT_TIME           START_TIME            TIME_LEFT
<project-name>      <username>          1795458   2024-08-14T10:31:07   2024-08-14T10:31:09   6-04:42:51
<project-name>      <username>          1795453   2024-08-14T10:31:06   2024-08-14T10:31:07   6-04:42:49
<project-name>      <username>          1795454   2024-08-14T10:31:06   2024-08-14T10:31:07   6-04:42:49
...
There are various other time related columns:
SubmitTime: The time that the job was submitted.
StartTime: Actual or expected start time of the job or job step. This will differ from the submit time if your job has been pending in the queue.
TimeLeft: Time left for the job to execute. This value is calculated by subtracting the job's time used from its time limit.
TimeLimit: Time limit for the job.
TimeUsed: Time used by the job.
EndTime: The time of job termination, actual or expected.
There are lots of other columns that can be defined, including ones related to the resources (nodes, cores, memory) that have been specifically allocated, as sketched below.
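As a sketch, the resource-related field names below are taken from the squeue --Format documentation; the exact output will vary with your jobs:
[]$ squeue -u <username> --Format="JobID,Name,NumNodes,NumCPUs,MinMemory,TimeLimit,TimeUsed"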
Submission from your Current Working Directory
Remember from Linux, that your current location is your Current Working Directory - abbreviated to CWD.
By default, Slurm will look for files, and write output, from the folder you submitted your script from, i.e. your CWD.
In the example above, if I called sbatch run.sh from ~/intro_to_modules/ then the Python script should reside within this folder. Any output will be written into this folder.
Within the submission script you can define paths (absolute/relative) to other locations.
You can submit a script from any of your allowed locations: /home, /project and/or /gscratch.
But you need to manage and describe the paths to scripts, data, and output appropriately.
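For example, a hypothetical submission script fragment (the folder names below are illustrative, not ARCC-specific) could reference its script via either an absolute or a relative path:
# Submitted from within /project/<project-name>/<username>/analysis/
python /project/<project-name>/<username>/scripts/python01.py    # Absolute path.
python ../scripts/python01.py                                    # Relative path from the CWD.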
Submit Jobs: scancel: Cancel?
If you have submitted a job, and for whatever reason you want/need to stop it early, then use scancel <job-id>.
This will stop the job at its current point within the computation, and return any associated resources back to the cluster.
[]$ sbatch run.sh
Submitted batch job 13526341

[]$ squeue -u <username>
     JOBID PARTITION     NAME       USER ST       TIME  NODES NODELIST(REASON)
  13526341     moran   run.sh <username>  R       0:03      1 m233

[]$ scancel 13526341

[]$ squeue -u <username>
     JOBID PARTITION     NAME       USER ST       TIME  NODES NODELIST(REASON)

[]$ cat slurm-13526341.out
SLURM_JOB_ID: 13526341
Start: 03/22/24 09:40:09
Python version: 3.10.6 (main, Oct 17 2022, 16:47:32) [GCC 12.2.0]
Version info: sys.version_info(major=3, minor=10, micro=6, releaselevel='final', serial=0)
slurmstepd: error: *** JOB 13526341 ON m233 CANCELLED AT 2024-03-22T09:40:17 ***
If you know your job no longer needs to be running please cancel it to free up resources - be a good cluster citizen.
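If you have several jobs queued and want to stop them all, scancel also accepts a username (use with care):
[]$ scancel -u <username>            # Cancels all of your pending and running jobs.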
Submit Jobs: sacct: What happened?
Use the sacct command to view your jobs that have completed.
By default this will only list jobs since midnight of the current day.
View the -S, --starttime (and -E, --endtime=<end_time>) options to understand how to define a start (and end) time to configure different date/time intervals.
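For example, to list your jobs since a particular date (the date below is a placeholder in YYYY-MM-DD form):
[]$ sacct -u <username> -X -S <YYYY-MM-DD>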
It too has a --format option allowing you to display additional columns:
[]$ sacct -u <username> -X
JobID           JobName  Partition    Account  AllocCPUS      State ExitCode
------------ ---------- ---------- ---------- ---------- ---------- --------
13526337     interacti+      moran arccanetr+          1    TIMEOUT      0:0
13526338     interacti+      moran arccanetr+          1  COMPLETED      0:0
13526340         run.sh      moran arccanetr+          1  COMPLETED      0:0
13526341         run.sh      moran arccanetr+          1 CANCELLED+      0:0

[]$ sacct -u <username> --format="JobID,Partition,nnodes,NodeList,NCPUS,ReqMem,State,Start,Elapsed" -X
JobID         Partition   NNodes        NodeList      NCPUS     ReqMem      State               Start    Elapsed
------------ ---------- -------- --------------- ---------- ---------- ---------- ------------------- ----------
13526337          moran        1            m233          1      1000M    TIMEOUT 2024-03-22T09:35:25   00:01:28
13526338          moran        1            m233          1      1000M  COMPLETED 2024-03-22T09:37:41   00:00:06
13526340          moran        1            m233          1      1000M  COMPLETED 2024-03-22T09:38:35   00:01:01
13526341          moran        1            m233          1      1000M CANCELLED+ 2024-03-22T09:40:08   00:00:09
For more information see the main Slurm sacct page and use:
[]$ sacct --help
[]$ man sacct
Submit Jobs: sbatch: Options
Here are some of the common options available:
[]$ sbatch --help

#SBATCH --account=<project-name>      # Required: account/time.
#SBATCH --time=72:00:00

#SBATCH --job-name=workshop           # Job name: helps to identify the job when using squeue.

#SBATCH --nodes=1                     # Options will typically have defaults.
#SBATCH --tasks-per-node=1            # Request resources in accordance with how you want
#SBATCH --cpus-per-task=1             # to parallelize your job, the type of hardware partition,
#SBATCH --partition=mb                # and whether you require a GPU.
#SBATCH --gres=gpu:1

#SBATCH --mem=100G                    # Request specific memory needs.
#SBATCH --mem-per-cpu=10G

#SBATCH --mail-type=ALL               # Get email notifications of the state of the job.
#SBATCH --mail-user=<email-address>

#SBATCH --output=<prefix>_%A.out      # Define a named output file postfixed with the job id.
Submit Jobs: sbatch: Options: Applied to Example
Let’s take the previous example, and add some of the additional options:
#!/bin/bash
#SBATCH --account=<project-name>
#SBATCH --time=10:00
#SBATCH --reservation=<reservation-name>
#SBATCH --job-name=pytest
#SBATCH --nodes=1
#SBATCH --cpus-per-task=1
#SBATCH --mail-type=ALL
#SBATCH --mail-user=<email-address>
#SBATCH --output=slurms/pyresults_%A.out

echo "SLURM_JOB_ID:" $SLURM_JOB_ID            # Can access Slurm related environment variables.
start=$(date +'%D %T')                        # Can call bash commands.
echo "Start:" $start

module purge
module load gcc/13.2.0 python/3.10.6          # Load the modules you require for your environment.

python python01.py                            # Call your scripts/commands.
sleep 1m

end=$(date +'%D %T')
echo "End:" $end
Notice:
I’ve given the job a specific name and have requested email notifications.
The output is written to a sub folder slurms/ with a name of the form pyresults_<jobid>.out (matching the --output option above).
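One caveat: Slurm will not create that sub folder for you; if it does not exist the job can fail to write its output, so create it before submitting:
[]$ mkdir -p slurms                  # Create the output folder once, before the first submission.
[]$ sbatch run.sh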
Extended Example: What Does the Run Look Like?
With the above settings (written into a file called run.sh), a submission will look something like the following:
In my inbox, I also received two emails with the subjects:
medicinebow Slurm Job_id=1817260 Name=pytest Began, Queued time 00:00:00
This will have no text within the email body.
medicinebow Slurm Job_id=1817260 Name=pytest Ended, Run time 00:01:01, COMPLETED, ExitCode 0
The body of this email contained the seff results.
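If the seff utility is available on the cluster (an assumption here), you can also run it yourself after a job finishes to see the same efficiency summary:
[]$ seff 1817260                     # Report CPU/memory efficiency for the completed job.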