Goal: Introduction to Slurm: how to start interactive sessions, submit jobs, and monitor them.
...
Info |
---|
You will only use a reservation for this workshop (and possibly others). Once you have your own account you typically do not need one, although there are use cases where we can create a specific reservation for you.
|
Code Block |
---|
[]$ salloc -A <project-name> -t 1:00 --reservation=<reservation-name> |
...
Interactive Session: squeue
: What’s happening?
Code Block |
---|
[]$ salloc -A <project-name> -t 1:00 --reservation=<reservation-name>
salloc: Granted job allocation 13526337
salloc: Nodes m233 are ready for job
# Make a note of the job id.
# Notice the server/node name has changed.
[arcc-t05@m233 intro_to_hpc]$ squeue -u arcc-t05
JOBID PARTITION NAME USER ST TIME NODES NODELIST(REASON)
13526337 moran interact arcc-t05 R 0:19 1 m233
# For an interactive session: Name = interact
# You have the command-line interactively available to you.
[]$
...
[]$ squeue -u arcc-t05
JOBID PARTITION NAME USER ST TIME NODES NODELIST(REASON)
13526337 moran interact arcc-t05 R 1:03 1 m233
# Session will automatically time out
[]$ salloc: Job 13526337 has exceeded its time limit and its allocation has been revoked.
slurmstepd: error: *** STEP 13526337.interactive ON m233 CANCELLED AT 2024-03-22T09:36:53 DUE TO TIME LIMIT ***
exit
srun: Job step aborted: Waiting up to 32 seconds for job step to finish. |
...
Interactive Session: salloc
: Finished Early?
Code Block |
---|
[]$ salloc -A <project-name> -t 1:00 --reservation=<reservation-name>
salloc: Granted job allocation 13526338
salloc: Nodes m233 are ready for job
[arcc-t05@m233 ...]$ Do stuff…
[]$ exit
exit
salloc: Relinquishing job allocation 13526338 |
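The `-t 1:00` used above requests one minute, because Slurm reads a two-field time as minutes:seconds. As a rough aid for reading time values, the following sketch converts the time formats Slurm accepts into seconds. The function name `slurm_time_to_seconds` is hypothetical (it is not a Slurm command) and the sketch does no validation beyond counting fields.

```shell
#!/bin/bash
# Hypothetical helper (not part of Slurm): convert a Slurm time string to seconds.
# Handles: minutes, minutes:seconds, hours:minutes:seconds,
#          days-hours, days-hours:minutes, days-hours:minutes:seconds.
slurm_time_to_seconds() {
    local spec="$1" days=0 h=0 m=0 s=0
    if [[ "$spec" == *-* ]]; then
        # Split off the "days-" prefix; the rest starts with hours.
        days="${spec%%-*}"
        spec="${spec#*-}"
        IFS=: read -r h m s <<< "$spec"
    else
        local fields
        IFS=: read -r -a fields <<< "$spec"
        case "${#fields[@]}" in
            1) m="${fields[0]}" ;;
            2) m="${fields[0]}"; s="${fields[1]}" ;;
            3) h="${fields[0]}"; m="${fields[1]}"; s="${fields[2]}" ;;
            *) echo "unsupported time format: $1" >&2; return 1 ;;
        esac
    fi
    # Fields that were not supplied count as zero.
    h="${h:-0}"; m="${m:-0}"; s="${s:-0}"
    # 10# forces base 10 so values such as "08" are not read as octal.
    echo $(( 10#$days * 86400 + 10#$h * 3600 + 10#$m * 60 + 10#$s ))
}

slurm_time_to_seconds 1:00        # the salloc request above: prints 60
slurm_time_to_seconds 72:00:00    # three days: prints 259200
```

This is why the interactive session above was revoked shortly after the one-minute mark.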
...
Info |
---|
You submit a job to the queue and walk away. Monitor its progress/state using command-line and/or email notifications. Once complete, come back and analyze results.
|
...
Submit Jobs: sbatch
...
: Example
Info |
---|
The following is an example script that we will use to submit a job to the cluster. It uses a short test python file defined here: python script. |
Code Block |
---|
#!/bin/bash
# Shebang indicating this is a bash script.
# Do NOT put a comment on the same line as the shebang; this will cause an error.
#SBATCH --account=<project-name> # Use #SBATCH to define Slurm related values.
#SBATCH --time=10:00 # Must define an account and wall-time.
#SBATCH --reservation=<reservation-name>
echo "SLURM_JOB_ID:" $SLURM_JOB_ID # Can access Slurm related Environment variables.
start=$(date +'%D %T') # Can call bash commands.
echo "Start:" $start
module purge
module load gcc/13.2.0 python/3.10.6 # Load the modules you require for your environment.
python python01.py # Call your scripts/commands.
sleep 1m
end=$(date +'%D %T')
echo "End:" $end |
...
Code Block |
---|
[]$ sbatch run.sh
Submitted batch job 13526340
[]$ squeue -u <username>
JOBID PARTITION NAME USER ST TIME NODES NODELIST(REASON)
13526340 moran run.sh <username> R 0:05 1 m233
[]$ ls
python01.py run.sh slurm-13526340.out
[]$ cat slurm-13526340.out
SLURM_JOB_ID: 13526340
Start: 03/22/24 09:38:36
Python version: 3.10.6 (main, Oct 17 2022, 16:47:32) [GCC 12.2.0]
Version info: sys.version_info(major=3, minor=10, micro=6, releaselevel='final', serial=0)
[]$ squeue -u <username>
JOBID PARTITION NAME USER ST TIME NODES NODELIST(REASON)
13526340 moran run.sh <username> R 0:17 1 m233 |
Info |
---|
By default, an output file named slurm-<job-id>.out is generated in the folder you submitted from. You can view this file while the job is still running. Only view it, do not edit it.
|
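To follow the output of a job you have just submitted, you need its job id. The sketch below pulls the id out of the confirmation line that sbatch prints and builds the default output file name from it. The function names `job_id_from_sbatch` and `default_output_file` are hypothetical helpers, and the demonstration runs on the confirmation line shown above rather than calling sbatch.

```shell
#!/bin/bash
# Hypothetical helper: extract the job id from sbatch's confirmation line,
# e.g. "Submitted batch job 13526340" -> "13526340".
job_id_from_sbatch() {
    awk '/^Submitted batch job/ { print $4 }'
}

# Hypothetical helper: default output file name for a given job id.
default_output_file() {
    echo "slurm-$1.out"
}

# Demonstration on the confirmation line shown earlier (no cluster needed):
jobid=$(echo "Submitted batch job 13526340" | job_id_from_sbatch)
echo "Job id: $jobid"
echo "Output: $(default_output_file "$jobid")"

# On a cluster, the same pieces could be combined like this:
#   jobid=$(sbatch run.sh | job_id_from_sbatch)
#   tail -f "$(default_output_file "$jobid")"   # view only; Ctrl-C stops tail, not the job
```

The output file only appears once the job has started, so `tail -f` may fail while the job is still pending. sbatch also has a `--parsable` option that prints the job id without the surrounding text.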
...
Submit Jobs: squeue
: What’s happening? Continued
...
Code Block |
---|
[]$ squeue -u <username>
JOBID PARTITION NAME USER ST TIME NODES NODELIST(REASON)
13526340 moran run.sh <username> R 0:29 1 m233
[]$ squeue -u <username>
JOBID PARTITION NAME USER ST TIME NODES NODELIST(REASON)
[]$ cat slurm-13526340.out
SLURM_JOB_ID: 13526340
Start: 03/22/24 09:38:36
Python version: 3.10.6 (main, Oct 17 2022, 16:47:32) [GCC 12.2.0]
Version info: sys.version_info(major=3, minor=10, micro=6, releaselevel='final', serial=0)
End: 03/22/24 09:39:36 |
Info |
---|
The squeue command only shows pending and running jobs. If a job is no longer in the queue then it has finished. Finished can mean success, failure, timeout... It’s just no longer running.
|
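Because a finished job simply disappears from squeue, an empty result can be used to wait for a job. The following is a minimal sketch under that assumption: `wait_for_job` is a hypothetical function name, and `SQUEUE_CMD` is a hypothetical variable that lets you substitute a stand-in command when trying the function away from a cluster. It tells you only that the job has left the queue, not whether it succeeded.

```shell
#!/bin/bash
# Hypothetical helper: block until a job has left the queue.
# Arguments: <job-id> [polling interval in seconds, default 30]
wait_for_job() {
    local jobid="$1" interval="${2:-30}"
    # -h suppresses the header and -j selects the job, so empty output
    # means the job is no longer pending or running.
    while [ -n "$("${SQUEUE_CMD:-squeue}" -h -j "$jobid" 2>/dev/null)" ]; do
        sleep "$interval"
    done
    echo "Job $jobid has left the queue."
}

# Usage on a cluster:
#   wait_for_job 13526340
```

Keep the polling interval generous: every squeue call is a request to the scheduler, which is shared by all users.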
...
...
More squeue
Information
The main Slurm squeue page.
Code Block |
---|
# Lots more information
[]$ squeue --help
[]$ man squeue
# Display more columns:
# For example how much time is left of your requested wall time: TimeLeft
[]$ squeue -u <username> --Format="Account,UserName,JobID,SubmitTime,StartTime,TimeLeft"
ACCOUNT USER JOBID SUBMIT_TIME START_TIME TIME_LEFT
<project-name> <username> 1795458 2024-08-14T10:31:07 2024-08-14T10:31:09 6-04:42:51
<project-name> <username> 1795453 2024-08-14T10:31:06 2024-08-14T10:31:07 6-04:42:49
<project-name> <username> 1795454 2024-08-14T10:31:06 2024-08-14T10:31:07 6-04:42:49
... |
...
Submission from your Current Working Directory
Info |
---|
Remember from Linux that your current location is your Current Working Directory, abbreviated to CWD. By default Slurm looks for files, and writes output, relative to the folder you submitted your script from, i.e. your CWD. In the example above, if I called sbatch run.sh from ~/intro_to_modules/ then the Python script should reside within that folder, and any output will be written into it. Within the submission script you can define paths (absolute/relative) to other locations. |
Info |
---|
You can submit a script from any of your allowed locations: /home, /project and/or /gscratch. But you need to manage and describe the paths to scripts, data and output appropriately. |
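One way to keep paths manageable is to build them from the submission folder. Inside a job, Slurm sets the environment variable `SLURM_SUBMIT_DIR` to the folder sbatch was called from. The sketch below falls back to `$PWD` when that variable is not set, so it can be tried outside a job. The `results` sub folder and `output.txt` are hypothetical names used only for illustration.

```shell
#!/bin/bash
# Sketch: build paths relative to the folder the job was submitted from.
# SLURM_SUBMIT_DIR is set by Slurm inside a job; outside a job fall back to $PWD.
submit_dir="${SLURM_SUBMIT_DIR:-$PWD}"

# Hypothetical layout: the Python script sits next to run.sh,
# and results go into a sub folder.
script_path="$submit_dir/python01.py"
results_dir="$submit_dir/results"

# Create the results folder if it does not exist yet.
mkdir -p "$results_dir"

echo "Script:  $script_path"
echo "Results: $results_dir"

# Inside a submission script you could then call, for example:
#   python "$script_path" > "$results_dir/output.txt"
```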
...
Submit Jobs: scancel
: Cancel?
Code Block |
---|
[]$ sbatch run.sh
Submitted batch job 13526341
[]$ squeue -u <username>
JOBID PARTITION NAME USER ST TIME NODES NODELIST(REASON)
13526341 moran run.sh <username> R 0:03 1 m233
[]$ scancel 13526341
[]$ squeue -u <username>
JOBID PARTITION NAME USER ST TIME NODES NODELIST(REASON)
[]$ cat slurm-13526341.out
SLURM_JOB_ID: 13526341
Start: 03/22/24 09:40:09
Python version: 3.10.6 (main, Oct 17 2022, 16:47:32) [GCC 12.2.0]
Version info: sys.version_info(major=3, minor=10, micro=6, releaselevel='final', serial=0)
slurmstepd: error: *** JOB 13526341 ON m233 CANCELLED AT 2024-03-22T09:40:17 *** |
Info |
---|
If you know your job no longer needs to be running please cancel it to free up resources - be a good cluster citizen. |
...
Submit Jobs: sacct
: What happened?
The main Slurm sacct page.
Code Block |
---|
[]$ sacct -u <username> -X
JobID JobName Partition Account AllocCPUS State ExitCode
------------ ---------- ---------- ---------- ---------- ---------- --------
13526337 interacti+ moran arccanetr+ 1 TIMEOUT 0:0
13526338 interacti+ moran arccanetr+ 1 COMPLETED 0:0
13526340 run.sh moran arccanetr+ 1 COMPLETED 0:0
13526341 run.sh moran arccanetr+ 1 CANCELLED+ 0:0
# Lots more information
[]$ sacct --help
[]$ man sacct
# Display more columns:
[]$ sacct -u <username> --format="JobID,Partition,nnodes,NodeList,NCPUS,ReqMem,State,Start,Elapsed" -X
JobID Partition NNodes NodeList NCPUS ReqMem State Start Elapsed
------------ ---------- -------- --------------- ---------- ---------- ---------- ------------------- ----------
13526337 moran 1 m233 1 1000M TIMEOUT 2024-03-22T09:35:25 00:01:28
13526338 moran 1 m233 1 1000M COMPLETED 2024-03-22T09:37:41 00:00:06
13526340 moran 1 m233 1 1000M COMPLETED 2024-03-22T09:38:35 00:01:01
13526341 moran 1 m233 1 1000M CANCELLED+ 2024-03-22T09:40:08 00:00:09 |
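When you have many jobs, a per-state summary can be easier to read than the full table. The sketch below counts jobs per state from pipe-delimited `JobID|State` lines. `count_job_states` is a hypothetical function name, and the demonstration input is typed in by hand from the four jobs listed above rather than taken from a live sacct call.

```shell
#!/bin/bash
# Hypothetical helper: count jobs per state.
# Expects pipe-delimited lines of the form "JobID|State" on stdin.
count_job_states() {
    awk -F'|' '
        # Keep only the first word of the state, so that a longer value
        # such as "CANCELLED by <uid>" is counted as "CANCELLED".
        { split($2, words, " "); counts[words[1]]++ }
        END { for (state in counts) print state, counts[state] }
    ' | sort
}

# Demonstration using the four jobs listed above (entered by hand):
printf '%s\n' \
    "13526337|TIMEOUT" \
    "13526338|COMPLETED" \
    "13526340|COMPLETED" \
    "13526341|CANCELLED" \
    | count_job_states

# On a cluster, the input could come from sacct itself:
#   sacct -u <username> -X --noheader --parsable2 --format=JobID,State | count_job_states
```

The `--parsable2` option makes sacct print pipe-delimited fields without truncating them, which is why the `+` suffix seen in the table above does not appear in that form.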
...
Submit Jobs: sbatch
: Options
Info |
---|
Here are some of the common options available: |
Code Block |
---|
[]$ sbatch --help
#SBATCH --account=<project-name> # Required: account/time
#SBATCH --time=72:00:00
#SBATCH --job-name=workshop # Job name: Help to identify when using squeue.
#SBATCH --nodes=1 # Options will typically have defaults.
#SBATCH --tasks-per-node=1 # Request resources in accordance to how you want
#SBATCH --cpus-per-task=1 # to parallelize your job, type of hardware partition
#SBATCH --partition=mb # and if you require a GPU.
#SBATCH --gres=gpu:1
#SBATCH --mem=100G # Request specific memory needs.
#SBATCH --mem-per-cpu=10G
#SBATCH --mail-type=ALL # Get email notifications of the state of the job.
#SBATCH --mail-user=<email-address>
#SBATCH --output=<prefix>_%A.out # Define a named output file postfixed with the job id. |
Info |
---|
Both salloc and sbatch have dozens of options, in both short and long form. Some options overlap in functionality; for example, on a single node -G 1 requests the same as --gres=gpu:1. Please consult the command --help and man pages and/or web links to discover further options not listed.
|
...
Submit Jobs: sbatch
: Options: Applied to Example
Info |
---|
Let’s take the previous example, and add some of the additional options: |
Code Block |
---|
#!/bin/bash
#SBATCH --account=<project-name>
#SBATCH --time=10:00
#SBATCH --reservation=<reservation-name>
#SBATCH --job-name=pytest
#SBATCH --nodes=1
#SBATCH --cpus-per-task=1
#SBATCH --mail-type=ALL
#SBATCH --mail-user=<email-address>
#SBATCH --output=slurms/pyresults_%A.out
echo "SLURM_JOB_ID:" $SLURM_JOB_ID # Can access Slurm related Environment variables.
start=$(date +'%D %T') # Can call bash commands.
echo "Start:" $start
module purge
module load gcc/13.2.0 python/3.10.6 # Load the modules you require for your environment.
python python01.py # Call your scripts/commands.
sleep 1m
end=$(date +'%D %T')
echo "End:" $end |
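Note that the `--output` value here points into a `slurms/` sub folder. Slurm does not create missing folders for output files, so that folder needs to exist before you submit. As a small sketch, the hypothetical helper `ensure_output_dir` below reads the first `#SBATCH --output=` line from a submission script and creates the folder it refers to, relative to the current working directory.

```shell
#!/bin/bash
# Hypothetical helper: create the folder named in a script's --output directive.
# Argument: path to the submission script.
ensure_output_dir() {
    local script="$1" pattern dir
    # Take the first "#SBATCH --output=..." line and keep only the value.
    pattern=$(sed -n 's/^#SBATCH[[:space:]]*--output=\([^[:space:]]*\).*/\1/p' "$script" | head -n 1)
    if [ -z "$pattern" ]; then
        return 0    # no --output directive: Slurm writes into the CWD
    fi
    dir=$(dirname "$pattern")
    mkdir -p "$dir"
    echo "$dir"
}

# Usage on a cluster:
#   ensure_output_dir run.sh && sbatch run.sh
```

For a one-off submission, `mkdir -p slurms` before calling sbatch does the same job.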
...
Extended Example: What Does the Run look Like?
Info |
---|
With the above settings, a submission will look something like the following: |
Example Flow and Output:
Code Block |
---|
# Submit the job:
[intro_to_modules]$ sbatch run.sh
Submitted batch job 1817260
# Notice the NAME is now 'pytest'
[intro_to_modules]$ squeue -u <username>
JOBID PARTITION NAME USER ST TIME NODES NODELIST(REASON)
...
# View the output while the job is running.
# The output is now in a sub folder under slurms/
# It also uses the name 'pyresults_<job_id>.out'
[intro_to_modules]$ cat slurms/pyresults_1817260.out
SLURM_JOB_ID: 1817260
Start: 08/14/24 14:48:38
Python version: 3.10.6 (main, Apr 30 2024, 11:23:04) [GCC 13.2.0]
Version info: sys.version_info(major=3, minor=10, micro=6, releaselevel='final', serial=0)
[intro_to_modules]$ squeue -u <username>
JOBID PARTITION NAME USER ST TIME NODES NODELIST(REASON)
[intro_to_modules]$ cat slurms/pyresults_1817260.out
SLURM_JOB_ID: 1817260
Start: 08/14/24 14:48:38
Python version: 3.10.6 (main, Apr 30 2024, 11:23:04) [GCC 13.2.0]
Version info: sys.version_info(major=3, minor=10, micro=6, releaselevel='final', serial=0)
End: 08/14/24 14:49:38 |
|
Info |
---|
In my inbox, I also received two emails with the subjects:
medicinebow Slurm Job_id=1817260 Name=pytest Began, Queued time 00:00:00
This email had no text in its body.
medicinebow Slurm Job_id=1817260 Name=pytest Ended, Run time 00:01:01, COMPLETED, ExitCode 0
The body of this email contained the seff results.
|
...
...