Goal: An introduction to Slurm: how to start interactive sessions, submit jobs, and monitor them.
...
Workload Managers
Info |
---|
|
...
ARCC: Slurm: Wiki Pages
Info |
---|
ARCC also hosts a number of more detailed and specific wiki pages: |
...
Slurm Related Commands
Info |
---|
|
...
Interactive Session: salloc
Info |
---|
|
Code Block |
---|
[]$ salloc --help
[]$ man salloc
# Lots of options.
# The bare minimum.
# This will provide the defaults of one node, one core and 1G of memory.
[]$ salloc -A <project-name> -t <wall-time> |
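The bare minimum above accepts the default resources. As a rough sketch of how more might be requested, the helper below assembles a `salloc` command line and prints it for review rather than running it. The function name `build_salloc` and the example values are hypothetical; `--nodes`, `--cpus-per-task` and `--mem` are standard `salloc` options, but the limits that actually apply depend on your cluster and project.

```shell
# Hypothetical helper: print a salloc command line so it can be checked first.
# Arguments: <project-name> <wall-time> [cpus, default 1] [memory, default 1G]
build_salloc() {
    local account="${1:-}" walltime="${2:-}" cpus="${3:-1}" mem="${4:-1G}"
    if [ -z "$account" ] || [ -z "$walltime" ]; then
        echo "usage: build_salloc <project-name> <wall-time> [cpus] [memory]" >&2
        return 1
    fi
    # One node, with the requested cores and memory on that node.
    echo "salloc -A $account -t $walltime --nodes=1 --cpus-per-task=$cpus --mem=$mem"
}
```

For example, `build_salloc <project-name> 30:00 4 8G` prints a command requesting four cores and 8G of memory for thirty minutes, which can then be copied to the prompt.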
...
Interactive Session: squeue
: What’s happening?
Code Block |
---|
Info |
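Use the squeue command to list the jobs in the queue. This list can be tens, hundreds, or thousands of lines long. Use the -u <username> option to list only your own jobs. |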
Code Block |
---|
[]$ salloc -A <project-name> -t 1:00 --reservation=<reservation-name>
salloc: Granted job allocation 13526337
salloc: Nodes m233 are ready for job
# Make a note of the job id.
# Notice the server/node name has changed.
[arcc-t05@m233 intro_to_hpc]$ squeue -u arcc-t05
JOBID PARTITION NAME USER ST TIME NODES NODELIST(REASON)
13526337 moran interact arcc-t05 R 0:19 1 m233
# For an interactive session: Name = interact
# You have the command-line interactively available to you.
[]$
...
[]$ squeue -u arcc-t05
JOBID PARTITION NAME USER ST TIME NODES NODELIST(REASON)
13526337 moran interact arcc-t05 R 1:03 1 m233
# Session will automatically time out
[]$ salloc: Job 13526337 has exceeded its time limit and its allocation has been revoked.
slurmstepd: error: *** STEP 13526337.interactive ON m233 CANCELLED AT 2024-03-22T09:36:53 DUE TO TIME LIMIT ***
exit
srun: Job step aborted: Waiting up to 32 seconds for job step to finish. |
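The session above ended because it reached its wall-time. One way to keep an eye on this is to ask `squeue` for the time remaining. The sketch below assumes `squeue` is on your `PATH`; the function name `time_left` is hypothetical, while `%i`, `%j`, `%t`, `%M` and `%L` are `squeue` format fields for job id, job name, state, time used and time left.

```shell
# Hypothetical helper: list a user's jobs with time used and time remaining.
# Arguments: [username, defaults to the current user]
time_left() {
    local user="${1:-$USER}"
    # Columns: job id, job name, state, time used, time left.
    squeue -u "$user" -o "%.10i %.12j %.2t %.10M %.10L"
}
```

Running `time_left` inside an interactive session shows how long remains before the allocation is revoked.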
...
Interactive Session: salloc
: Finished Early?
Info |
---|
If you finish using an interactive session before its wall-time is up, type exit. This will stop the interactive session and release its associated resources back to the cluster, making them available for pending jobs. |
Code Block |
---|
[]$ salloc -A <project-name> -t 1:00 --reservation=<reservation-name>
salloc: Granted job allocation 13526338
salloc: Nodes m233 are ready for job
[arcc-t05@m233 ...]$ Do stuff…
[]$ exit
exit
salloc: Relinquishing job allocation 13526338 |
Info |
Closing the session will also release the job. |
...
Submit Jobs: sbatch
Info |
---|
|
...
Submit Jobs: sbatch
: Example
Info |
---|
The following is an example bash submission script that we will use to submit a job to the cluster. It uses a short test python file defined here: python script. |
Code Block |
---|
#!/bin/bash
# Shebang indicating this is a bash script.
# Do NOT put a comment after the shebang, this will cause an error.
#SBATCH --account=<project-name>      # Use #SBATCH to define Slurm related values.
#SBATCH --time=10:00                  # Must define an account and wall-time.
#SBATCH --reservation=<reservation-name>
echo "SLURM_JOB_ID:" $SLURM_JOB_ID    # Can access Slurm related Environment variables.
start=$(date +'%D %T')                # Can call bash commands.
echo "Start:" $start
module purge
module load gcc/13.2.0 python/3.10.6  # Load the modules you require for your environment.
python python01.py                    # Call your scripts/commands.
sleep 1m
end=$(date +'%D %T')
echo "End:" $end |
Note |
---|
As with |
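Since a submission script must define an account and a wall-time, a quick check before calling `sbatch` can catch a missing directive. The following is a minimal sketch, not part of Slurm: the function name `check_sbatch_script` is hypothetical, and it only looks at the first option on each `#SBATCH` line, so a script that packs several options onto one line may be reported incorrectly.

```shell
# Hypothetical pre-submit check: does the script set an account and a wall-time?
# Arguments: <path to submission script>
check_sbatch_script() {
    local script="$1" status=0
    # Accept the long form (--account=... or --account ...) or the short form (-A).
    if ! grep -Eq '^#SBATCH[[:space:]]+(--account[= ]|-A)' "$script"; then
        echo "missing: #SBATCH --account=<project-name>" >&2
        status=1
    fi
    # Accept the long form (--time=... or --time ...) or the short form (-t).
    if ! grep -Eq '^#SBATCH[[:space:]]+(--time[= ]|-t)' "$script"; then
        echo "missing: #SBATCH --time=<wall-time>" >&2
        status=1
    fi
    return "$status"
}
```

A possible use is `check_sbatch_script run.sh && sbatch run.sh`, which submits only when both directives are found.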
...
Submit Jobs: squeue
: What’s happening?
Info |
---|
Remember: Use the |
Code Block |
---|
[]$ sbatch run.sh
Submitted batch job 13526340
[]$ squeue -u <username>
JOBID PARTITION NAME USER ST TIME NODES NODELIST(REASON)
13526340 moran run.sh <username> R 0:05 1 m233
[]$ ls
python01.py run.sh slurm-13526340.out
[]$ cat slurm-13526340.out
SLURM_JOB_ID: 13526340
Start: 03/22/24 09:38:36
Python version: 3.10.6 (main, Oct 17 2022, 16:47:32) [GCC 12.2.0]
Version info: sys.version_info(major=3, minor=10, micro=6, releaselevel='final', serial=0)
[]$ squeue -u <username>
JOBID PARTITION NAME USER ST TIME NODES NODELIST(REASON)
13526340 moran run.sh <username> R 0:17 1 m233 |
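The transcript above submits a job and then checks on it by hand. As a sketch of how this could be scripted, the two helpers below capture the job id at submission and then poll `squeue` until the job is no longer listed. The function names `submit_job` and `wait_for_job` are hypothetical; `sbatch --parsable` and `squeue -h -j` are standard options. Keep the polling interval generous, since frequent `squeue` calls add load to the scheduler.

```shell
# Hypothetical helper: submit a script and print only the job id.
# --parsable makes sbatch print "jobid" (or "jobid;cluster") instead of
# the usual "Submitted batch job <jobid>" message.
submit_job() {
    local out
    out=$(sbatch --parsable "$@") || return 1
    echo "${out%%;*}"   # Drop the ";cluster" suffix if there is one.
}

# Hypothetical helper: block until the job no longer appears in squeue.
# Arguments: <job-id> [poll interval in seconds, default 10]
wait_for_job() {
    local jobid="$1" interval="${2:-10}"
    # -h suppresses the header, so empty output means the job is not listed.
    while [ -n "$(squeue -h -j "$jobid" 2>/dev/null)" ]; do
        sleep "$interval"
    done
}
```

For example, `jobid=$(submit_job run.sh) && wait_for_job "$jobid" 30 && cat "slurm-$jobid.out"` would print the output file once the job has left the queue.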
...
Submit Jobs: scancel
: Cancel?
Code Block |
---|
Info |
If you have submitted a job and, for whatever reason, you want or need to stop it early, then use scancel <job-id>. This will stop the job at its current point within the computation and return any associated resources back to the cluster. |
Code Block |
---|
[]$ sbatch run.sh
Submitted batch job 13526341
[]$ squeue -u <username>
JOBID PARTITION NAME USER ST TIME NODES NODELIST(REASON)
13526341 moran run.sh <username> R 0:03 1 m233
[]$ scancel 13526341
[]$ squeue -u <username>
JOBID PARTITION NAME USER ST TIME NODES NODELIST(REASON)
[]$ cat slurm-13526341.out
SLURM_JOB_ID: 13526341
Start: 03/22/24 09:40:09
Python version: 3.10.6 (main, Oct 17 2022, 16:47:32) [GCC 12.2.0]
Version info: sys.version_info(major=3, minor=10, micro=6, releaselevel='final', serial=0)
slurmstepd: error: *** JOB 13526341 ON m233 CANCELLED AT 2024-03-22T09:40:17 *** |
Note |
---|
If you know your job no longer needs to be running, please cancel it to free up resources - be a good cluster citizen. |
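Cancelling by job id works well for a single job. When several jobs need to go, `scancel` can also filter by user, state and job name. The helpers below are a sketch under that assumption; the function names `cancel_pending` and `cancel_by_name` are hypothetical, while `-u`, `--state` and `--name` are standard `scancel` options.

```shell
# Hypothetical helper: cancel only the jobs that are still waiting in the queue.
# Arguments: [username, defaults to the current user]
cancel_pending() {
    scancel -u "${1:-$USER}" --state=PENDING
}

# Hypothetical helper: cancel all of a user's jobs that share a job name.
# Arguments: <job-name> [username, defaults to the current user]
cancel_by_name() {
    if [ -z "${1:-}" ]; then
        echo "usage: cancel_by_name <job-name> [username]" >&2
        return 1
    fi
    scancel -u "${2:-$USER}" --name="$1"
}
```

For the jobs in this walkthrough, `cancel_by_name run.sh` would cancel every job of yours named `run.sh`, so check `squeue -u <username>` first.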
...
Submit Jobs: sacct
: What happened?
Info |
---|
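Use the sacct command to view accounting information for your jobs, including those that have already finished. By default this will only list jobs from midnight of the current day. See sacct --help, man sacct, or the main Slurm sacct page for more options. |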
Code Block |
---|
[]$ sacct -u <username> -X
JobID           JobName  Partition    Account  AllocCPUS      State ExitCode
------------ ---------- ---------- ---------- ---------- ---------- --------
13526337     interacti+      moran arccanetr+          1    TIMEOUT      0:0
13526338     interacti+      moran arccanetr+          1  COMPLETED      0:0
13526340         run.sh      moran arccanetr+          1  COMPLETED      0:0
13526341         run.sh      moran arccanetr+          1 CANCELLED+      0:0
# Lots more information
[]$ sacct --help
[]$ man sacct
# Display more columns:
[]$ sacct -u <username> --format="JobID,Partition,nnodes,NodeList,NCPUS,ReqMem,State,Start,Elapsed" -X
JobID         Partition   NNodes        NodeList      NCPUS     ReqMem      State               Start    Elapsed
------------ ---------- -------- --------------- ---------- ---------- ---------- ------------------- ----------
13526337          moran        1            m233          1      1000M    TIMEOUT 2024-03-22T09:35:25   00:01:28
13526338          moran        1            m233          1      1000M  COMPLETED 2024-03-22T09:37:41   00:00:06
13526340          moran        1            m233          1      1000M  COMPLETED 2024-03-22T09:38:35   00:01:01
13526341          moran        1            m233          1      1000M CANCELLED+ 2024-03-22T09:40:08   00:00:09 |
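Because `sacct` only lists jobs from midnight by default, looking further back needs an explicit start time. The sketch below combines `-S` (start time) with `-n` (no header) and `-P` (pipe-delimited output) to count jobs by final state. The function name `job_state_counts` is hypothetical, and the `awk` step is just one way to summarise the output.

```shell
# Hypothetical helper: count a user's jobs by final state since a start time.
# Arguments: <start-time, e.g. 2024-03-20> [username, defaults to current user]
job_state_counts() {
    local start="$1" user="${2:-$USER}"
    # -X: one line per job, -n: no header, -P: "|" delimited fields.
    sacct -u "$user" -X -n -P -S "$start" --format=JobID,State |
        awk -F'|' '{ split($2, s, " "); count[s[1]]++ }   # "CANCELLED by <uid>" -> CANCELLED
                   END { for (st in count) print st, count[st] }' |
        sort
}
```

Calling `job_state_counts 2024-03-20` prints one line per state with the number of jobs that ended in it.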
...