Goal: Introduce Slurm and show how to start interactive sessions, submit jobs, and monitor them.

...

Workload Managers 

Info
  1. Allocates access to appropriate compute nodes specific to your requests.

  2. Framework for starting, executing, monitoring, and even canceling your jobs.

  3. Queue management and job state notification.

...

ARCC: Slurm: Wiki Pages 

Info

ARCC also hosts a number of more detailed and specific wiki pages:

...

Slurm Related Commands

Info

...

Interactive Session: salloc

Info
  • You’re there doing the work.

  • Suitable for developing and testing over a few hours.

Code Block
[]$ salloc --help
[]$ man salloc
# Lots of options. 

# The bare minimum.
# This will provide the defaults of one node, one core and 1G of memory.
[]$ salloc -A <project-name> -t <wall-time>
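
The <wall-time> value accepts several layouts, and a bare 1:00 is read as minutes:seconds (one minute), not hours:minutes. The sketch below is a hypothetical helper (to_slurm_time is not a Slurm command) that turns a whole number of minutes into the days-hours:minutes:seconds layout, which Slurm also accepts and which is harder to misread.

```shell
# Hypothetical helper, not part of Slurm: convert whole minutes
# into the days-hours:minutes:seconds wall-time layout.
to_slurm_time() {
    total_min=$1
    days=$(( total_min / 1440 ))             # 1440 minutes in a day
    hours=$(( (total_min % 1440) / 60 ))
    mins=$(( total_min % 60 ))
    printf '%d-%02d:%02d:00\n' "$days" "$hours" "$mins"
}

to_slurm_time 1     # same wall time as "-t 1:00"
to_slurm_time 90    # an hour and a half
```

The printed value can be passed straight to salloc, for example: salloc -A <project-name> -t "$(to_slurm_time 90)".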

...

Interactive Session: squeue: What’s happening?

Info

Use the squeue command to find a list of jobs currently pending/running.

This list can be tens, hundreds, or thousands of lines long.

Use the -u option with your <username> to list only your jobs.

Code Block
[]$ salloc -A <project-name> -t 1:00 --reservation=<reservation-name>
salloc: Granted job allocation 13526337
salloc: Nodes m233 are ready for job
# Make a note of the job id.

# Notice the server/node name has changed.
[arcc-t05@m233 intro_to_hpc]$ squeue -u arcc-t05
             JOBID PARTITION     NAME     USER ST       TIME  NODES NODELIST(REASON)
          13526337     moran interact arcc-t05  R       0:19      1 m233
# For an interactive session: Name = interact
# You have the command-line interactively available to you.
[]$ 
...
[]$ squeue -u arcc-t05
             JOBID PARTITION     NAME     USER ST       TIME  NODES NODELIST(REASON)
          13526337     moran interact arcc-t05  R       1:03      1 m233
# Session will automatically time out
[]$ salloc: Job 13526337 has exceeded its time limit and its allocation has been revoked.
slurmstepd: error: *** STEP 13526337.interactive ON m233 CANCELLED AT 2024-03-22T09:36:53 DUE TO TIME LIMIT ***
exit
srun: Job step aborted: Waiting up to 32 seconds for job step to finish.
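
When scripting around squeue, it can help to pull out a single column rather than read the whole table. The following is a minimal sketch, assuming the default column layout shown above, where ST is the fifth field; job_state is a hypothetical helper name, and the sample text is copied from the session above so the snippet runs without a cluster.

```shell
# Hypothetical helper: print the ST (state) column for one job id,
# reading squeue-style text on standard input.
job_state() {
    awk -v id="$1" '$1 == id { print $5 }'
}

# Sample text taken from the squeue output above.
sample='             JOBID PARTITION     NAME     USER ST       TIME  NODES NODELIST(REASON)
          13526337     moran interact arcc-t05  R       0:19      1 m233'

printf '%s\n' "$sample" | job_state 13526337

# On the cluster, squeue can report the state directly:
#   squeue -j <job-id> -h -o %T     # -h drops the header, %T is the state
```

An empty result means the job id was not in the list, which for squeue usually means the job is no longer pending or running.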

...

Interactive Session: salloc: Finished Early?

Info

If you finish with an salloc session before the wall time you requested runs out, you can exit from the command line.

This stops the interactive session and releases its resources back to the cluster, making them available for pending jobs.

Code Block
[]$ salloc -A <project-name> -t 1:00 --reservation=<reservation-name>
salloc: Granted job allocation 13526338
salloc: Nodes m233 are ready for job
[arcc-t05@m233 ...]$ Do stuff…
[]$ exit
exit
salloc: Relinquishing job allocation 13526338

Info

Closing the session will also release the job.
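
It is easy to lose track of whether a terminal is still inside an allocation. A small sketch, assuming only that Slurm sets SLURM_JOB_ID in the shell started by salloc (describe_session is a hypothetical helper name):

```shell
# Hypothetical helper: report whether this shell is inside a Slurm job.
describe_session() {
    if [ -n "${SLURM_JOB_ID:-}" ]; then
        echo "inside job ${SLURM_JOB_ID}: type exit to release it"
    else
        echo "not inside a Slurm job"
    fi
}

describe_session
```

Running it before typing exit helps avoid closing a login shell by mistake.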

...

Submit Jobs: sbatch

Info
  • You submit a job to the queue and walk away.

  • Monitor its progress/state using command-line and/or email notifications.

  • Once complete, come back and analyze results.
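
For the email notifications mentioned above, sbatch provides the --mail-type and --mail-user options. The fragment below is a sketch only: <email-address> is a placeholder, and whether and how mail is delivered depends on how the cluster is configured.

```shell
#!/bin/bash
#SBATCH --account=<project-name>
#SBATCH --time=10:00
#SBATCH --mail-type=BEGIN,END,FAIL        # Email when the job starts, ends, or fails.
#SBATCH --mail-user=<email-address>       # Placeholder: replace with your own address.

# SLURM_JOB_NAME is set by Slurm inside a job; the fallback lets this line run elsewhere too.
job_name="${SLURM_JOB_NAME:-not-in-a-job}"
echo "Running: $job_name"
```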

...

Submit Jobs: sbatch: Example

Info

The following is an example bash submission script that we will use to submit a job to the cluster.

It uses a short test Python file defined here: python script.

Code Block
#!/bin/bash
# Shebang indicating this is a bash script.
# Do NOT put a comment on the shebang line itself; this will cause an error.
#SBATCH --account=<project-name>          # Use #SBATCH to define Slurm related values.
#SBATCH --time=10:00                      # Must define an account and wall-time.
#SBATCH --reservation=<reservation-name>
echo "SLURM_JOB_ID: $SLURM_JOB_ID"        # Can access Slurm related environment variables.
start=$(date +'%D %T')                    # Can call bash commands.
echo "Start: $start"
module purge
module load gcc/13.2.0 python/3.10.6      # Load the modules you require for your environment.
python python01.py                        # Call your scripts/commands.
sleep 1m
end=$(date +'%D %T')
echo "End: $end"
Note

As with salloc, a submission script must define at least #SBATCH --account and #SBATCH --time.

...

Submit Jobs: squeue: What’s happening?

Info

Remember: Use the squeue command to find a list of your jobs currently pending/running.

Code Block
[]$ sbatch run.sh
Submitted batch job 13526340
[]$ squeue -u <username>
             JOBID PARTITION     NAME      USER  ST       TIME  NODES NODELIST(REASON)
          13526340     moran   run.sh <username>  R       0:05      1 m233
[]$ ls
python01.py  run.sh  slurm-13526340.out

[]$ cat slurm-13526340.out
SLURM_JOB_ID: 13526340
Start: 03/22/24 09:38:36
Python version: 3.10.6 (main, Oct 17 2022, 16:47:32) [GCC 12.2.0]
Version info: sys.version_info(major=3, minor=10, micro=6, releaselevel='final', serial=0)

[]$ squeue -u <username>
             JOBID PARTITION     NAME       USER ST       TIME  NODES NODELIST(REASON)
          13526340     moran   run.sh <username>  R       0:17      1 m233
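
If you want to reuse the job id in later commands, you can capture it from the sbatch message. The sketch below uses two hypothetical helpers (parse_job_id and default_output_file) and relies only on the message text and the slurm-<job-id>.out file name shown above; a script that sets its own output file name would need a different second helper.

```shell
# Hypothetical helper: extract the id from "Submitted batch job <id>".
parse_job_id() {
    awk '/^Submitted batch job/ { print $4 }'
}

# Hypothetical helper: default output file name, as seen in the ls listing above.
default_output_file() {
    echo "slurm-$1.out"
}

job_id=$(echo "Submitted batch job 13526340" | parse_job_id)
echo "$job_id"
default_output_file "$job_id"

# On the cluster, the first line would instead be:
#   job_id=$(sbatch run.sh | parse_job_id)
# sbatch also has a --parsable option that prints the id without the message text.
```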

...

Submit Jobs: scancel: Cancel?

Info

If you have submitted a job and, for whatever reason, want or need to stop it early, use scancel <job-id>.

This will stop the job at its current point within the computation, and return any associated resources back to the cluster.

Code Block
[]$ sbatch run.sh
Submitted batch job 13526341
[]$ squeue -u <username>
             JOBID PARTITION     NAME       USER ST       TIME  NODES NODELIST(REASON)
          13526341     moran   run.sh <username>  R       0:03      1 m233
[]$ scancel 13526341
[]$ squeue -u <username>
             JOBID PARTITION     NAME     USER ST       TIME  NODES NODELIST(REASON)

[]$ cat slurm-13526341.out
SLURM_JOB_ID: 13526341
Start: 03/22/24 09:40:09
Python version: 3.10.6 (main, Oct 17 2022, 16:47:32) [GCC 12.2.0]
Version info: sys.version_info(major=3, minor=10, micro=6, releaselevel='final', serial=0)
slurmstepd: error: *** JOB 13526341 ON m233 CANCELLED AT 2024-03-22T09:40:17 ***
Note

If you know your job no longer needs to run, please cancel it to free up resources. Be a good cluster citizen.
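
Because scancel acts immediately, a small wrapper that checks its argument first can prevent mistakes. This is a minimal sketch, not an ARCC-provided tool: cancel_job and the DRY_RUN variable are hypothetical names, and the check only accepts plain numeric job ids like the ones shown on this page.

```shell
# Hypothetical wrapper: validate the job id, then cancel it.
# With DRY_RUN=1 (the default here) it only prints the command it would run.
cancel_job() {
    case "$1" in
        ''|*[!0-9]*)
            echo "not a job id: $1" >&2
            return 1
            ;;
    esac
    if [ "${DRY_RUN:-1}" = 1 ]; then
        echo "scancel $1"
    else
        scancel "$1"
    fi
}

cancel_job 13526341
```

Set DRY_RUN=0 to actually call scancel once the printed command looks right.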

...

Submit Jobs: sacct: What happened?

Info

Use the sacct command to view your jobs that have completed.

By default, this only lists jobs since midnight of the current day.

See the -S, --starttime (and -E, --endtime) options to understand how to define a start (and end) time to query different date/time intervals.

The main Slurm sacct page.

Code Block
[]$ sacct -u <username> -X
JobID           JobName  Partition    Account  AllocCPUS      State ExitCode
------------ ---------- ---------- ---------- ---------- ---------- --------
13526337     interacti+      moran arccanetr+          1    TIMEOUT      0:0
13526338     interacti+      moran arccanetr+          1  COMPLETED      0:0
13526340         run.sh      moran arccanetr+          1  COMPLETED      0:0
13526341         run.sh      moran arccanetr+          1 CANCELLED+      0:0

# Lots more information
[]$ sacct --help
[]$ man sacct

# Display more columns:
[]$ sacct -u <username> --format="JobID,Partition,nnodes,NodeList,NCPUS,ReqMem,State,Start,Elapsed" -X
JobID         Partition   NNodes        NodeList      NCPUS     ReqMem      State               Start    Elapsed
------------ ---------- -------- --------------- ---------- ---------- ---------- ------------------- ----------
13526337          moran        1            m233          1      1000M    TIMEOUT 2024-03-22T09:35:25   00:01:28
13526338          moran        1            m233          1      1000M  COMPLETED 2024-03-22T09:37:41   00:00:06
13526340          moran        1            m233          1      1000M  COMPLETED 2024-03-22T09:38:35   00:01:01
13526341          moran        1            m233          1      1000M CANCELLED+ 2024-03-22T09:40:08   00:00:09
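
To get a quick summary of how your jobs ended, the State column can be tallied. The sketch below assumes the default seven-column layout of the first sacct listing above, where State is the sixth field; count_states is a hypothetical helper name, and the sample rows are copied from that listing so the snippet runs without a cluster.

```shell
# Hypothetical helper: count jobs per State from default sacct output on stdin.
# Skips the two header lines and reads State from the sixth column.
count_states() {
    awk 'NR > 2 && NF >= 7 { count[$6]++ }
         END { for (s in count) print s, count[s] }' | sort
}

count_states <<'EOF'
JobID           JobName  Partition    Account  AllocCPUS      State ExitCode
------------ ---------- ---------- ---------- ---------- ---------- --------
13526337     interacti+      moran arccanetr+          1    TIMEOUT      0:0
13526338     interacti+      moran arccanetr+          1  COMPLETED      0:0
13526340         run.sh      moran arccanetr+          1  COMPLETED      0:0
13526341         run.sh      moran arccanetr+          1 CANCELLED+      0:0
EOF

# On the cluster, with an explicit start date (YYYY-MM-DD):
#   sacct -u <username> -X -S 2024-03-22 | count_states
```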

...