What is Slurm
Goal: Introduce Slurm and show how to start interactive sessions, submit jobs, and monitor them.
- 1 Workload Managers
- 2 Interactive Session: salloc
- 3 Exercise: salloc: Give It A Go
- 4 Submit Jobs: sbatch
- 5 Submit Jobs: sbatch: Example
- 6 Submit Jobs: squeue: What’s happening?
- 7 More squeue Information
- 8 Submission from your Current Working Directory
- 9 Submit Jobs: scancel: Cancel?
- 10 Submit Jobs: sacct: What happened?
- 11 Submit Jobs: sbatch: Options
- 12 Submit Jobs: sbatch: Options: Applied to Example
- 13 Extended Example: What Does the Run Look Like?
- 14 Exercise: sbatch: Give It A Go
Workload Managers
Allocates access to appropriate compute nodes specific to your requests.
Framework for starting, executing, monitoring, and even canceling your jobs.
Queue management and job state notification.
ARCC: Slurm: Wiki Pages
A quick read can be found under: Slurm: Getting Started-Jobs and Nodes
ARCC also hosts a number of more detailed and specific wiki pages:
Interactive Session: salloc
You’re there doing the work.
Suitable for developing and testing over a few hours.
[]$ salloc --help
[]$ man salloc
# Lots of options.
# The bare minimum.
# This will provide the defaults of one node, one core and 1G of memory.
[]$ salloc -A <project-name> -t <wall-time>
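As a concrete sketch (the project name below is purely illustrative; substitute your own project and required wall time):
# Request a one-hour interactive session on the default resources.
# "arcc_demo" is a hypothetical project name - use your own.
[]$ salloc -A arcc_demo -t 1:00:00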
Interactive Session: salloc: Workshop
# CPU-only compute node.
[]$ salloc -A <project-name> -t 1:00 --reservation=<reservation-name>
# GPU partition/compute node.
[]$ salloc -A <project-name> -t 1:00 --reservation=<reservation-name> --partition=<partition-name>
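If you are unsure which reservation or partition names are available, the standard Slurm query commands can list them; a quick sketch:
# List currently active reservations (names, node lists, start/end times).
[]$ scontrol show reservation
# List partitions and the state of their nodes.
[]$ sinfo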
Interactive Session: squeue: What’s happening?
[]$ salloc -A <project-name> -t 1:00 --reservation=<reservation-name>
salloc: Granted job allocation 13526337
salloc: Nodes m233 are ready for job
# Make a note of the job id.
# Notice the server/node name has changed.
[arcc-t05@m233 intro_to_hpc]$ squeue -u arcc-t05
JOBID PARTITION NAME USER ST TIME NODES NODELIST(REASON)
13526337 moran interact arcc-t05 R 0:19 1 m233
# For an interactive session: Name = interact
# You have the command-line interactively available to you.
[]$
...
[]$ squeue -u arcc-t05
JOBID PARTITION NAME USER ST TIME NODES NODELIST(REASON)
13526337 moran interact arcc-t05 R 1:03 1 m233
# The session will automatically time out once the requested wall time has been reached.
[]$ salloc: Job 13526337 has exceeded its time limit and its allocation has been revoked.
slurmstepd: error: *** STEP 13526337.interactive ON m233 CANCELLED AT 2024-03-22T09:36:53 DUE TO TIME LIMIT ***
exit
srun: Job step aborted: Waiting up to 32 seconds for job step to finish.
Interactive Session: salloc: Finished Early?
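If you finish before your requested wall time, you do not need to wait for the time limit; a minimal sketch of ending the session yourself:
# From inside the interactive session, leave the shell to release the allocation.
[arcc-t05@m233 intro_to_hpc]$ exit
# Or, from a login node, cancel it by job id.
[]$ scancel <job-id>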
Exercise: salloc: Give It A Go
Submit Jobs: sbatch
Submit Jobs: sbatch: Example
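As a sketch of what such a submission might look like (the project name, job name, and the script body are placeholders, not the workshop's actual example):
# Contents of a minimal batch script, e.g. hello_world.sh
#!/bin/bash
#SBATCH --account=<project-name>    # same value you pass to salloc with -A
#SBATCH --time=5:00                 # wall time: here 5 minutes (mm:ss)
#SBATCH --job-name=hello_world      # name reported by squeue/sacct
#SBATCH --output=slurm_%j.out       # %j expands to the job id
echo "Running on $(hostname) at $(date)"
sleep 60
echo "Finished at $(date)"

# Submit the script and make a note of the job id that is returned.
[]$ sbatch hello_world.sh
Submitted batch job <job-id>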
Submit Jobs: squeue: What’s happening?
Submit Jobs: squeue: What’s happening? Continued
More squeue Information
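Beyond squeue -u <username>, a few standard squeue options are often useful; a sketch:
# Long format, which includes each job's time limit.
[]$ squeue -u <username> -l
# Only a specific job.
[]$ squeue -j <job-id>
# Estimated start times for pending jobs.
[]$ squeue -u <username> --start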
Submission from your Current Working Directory
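By default, sbatch uses the directory you submit from as the job's working directory, so relative paths in the script (and the default slurm-<job-id>.out output file) are resolved there; the --chdir option can point the job elsewhere. A sketch (paths and script name are placeholders):
# Submit from the directory the job should run in.
[]$ cd /path/to/my_run
[]$ sbatch hello_world.sh
# Or keep the script elsewhere and set the working directory explicitly.
[]$ sbatch --chdir=/path/to/my_run hello_world.sh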
Submit Jobs: scancel: Cancel?
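A sketch of typical usage, using the job id reported by sbatch or squeue:
# Cancel a specific job.
[]$ scancel <job-id>
# Cancel all of your own queued and running jobs.
[]$ scancel -u <username>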
Submit Jobs: sacct: What happened?
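Once a job has finished (or failed), sacct reports its accounting record; a sketch:
# Summary record for a specific job.
[]$ sacct -j <job-id>
# Pick specific fields, e.g. the state, exit code and elapsed time.
[]$ sacct -j <job-id> --format=JobID,JobName,State,ExitCode,Elapsed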
Submit Jobs: sbatch: Options
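Commonly used #SBATCH directives include the following; these are all standard sbatch options, shown here with placeholder values:
#SBATCH --account=<project-name>      # project to charge (same as -A)
#SBATCH --time=<wall-time>            # wall-time limit (same as -t)
#SBATCH --job-name=<name>             # name shown by squeue/sacct
#SBATCH --nodes=1                     # number of nodes
#SBATCH --ntasks=1                    # number of tasks (processes)
#SBATCH --cpus-per-task=1             # cores per task
#SBATCH --mem=1G                      # memory per node
#SBATCH --partition=<partition-name>  # e.g. a GPU partition
#SBATCH --gres=gpu:1                  # request one GPU (on GPU partitions)
#SBATCH --output=<file-name>          # where stdout is written (%j = job id)
#SBATCH --mail-type=END,FAIL          # email notification events
#SBATCH --mail-user=<email-address>   # where notifications are sent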
Submit Jobs: sbatch: Options: Applied to Example
Extended Example: What Does the Run Look Like?
Exercise: sbatch: Give It A Go
| Workshop Home | Next |