Goal: An introduction to Slurm: how to start interactive sessions, submit jobs, and monitor them.
Workload Managers
Allocates access to the appropriate compute nodes matching your resource requests.
Provides a framework for starting, executing, monitoring, and canceling your jobs.
Queue management and job state notification.
For more details, see the ARCC Slurm wiki pages.
Slurm Related Commands
Core hour usage: chu_user, chu_account
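The following slides walk through the core Slurm commands; each supports a --help option listing its full set of flags:
[]$ salloc --help      # Request an interactive session.
[]$ sbatch --help      # Submit a batch job to the queue.
[]$ squeue --help      # Monitor pending and running jobs.
[]$ scancel --help     # Cancel a queued or running job.
[]$ sacct --help       # Review accounting details of finished jobs.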
Interactive Session: salloc
You’re there doing the work.
Suitable for developing and testing over a few hours.
[]$ salloc --help
# Lots of options.
# Notice short and long form options.
[]$ salloc -A <project-name> -t <wall-time>
# Format for --time: acceptable time formats include "minutes", "minutes:seconds",
# "hours:minutes:seconds", "days-hours", "days-hours:minutes" and "days-hours:minutes:seconds".
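For example, the following wall-time requests (values purely illustrative) all match the accepted formats:
[]$ salloc -A <project-name> -t 30          # 30 minutes
[]$ salloc -A <project-name> -t 2:00:00     # 2 hours
[]$ salloc -A <project-name> -t 1-12        # 1 day and 12 hours (days-hours)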
Interactive Session: salloc: Workshop
You’ll only use the reservation for this (and/or other) workshops. Once you have an account you typically do not need it.
But there are use cases where we can create a specific reservation for you.
[]$ salloc -A arccanetrain -t 1:00 --reservation=<reservation-name>
Interactive Session: salloc: What’s happening?
[]$ salloc -A arccanetrain -t 1:00 --reservation=<reservation-name>
salloc: Granted job allocation 13526337
salloc: Nodes m233 are ready for job
# Make a note of the job id.
# Notice the server/node name has changed.
[arcc-t05@m233 intro_to_hpc]$ squeue -u arcc-t05
    JOBID PARTITION     NAME     USER ST   TIME  NODES NODELIST(REASON)
 13526337     moran interact arcc-t05  R   0:19      1 m233
# For an interactive session: Name = interact
# You have the command-line interactively available to you.
[]$ ...
[]$ squeue -u arcc-t05
    JOBID PARTITION     NAME     USER ST   TIME  NODES NODELIST(REASON)
 13526337     moran interact arcc-t05  R   1:03      1 m233
# Session will automatically time out.
[]$
salloc: Job 13526337 has exceeded its time limit and its allocation has been revoked.
slurmstepd: error: *** STEP 13526337.interactive ON m233 CANCELLED AT 2024-03-22T09:36:53 DUE TO TIME LIMIT ***
exit
srun: Job step aborted: Waiting up to 32 seconds for job step to finish.
Interactive Session: salloc: Finished Early?
[]$ salloc -A arccanetrain -t 1:00 --reservation=<reservation-name>
salloc: Granted job allocation 13526338
salloc: Nodes m233 are ready for job
[arcc-t05@m233 ...]$ # Do stuff…
[]$ exit
exit
salloc: Relinquishing job allocation 13526338
Submit Jobs: sbatch
You submit a job to the queue and walk away.
Monitor its progress/state using command-line and/or email notifications.
Once complete, come back and analyze results.
Submit Jobs: sbatch: Template
#!/bin/bash                                # Shebang indicating this is a bash script.
#SBATCH --account=arccanetrain             # Use #SBATCH to define Slurm related values.
#SBATCH --time=10:00                       # Must define an account and wall-time.
#SBATCH --reservation=<reservation-name>
echo "SLURM_JOB_ID:" $SLURM_JOB_ID         # Can access Slurm related environment variables.
start=$(date +'%D %T')                     # Can call bash commands.
echo "Start:" $start
module load gcc/12.2.0 python/3.10.6       # Load the modules you require for your environment.
python python01.py                         # Call your scripts/commands.
sleep 1m
end=$(date +'%D %T')
echo "End:" $end
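The template calls a python01.py script supplied with the workshop materials. Judging by the output shown on the following slides, a minimal sketch of what it does might look like:
import sys                                   # Standard library only.
print("Python version:", sys.version)        # e.g. 3.10.6 (main, Oct 17 2022, ...) [GCC 12.2.0]
print("Version info:", sys.version_info)     # e.g. sys.version_info(major=3, minor=10, ...)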
Submit Jobs: squeue: What’s happening?
[]$ sbatch run.sh
Submitted batch job 13526340
[]$ squeue -u arcc-t05
    JOBID PARTITION     NAME     USER ST   TIME  NODES NODELIST(REASON)
 13526340     moran   run.sh arcc-t05  R   0:05      1 m233
[]$ ls
python01.py  run.sh  slurm-13526340.out
# You can view this file while the job is still running.
[]$ cat slurm-13526340.out
SLURM_JOB_ID: 13526340
Start: 03/22/24 09:38:36
Python version: 3.10.6 (main, Oct 17 2022, 16:47:32) [GCC 12.2.0]
Version info: sys.version_info(major=3, minor=10, micro=6, releaselevel='final', serial=0)
[]$ squeue -u arcc-t05
    JOBID PARTITION     NAME     USER ST   TIME  NODES NODELIST(REASON)
 13526340     moran   run.sh arcc-t05  R   0:17      1 m233
Submit Jobs: squeue: What’s happening? (continued)
[]$ squeue -u arcc-t05
    JOBID PARTITION     NAME     USER ST   TIME  NODES NODELIST(REASON)
 13526340     moran   run.sh arcc-t05  R   0:29      1 m233
[]$ squeue -u arcc-t05
    JOBID PARTITION     NAME     USER ST   TIME  NODES NODELIST(REASON)
# squeue only shows pending and running jobs.
# If a job is no longer in the queue then it has finished.
# Finished can mean success, failure, timeout... It’s just no longer running.
[]$ cat slurm-13526340.out
SLURM_JOB_ID: 13526340
Start: 03/22/24 09:38:36
Python version: 3.10.6 (main, Oct 17 2022, 16:47:32) [GCC 12.2.0]
Version info: sys.version_info(major=3, minor=10, micro=6, releaselevel='final', serial=0)
End: 03/22/24 09:39:36
Submit Jobs: scancel: Cancel?
[]$ sbatch run.sh
Submitted batch job 13526341
[]$ squeue -u arcc-t05
    JOBID PARTITION     NAME     USER ST   TIME  NODES NODELIST(REASON)
 13526341     moran   run.sh arcc-t05  R   0:03      1 m233
[]$ scancel 13526341
[]$ squeue -u arcc-t05
    JOBID PARTITION     NAME     USER ST   TIME  NODES NODELIST(REASON)
[]$ cat slurm-13526341.out
SLURM_JOB_ID: 13526341
Start: 03/22/24 09:40:09
Python version: 3.10.6 (main, Oct 17 2022, 16:47:32) [GCC 12.2.0]
Version info: sys.version_info(major=3, minor=10, micro=6, releaselevel='final', serial=0)
slurmstepd: error: *** JOB 13526341 ON m233 CANCELLED AT 2024-03-22T09:40:17 ***
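Not shown in the workshop output, but often useful: scancel can also cancel every job belonging to a user in one go.
[]$ scancel -u arcc-t05      # Cancel all of your own pending and running jobs.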
Submit Jobs: sacct: What happened?
[]$ sacct -u arcc-t05 -X
JobID           JobName  Partition    Account  AllocCPUS      State ExitCode
------------ ---------- ---------- ---------- ---------- ---------- --------
13526337     interacti+      moran arccanetr+          1    TIMEOUT      0:0
13526338     interacti+      moran arccanetr+          1  COMPLETED      0:0
13526340         run.sh      moran arccanetr+          1  COMPLETED      0:0
13526341         run.sh      moran arccanetr+          1 CANCELLED+      0:0
# Lots more information:
[]$ sacct --help
[]$ sacct -u arcc-t05 --format="JobID,Partition,nnodes,NodeList,NCPUS,ReqMem,State,Start,Elapsed" -X
JobID         Partition   NNodes        NodeList      NCPUS     ReqMem      State               Start    Elapsed
------------ ---------- -------- --------------- ---------- ---------- ---------- ------------------- ----------
13526337          moran        1            m233          1      1000M    TIMEOUT 2024-03-22T09:35:25   00:01:28
13526338          moran        1            m233          1      1000M  COMPLETED 2024-03-22T09:37:41   00:00:06
13526340          moran        1            m233          1      1000M  COMPLETED 2024-03-22T09:38:35   00:01:01
13526341          moran        1            m233          1      1000M CANCELLED+ 2024-03-22T09:40:08   00:00:09
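To avoid retyping a long field list, sacct also honours the SACCT_FORMAT environment variable, which is equivalent to the --format option:
[]$ export SACCT_FORMAT="JobID,Partition,NNodes,NodeList,NCPUS,ReqMem,State,Start,Elapsed"
[]$ sacct -u arcc-t05 -X     # Same columns as above, without the --format flag.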
Submit Jobs: sbatch: Options
[]$ sbatch --help
#SBATCH --account=arccanetrain         # Required: account/time.
#SBATCH --time=72:00:00
#SBATCH --job-name=workshop            # Job name: helps to identify the job when using squeue.
#SBATCH --nodes=1                      # Options will typically have defaults.
#SBATCH --ntasks-per-node=1            # Request resources in accordance with how you want
#SBATCH --cpus-per-task=1              # to parallelize your job, the type of hardware partition
#SBATCH --partition=teton-gpu          # you need, and whether you require a GPU.
#SBATCH --gres=gpu:1
#SBATCH --mem=100G                     # Request specific memory needs.
#SBATCH --mem-per-cpu=10G
#SBATCH --mail-type=ALL                # Get email notifications of the state of the job.
#SBATCH --mail-user=<email-address>
#SBATCH --output=<prefix>_%A.out       # Define a named output file postfixed with the job id.
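Purely as an illustrative sketch (the account and partition names are taken from the examples above and will differ for your own project), several of these options can be combined into a fuller batch script along these lines:
#!/bin/bash
#SBATCH --account=arccanetrain
#SBATCH --time=01:00:00
#SBATCH --job-name=gpu_example
#SBATCH --nodes=1
#SBATCH --ntasks-per-node=1
#SBATCH --cpus-per-task=4
#SBATCH --partition=teton-gpu
#SBATCH --gres=gpu:1
#SBATCH --mem=16G
#SBATCH --mail-type=ALL
#SBATCH --mail-user=<email-address>
#SBATCH --output=gpu_example_%A.out

echo "SLURM_JOB_ID:" $SLURM_JOB_ID
nvidia-smi    # Illustrative command: report the GPU that was allocated to the job.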