This is a custom Confluence template intended for reuse when creating workshops that ARCC presents on the Wiki. All of the content in these sections is intended to be replaced by the workshop author. The first step in this style guide is to ensure that the page is in wide mode, to maximize the space available for content. The title of the page should match the title of the workshop, and this section should include a quick introduction to the topic, why it is important for ARCC users, and what users should expect to get out of the workshop. Next should be a Table of Contents macro in vertical format. The table serves as an agenda in presenter mode and as navigation otherwise, so that users can find the documentation and jump to what they need to brush up on. Finally, each section should end with a divider to mark the separation between “slides”.

Goal: An introduction to what Slurm is and how to start interactive sessions, submit jobs, and monitor them.


Headers and Sections

...

Code Examples

Two-column tables are a nice way to separate content or background information from a code example on the same “slide”. Please note the table width; it should stop scroll bars from appearing.

...

Bullets are nice to include for distinct points

...

yep

...

they

...

sure

...

Code Block
Please use the "code snippet" in the + button when creating code examples. Also please do not go
past the width of the table. This is to prevent scroll bars appearing













This is the Max number of code lines to show on an example

Straight Code - No context

Code Block
Limit to 16 lines in the example. 














This is the end

Same Thing With Images

...

Two-column tables are a nice way to separate content or background information from an image example on the same “slide”. Please note the table width; it should stop scroll bars from appearing.

  • Bullets are nice to include for distinct points

  • yep

  • they

  • sure

  • are

    This is 14 lines

...

[Image: image-20240514-000033.png]

Alternatively No Table

[Image: image-20240514-000127.png]

Finally The End

...

Link to Previous sub-module or Home Module

...

Workload Managers 

  1. Allocate access to appropriate compute nodes according to your requests.

  2. Provide a framework for starting, executing, monitoring, and even canceling your jobs.

  3. Manage the queue and notify you of job state changes.

...

ARCC: Slurm: Wiki Pages 

...

Slurm Related Commands

...

Interactive Session: salloc

  • You’re there doing the work.

  • Suitable for developing and testing over a few hours.

Code Block
[]$ salloc --help
# Lots of options.
# Notice short and long form options.
[]$ salloc -A <project-name> -t <wall-time>
# Format for --time: acceptable time formats include "minutes", "minutes:seconds",
# "hours:minutes:seconds", "days-hours", "days-hours:minutes" and "days-hours:minutes:seconds".

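The accepted `--time` formats are easier to compare once they are reduced to a common unit. The following is a minimal sketch, not part of Slurm: `slurm_time_to_seconds` is a hypothetical helper name, and it assumes the input is already well formed in one of the six formats listed in the salloc help text.

```shell
#!/bin/bash
# Convert a Slurm --time value to seconds.
# Hypothetical helper for illustration; assumes well-formed input.
slurm_time_to_seconds() {
    local spec=$1 days=0 hours=0 minutes=0 seconds=0
    local a b c
    if [[ $spec == *-* ]]; then
        # days-hours[:minutes[:seconds]]
        days=${spec%%-*}
        IFS=: read -r hours minutes seconds <<< "${spec#*-}"
    else
        IFS=: read -r a b c <<< "$spec"
        if [[ -n $c ]]; then      # hours:minutes:seconds
            hours=$a; minutes=$b; seconds=$c
        elif [[ -n $b ]]; then    # minutes:seconds
            minutes=$a; seconds=$b
        else                      # minutes
            minutes=$a
        fi
    fi
    # 10# forces base 10 so values such as "08" are not read as octal.
    echo $(( 10#${days:-0} * 86400 + 10#${hours:-0} * 3600 \
           + 10#${minutes:-0} * 60 + 10#${seconds:-0} ))
}

slurm_time_to_seconds 1:00       # minutes:seconds       -> 60
slurm_time_to_seconds 72:00:00   # hours:minutes:seconds -> 259200
slurm_time_to_seconds 1-12       # days-hours            -> 129600
```

Note that `1:00` means one minute, not one hour, which is why the workshop sessions using `-t 1:00` time out so quickly.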
...

Interactive Session: salloc: workshop

  • You’ll only use the reservation for this (and/or another) workshop.

  • Once you have your own account, you typically do not need a reservation.

  • But there are use cases where we can create a specific reservation for you.

Code Block
[]$ salloc -A arccanetrain -t 1:00 --reservation=<reservation-name>
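Because the reservation is optional outside of workshops, the same command can be assembled with or without it. This is a small sketch that only prints the command rather than running it. `build_salloc_cmd` is a hypothetical helper, and `my-workshop-res` and `my-project` are placeholder names, not real ARCC values.

```shell
#!/bin/bash
# Assemble (but do not run) an salloc command line.
# build_salloc_cmd is a hypothetical helper for illustration only.
build_salloc_cmd() {
    local account=$1 walltime=$2 reservation=${3:-}
    local cmd=(salloc -A "$account" -t "$walltime")
    # Only add the reservation option when one was supplied.
    if [[ -n $reservation ]]; then
        cmd+=("--reservation=$reservation")
    fi
    echo "${cmd[*]}"
}

# Workshop usage: account, wall-time and a (placeholder) reservation name.
build_salloc_cmd arccanetrain 1:00 my-workshop-res
# Typical usage once you have your own project: no reservation.
build_salloc_cmd my-project 2:00:00
```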

...

Interactive Session: salloc: What’s happening?

Code Block
[]$ salloc -A arccanetrain -t 1:00 --reservation=<reservation-name>
salloc: Granted job allocation 13526337
salloc: Nodes m233 are ready for job
# Make a note of the job id.
# Notice the server/node name has changed.
[arcc-t05@m233 intro_to_hpc]$ squeue -u arcc-t05
             JOBID PARTITION     NAME     USER ST       TIME  NODES NODELIST(REASON)
          13526337     moran interact arcc-t05  R       0:19      1 m233
# For an interactive session: Name = interact
# You have the command-line interactively available to you.
[]$ 
...
[]$ squeue -u arcc-t05
             JOBID PARTITION     NAME     USER ST       TIME  NODES NODELIST(REASON)
          13526337     moran interact arcc-t05  R       1:03      1 m233
# Session will automatically time out
[]$ salloc: Job 13526337 has exceeded its time limit and its allocation has been revoked.
slurmstepd: error: *** STEP 13526337.interactive ON m233 CANCELLED AT 2024-03-22T09:36:53 DUE TO TIME LIMIT ***
exit
srun: Job step aborted: Waiting up to 32 seconds for job step to finish.
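Besides watching the prompt change, a script can tell whether it is running inside an allocation by checking the `SLURM_JOB_ID` environment variable, which Slurm sets for the session. The following is a minimal sketch; `describe_session` is a hypothetical function name.

```shell
#!/bin/bash
# Report whether the current shell is inside a Slurm allocation.
# describe_session is a hypothetical helper for illustration only.
describe_session() {
    if [[ -n ${SLURM_JOB_ID:-} ]]; then
        echo "Inside Slurm job ${SLURM_JOB_ID} on ${HOSTNAME:-unknown}"
    else
        echo "Not inside a Slurm allocation (on ${HOSTNAME:-unknown})"
    fi
}

describe_session
```

Run on a login node this should report that no allocation is active; run inside the interactive session it should report the job id noted earlier.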

...

Interactive Session: salloc: Finished Early?

Code Block
[]$ salloc -A arccanetrain -t 1:00 --reservation=<reservation-name>
salloc: Granted job allocation 13526338
salloc: Nodes m233 are ready for job
[arcc-t05@m233 ...]$ Do stuff…
[]$ exit
exit
salloc: Relinquishing job allocation 13526338

...

Submit Jobs: sbatch

  • You submit a job to the queue and walk away.

  • Monitor its progress/state using command-line and/or email notifications.

  • Once complete, come back and analyze results.

...

Submit Jobs: sbatch: Template

Code Block
#!/bin/bash
# The line above is the shebang indicating this is a bash script; keep it free of trailing text.
#SBATCH --account=arccanetrain            # Use #SBATCH to define Slurm related values.
#SBATCH --time=10:00                      # Must define an account and wall-time.
#SBATCH --reservation=<reservation-name>
echo "SLURM_JOB_ID:" $SLURM_JOB_ID        # Can access Slurm related Environment variables.
start=$(date +'%D %T')                    # Can call bash commands.
echo "Start:" $start
module load gcc/12.2.0 python/3.10.6      # Load the modules you require for your environment.
python python01.py                        # Call your scripts/commands.
sleep 1m
end=$(date +'%D %T')
echo "End:" $end
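The template prints human-readable start and end times. If you also want the elapsed run time, one option is to record epoch seconds and format the difference. This is a sketch under that assumption; `format_elapsed` is a hypothetical helper, and `sleep 1` stands in for the real workload.

```shell
#!/bin/bash
# Format a number of seconds as HH:MM:SS.
# format_elapsed is a hypothetical helper for illustration only.
format_elapsed() {
    local total=$1
    printf '%02d:%02d:%02d\n' \
        $(( total / 3600 )) $(( total % 3600 / 60 )) $(( total % 60 ))
}

start=$(date +%s)     # Seconds since the epoch at the start of the work.
sleep 1               # Stand-in for your scripts/commands.
end=$(date +%s)
echo "Elapsed: $(format_elapsed $(( end - start )))"

format_elapsed 61     # -> 00:01:01
```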

...

Submit Jobs: squeue: What’s happening?

Code Block
[]$ sbatch run.sh
Submitted batch job 13526340
[]$ squeue -u arcc-t05
             JOBID PARTITION     NAME     USER ST       TIME  NODES NODELIST(REASON)
          13526340     moran   run.sh arcc-t05  R       0:05      1 m233
[]$ ls
python01.py  run.sh  slurm-13526340.out
# You can view this file while the job is still running.
[]$ cat slurm-13526340.out
SLURM_JOB_ID: 13526340
Start: 03/22/24 09:38:36
Python version: 3.10.6 (main, Oct 17 2022, 16:47:32) [GCC 12.2.0]
Version info: sys.version_info(major=3, minor=10, micro=6, releaselevel='final', serial=0)
[]$ squeue -u arcc-t05
             JOBID PARTITION     NAME     USER ST       TIME  NODES NODELIST(REASON)
          13526340     moran   run.sh arcc-t05  R       0:17      1 m233

...

Submit Jobs: squeue: What’s happening? (Continued)

Code Block
[]$ squeue -u arcc-t05
             JOBID PARTITION     NAME     USER ST       TIME  NODES NODELIST(REASON)
          13526340     moran   run.sh arcc-t05  R       0:29      1 m233
[]$ squeue -u arcc-t05
             JOBID PARTITION     NAME     USER ST       TIME  NODES NODELIST(REASON)
# squeue only shows pending and running jobs.
# If a job is no longer in the queue then it has finished. 
# Finished can mean success, failure, timeout... It’s just no longer running.
[]$ cat slurm-13526340.out
SLURM_JOB_ID: 13526340
Start: 03/22/24 09:38:36
Python version: 3.10.6 (main, Oct 17 2022, 16:47:32) [GCC 12.2.0]
Version info: sys.version_info(major=3, minor=10, micro=6, releaselevel='final', serial=0)
End: 03/22/24 09:39:36
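Since squeue only lists pending and running jobs, "is my job still in the queue?" reduces to looking for the job id in the first column. The sketch below checks captured squeue-style text so it can run anywhere; on the cluster you would presumably pipe the output of `squeue -u <user>` into it instead. `job_in_queue` is a hypothetical helper name.

```shell
#!/bin/bash
# Succeed if the given job id appears in squeue-style output read from stdin.
# job_in_queue is a hypothetical helper for illustration only.
job_in_queue() {
    awk -v id="$1" 'NR > 1 && $1 == id { found = 1 } END { exit !found }'
}

# Sample output copied from the session shown earlier.
sample_output=$(cat <<'EOF'
             JOBID PARTITION     NAME     USER ST       TIME  NODES NODELIST(REASON)
          13526340     moran   run.sh arcc-t05  R       0:29      1 m233
EOF
)

if job_in_queue 13526340 <<< "$sample_output"; then
    echo "Job 13526340 is still pending or running"
fi
if ! job_in_queue 13526341 <<< "$sample_output"; then
    echo "Job 13526341 is not in the queue"
fi
```

A job that is missing from the queue has finished in some way; it does not tell you whether it succeeded.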

...

Submit Jobs: scancel: Cancel?

Code Block
[]$ sbatch run.sh
Submitted batch job 13526341
[]$ squeue -u arcc-t05
             JOBID PARTITION     NAME     USER ST       TIME  NODES NODELIST(REASON)
          13526341     moran   run.sh arcc-t05  R       0:03      1 m233
[]$ scancel 13526341
[]$ squeue -u arcc-t05
             JOBID PARTITION     NAME     USER ST       TIME  NODES NODELIST(REASON)
[]$ cat slurm-13526341.out
SLURM_JOB_ID: 13526341
Start: 03/22/24 09:40:09
Python version: 3.10.6 (main, Oct 17 2022, 16:47:32) [GCC 12.2.0]
Version info: sys.version_info(major=3, minor=10, micro=6, releaselevel='final', serial=0)
slurmstepd: error: *** JOB 13526341 ON m233 CANCELLED AT 2024-03-22T09:40:17 ***
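To cancel a job from a script you need its id, which sbatch reports in its "Submitted batch job" message. The following sketch extracts the id from that message text. `extract_job_id` is a hypothetical helper, and the final line only prints the scancel command rather than running it.

```shell
#!/bin/bash
# Pull the numeric job id out of sbatch's "Submitted batch job <id>" message.
# extract_job_id is a hypothetical helper for illustration only.
extract_job_id() {
    local message=$1
    local id=${message##* }          # Keep the text after the last space.
    if [[ $id =~ ^[0-9]+$ ]]; then
        echo "$id"
    else
        return 1                     # Message did not end in a job id.
    fi
}

job_id=$(extract_job_id "Submitted batch job 13526341")
echo "scancel $job_id"
```

As an alternative, sbatch has a `--parsable` option that prints the job id without the surrounding text; check `sbatch --help` on your cluster for the exact output format.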

...

Submit Jobs: sacct: What happened?

Code Block
[]$ sacct -u arcc-t05 -X
JobID           JobName  Partition    Account  AllocCPUS      State ExitCode
------------ ---------- ---------- ---------- ---------- ---------- --------
13526337     interacti+      moran arccanetr+          1    TIMEOUT      0:0
13526338     interacti+      moran arccanetr+          1  COMPLETED      0:0
13526340         run.sh      moran arccanetr+          1  COMPLETED      0:0
13526341         run.sh      moran arccanetr+          1 CANCELLED+      0:0
# Lots more information
[]$ sacct --help
[]$ sacct -u arcc-t05 --format="JobID,Partition,nnodes,NodeList,NCPUS,ReqMem,State,Start,Elapsed" -X
JobID         Partition   NNodes        NodeList      NCPUS     ReqMem      State               Start    Elapsed
------------ ---------- -------- --------------- ---------- ---------- ---------- ------------------- ----------
13526337          moran        1            m233          1      1000M    TIMEOUT 2024-03-22T09:35:25   00:01:28
13526338          moran        1            m233          1      1000M  COMPLETED 2024-03-22T09:37:41   00:00:06
13526340          moran        1            m233          1      1000M  COMPLETED 2024-03-22T09:38:35   00:01:01
13526341          moran        1            m233          1      1000M CANCELLED+ 2024-03-22T09:40:08   00:00:09
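With more than a handful of jobs, a per-state tally is often quicker to read than the full table. This sketch counts the State column of the default sacct layout shown earlier, using that sample output as input. `count_states` is a hypothetical helper, and it assumes State is the sixth column, which only holds for the default format.

```shell
#!/bin/bash
# Tally jobs per state from default-format sacct output read from stdin.
# count_states is a hypothetical helper; it assumes State is column 6.
count_states() {
    awk 'NR > 2 && NF >= 6 { count[$6]++ }
         END { for (s in count) print s, count[s] }' | sort
}

# Sample output copied from the sacct call shown earlier.
count_states <<'EOF'
JobID           JobName  Partition    Account  AllocCPUS      State ExitCode
------------ ---------- ---------- ---------- ---------- ---------- --------
13526337     interacti+      moran arccanetr+          1    TIMEOUT      0:0
13526338     interacti+      moran arccanetr+          1  COMPLETED      0:0
13526340         run.sh      moran arccanetr+          1  COMPLETED      0:0
13526341         run.sh      moran arccanetr+          1 CANCELLED+      0:0
EOF
```

The trailing `+` means sacct truncated the field to fit the column width; a wider `--format` field shows the full value.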

...

Submit Jobs: sbatch: Options

Code Block
[]$ sbatch --help
#SBATCH --account=arccanetrain          # Required: account/time
#SBATCH --time=72:00:00
#SBATCH --job-name=workshop             # Job name: Help to identify when using squeue.
#SBATCH --nodes=1                       # Options will typically have defaults.
#SBATCH --tasks-per-node=1              # Request resources in accordance to how you want
#SBATCH --cpus-per-task=1               # to parallelize your job, type of hardware partition
#SBATCH --partition=teton-gpu           # and if you require a GPU.
#SBATCH --gres=gpu:1
#SBATCH --mem=100G                      # Request specific memory needs.
#SBATCH --mem-per-cpu=10G
#SBATCH --mail-type=ALL                 # Get email notifications of the state of the job.
#SBATCH --mail-user=<email-address>
#SBATCH --output=<prefix>_%A.out        # Define a named output file postfixed with the job id.
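Since the account and wall-time are marked as required, a quick pre-submission check can catch a script that omits them. The following is a minimal sketch, not an ARCC or Slurm tool: `check_required_directives` is a hypothetical helper that looks only for `#SBATCH` lines using `--account=`/`-A` and `--time=`/`-t`.

```shell
#!/bin/bash
# Check that a batch script defines an account and a wall-time.
# check_required_directives is a hypothetical helper for illustration only.
check_required_directives() {
    local script=$1 missing=0
    if ! grep -Eq '^#SBATCH[[:space:]]+(--account=|-A[[:space:]])' "$script"; then
        echo "missing: --account"
        missing=1
    fi
    if ! grep -Eq '^#SBATCH[[:space:]]+(--time=|-t[[:space:]])' "$script"; then
        echo "missing: --time"
        missing=1
    fi
    return "$missing"
}

# Example: a script that forgot to set a wall-time.
tmp=$(mktemp)
cat > "$tmp" <<'EOF'
#!/bin/bash
#SBATCH --account=arccanetrain
#SBATCH --job-name=workshop
EOF
check_required_directives "$tmp" || echo "script is not ready to submit"
rm -f "$tmp"
```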

...