This is a custom Confluence template intended to be re-used when creating workshops presented by ARCC on the Wiki. All of the content in these sections is intended to be replaced by the author of the workshop. The first step in this style guide is to ensure that the page is in wide mode to maximize the real estate for content when possible. The title of the page should be the same as the title of the workshop, and this section should include a quick intro to the topic, why it’s important for ARCC users, and what users should expect to get out of this workshop. Next should be a Table of Contents macro in vertical format. The table is intended to be used as an agenda for presenter mode as well as navigation for non-presenting viewing, so that users can find the documentation and navigate to what they need to brush up on. Finally, at the end of each section, there should be a divider to indicate the separation of “slides”.
Table of Contents
Headers and Sections
...
Code Examples
Two-column tables are a nice way to separate content/background info from a code example on the same “slide”. Please notice the table width; this should stop scroll bars from appearing.
...
Bullets are nice to include for distinct points:
- yep
- they
- sure
- are
...
Code Block |
---|
Please use the "code snippet" option in the + insert menu when creating code examples. Also, please do not go past the width of the table; this is to prevent scroll bars from appearing.
This is the max number of code lines to show in an example. |
Straight Code - No context
...
Goal: An introduction to Slurm and how to start interactive sessions, submit jobs, and monitor them.
Table of Contents
...
Workload Managers
Info |
---|
|
...
ARCC: Slurm: Wiki Pages
Info |
---|
A quick read can be found under: Slurm: Getting Started-Jobs and Nodes. ARCC also hosts a number of more detailed and specific wiki pages: |
...
Interactive Session: salloc
Info |
---|
An interactive session means you’re there doing the work. Suitable for developing and testing over a few hours. |
Code Block |
---|
[]$ salloc --help
[]$ man salloc
# Lots of options.
# The bare minimum.
# This will provide the defaults of one node, one core and 1G of memory.
[]$ salloc -A <project-name> -t <wall-time> |
Info |
---|
|
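Info |
---|
The defaults can be overridden with additional options. A minimal sketch, with illustrative placeholder values rather than recommendations: |
Code Block |
---|
# Request four cores and 8G of memory instead of the defaults (example values).
[]$ salloc -A <project-name> -t <wall-time> --cpus-per-task=4 --mem=8G |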
...
Interactive Session: salloc: Workshop
Info |
---|
You’ll only use the --reservation option within this workshop. Once you have an account you typically do not need it. But there are use cases when we can create a specific reservation for you, which itself might require a specific --partition. |
Code Block |
---|
# CPU only compute node.
[]$ salloc -A <project-name> -t 1:00 --reservation=<reservation-name>
# GPU partition/compute node.
[]$ salloc -A <project-name> -t 1:00 --reservation=<reservation-name> --partition=<partition-name> |
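Info |
---|
If you are unsure which partitions are available, the standard Slurm sinfo command lists them. A general sketch, not ARCC-specific output: |
Code Block |
---|
# List partitions along with their nodes and states.
[]$ sinfo
# One summarized line per partition.
[]$ sinfo -s |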
...
Interactive Session: squeue: What’s happening?
Info |
---|
Use the squeue command to view the list of jobs queued and running on the cluster. This list can be 10s/100s/1000s of lines long. Use the -u <username> option to filter it down to just your jobs. |
Code Block |
---|
[]$ salloc -A <project-name> -t 1:00 --reservation=<reservation-name>
salloc: Granted job allocation 13526337
salloc: Nodes m233 are ready for job
# Make a note of the job id.
# Notice the server/node name has changed.
[arcc-t05@m233 intro_to_hpc]$ squeue -u arcc-t05
JOBID PARTITION NAME USER ST TIME NODES NODELIST(REASON)
13526337 moran interact arcc-t05 R 0:19 1 m233
# For an interactive session: Name = interact
# You have the command-line interactively available to you.
[]$
...
[]$ squeue -u arcc-t05
JOBID PARTITION NAME USER ST TIME NODES NODELIST(REASON)
13526337 moran interact arcc-t05 R 1:03 1 m233
# Session will automatically time out
[]$ salloc: Job 13526337 has exceeded its time limit and its allocation has been revoked.
slurmstepd: error: *** STEP 13526337.interactive ON m233 CANCELLED AT 2024-03-22T09:36:53 DUE TO TIME LIMIT ***
exit
srun: Job step aborted: Waiting up to 32 seconds for job step to finish. |
...
Interactive Session: salloc: Finished Early?
Info |
---|
If you finish using an interactive session early, call the exit command. This will stop the interactive session, release its associated resources back to the cluster, and make them available for pending jobs. |
Code Block |
---|
[]$ salloc -A <project-name> -t 1:00 --reservation=<reservation-name>
salloc: Granted job allocation 13526338
salloc: Nodes m233 are ready for job
[arcc-t05@m233 ...]$ Do stuff…
[]$ exit
exit
salloc: Relinquishing job allocation 13526338 |
Info |
---|
Closing the session will also release the job. |
...
Exercise: salloc: Give It A Go
From a login node, create some interactive sessions using salloc.
- Try different wall times (see the sketch after this list):
  - Short times to experience an automatic timeout.
  - Longer times so you can call squeue and see your job in the queue.
- Notice how the command-line prompt changes.
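Info |
---|
A possible starting point for this exercise, using placeholder values that you should replace with your own project name and times: |
Code Block |
---|
# A short wall time (one minute) to see the automatic timeout.
[]$ salloc -A <project-name> -t 1:00
# A longer wall time, leaving room to run squeue from a second login-node session.
[]$ salloc -A <project-name> -t 30:00
# From that second session, watch your job:
[]$ squeue -u <username>
# Add --reservation=<reservation-name> if one was provided for the workshop. |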
...
Submit Jobs: sbatch
Info |
---|
|
...
Submit Jobs: sbatch: Example
Info |
---|
The following is an example bash submission script that we will use to submit a job to the cluster. It uses a short test python file defined here: python script. |
Code Block |
---|
#!/bin/bash
# Shebang indicating this is a bash script.
# Do NOT put a comment after the shebang; this will cause an error.
#SBATCH --account=<project-name> # Use #SBATCH to define Slurm related values.
#SBATCH --time=10:00 # Must define an account and wall-time.
#SBATCH --reservation=<reservation-name>
echo "SLURM_JOB_ID:" $SLURM_JOB_ID # Can access Slurm related Environment variables.
start=$(date +'%D %T') # Can call bash commands.
echo "Start:" $start
module purge
module load gcc/14.2.0 python/3.10.6 # Load the modules you require for your environment.
python python01.py # Call your scripts/commands.
sleep 1m
end=$(date +'%D %T')
echo "End:" $end |
Note |
---|
|
...
Submit Jobs: squeue: What’s happening?
Info |
---|
Remember: Use the squeue command to monitor the state of your submitted jobs. |
Code Block |
---|
[]$ sbatch run.sh
Submitted batch job 13526340
[]$ squeue -u <username>
JOBID PARTITION NAME USER ST TIME NODES NODELIST(REASON)
13526340 moran run.sh <username> R 0:05 1 m233
[]$ ls
python01.py run.sh slurm-13526340.out
[]$ cat slurm-13526340.out
SLURM_JOB_ID: 13526340
Start: 03/22/24 09:38:36
Python version: 3.10.6 (main, Sep 3 2024, 15:13:56) [GCC 14.2.0]
Version info: sys.version_info(major=3, minor=10, micro=6, releaselevel='final', serial=0)
[]$ squeue -u <username>
JOBID PARTITION NAME USER ST TIME NODES NODELIST(REASON)
13526340 moran run.sh <username> R 0:17 1 m233 |
Info |
---|
|
...
Submit Jobs: squeue: What’s happening? Continued
Info |
---|
The squeue command only shows jobs that are pending or running. If a job is no longer in the queue then it has finished. |
Code Block |
---|
[]$ squeue -u <username>
JOBID PARTITION NAME USER ST TIME NODES NODELIST(REASON)
13526340 moran run.sh <username> R 0:29 1 m233
[]$ squeue -u <username>
JOBID PARTITION NAME USER ST TIME NODES NODELIST(REASON)
[]$ cat slurm-13526340.out
SLURM_JOB_ID: 13526340
Start: 03/22/24 09:38:36
Python version: 3.10.6 (main, Sep 3 2024, 15:13:56) [GCC 14.2.0]
Version info: sys.version_info(major=3, minor=10, micro=6, releaselevel='final', serial=0)
End: 03/22/24 09:39:36 |
Info |
---|
|
...
More squeue Information
Info |
---|
For more information see the main Slurm squeue page and use squeue --help / man squeue. |
Info |
---|
For example, using the --Format option to define which columns to display: |
Code Block |
---|
[]$ squeue -u <username> --Format="Account,UserName,JobID,SubmitTime,StartTime,TimeLeft"
ACCOUNT USER JOBID SUBMIT_TIME START_TIME TIME_LEFT
<project-name> <username> 1795458 2024-08-14T10:31:07 2024-08-14T10:31:09 6-04:42:51
<project-name> <username> 1795453 2024-08-14T10:31:06 2024-08-14T10:31:07 6-04:42:49
<project-name> <username> 1795454 2024-08-14T10:31:06 2024-08-14T10:31:07 6-04:42:49
... |
Info |
---|
There are various other time related columns beyond the SubmitTime, StartTime and TimeLeft shown above.
There are lots of other columns that can be defined, including ones related to resources (nodes, cores, memory) that have been specifically allocated (see the sketch below). |
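Info |
---|
As an illustrative sketch (column names taken from the standard squeue documentation, output not shown): |
Code Block |
---|
# Show node, core, memory and time-limit details for your queued jobs.
[]$ squeue -u <username> --Format="JobID,NumNodes,NumCPUs,MinMemory,TimeLimit" |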
...
Submission from your Current Working Directory
Info |
---|
Remember from Linux that your current location is your Current Working Directory - abbreviated to CWD. By default, Slurm will look for files, and write output, from the folder you submitted your script from, i.e. your CWD. In the example above, if I called sbatch run.sh from the intro_to_hpc folder, then the slurm-<job-id>.out file was written into that same folder. Within the submission script you can define paths (absolute/relative) to other locations. |
Info |
---|
You can submit a script from any of your allowed locations, but you need to manage and describe the paths to scripts, data and output appropriately. |
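Info |
---|
A minimal sketch of the idea, using hypothetical folder names: |
Code Block |
---|
# Submit from a project folder: slurm-<job-id>.out is written here (the CWD).
[]$ cd /project/<project-name>/workshop
[]$ sbatch run.sh
# Inside run.sh you can still reference other locations explicitly, e.g.:
#   python /project/<project-name>/scripts/python01.py   # absolute path
#   python ../scripts/python01.py                        # relative to the CWD |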
...
Submit Jobs: scancel: Cancel?
Info |
---|
If you have submitted a job, and for whatever reason you want/need to stop it early, then use the scancel <job-id> command. This will stop the job at its current point within the computation, and return any associated resources back to the cluster. |
Code Block |
---|
[]$ sbatch run.sh
Submitted batch job 13526341
[]$ squeue -u <username>
JOBID PARTITION NAME USER ST TIME NODES NODELIST(REASON)
13526341 moran run.sh <username> R 0:03 1 m233
[]$ scancel 13526341
[]$ squeue -u <username>
JOBID PARTITION NAME USER ST TIME NODES NODELIST(REASON)
[]$ cat slurm-13526341.out
SLURM_JOB_ID: 13526341
Start: 03/22/24 09:40:09
Python version: 3.10.6 (main, Sep 3 2024, 15:13:56) [GCC 14.2.0]
Version info: sys.version_info(major=3, minor=10, micro=6, releaselevel='final', serial=0)
slurmstepd: error: *** JOB 13526341 ON m233 CANCELLED AT 2024-03-22T09:40:17 *** |
Note |
---|
If you know your job no longer needs to be running please cancel it to free up resources - be a good cluster citizen. |
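Info |
---|
scancel can also select jobs more broadly. The following options are standard Slurm, shown with placeholder values: |
Code Block |
---|
# Cancel a single job by id.
[]$ scancel <job-id>
# Cancel all of your own jobs.
[]$ scancel -u <username>
# Cancel only your jobs that have a given name.
[]$ scancel -u <username> --name=<job-name> |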
...
Submit Jobs: sacct: What happened?
Info |
---|
Use the sacct command to see what happened with jobs that have finished. By default this will only list jobs from midnight of that day. View the main Slurm sacct page for the full list of options. It too has a --format option to define which columns to display. |
Code Block |
---|
[]$ sacct -u <username> -X
JobID JobName Partition Account AllocCPUS State ExitCode
------------ ---------- ---------- ---------- ---------- ---------- --------
13526337 interacti+ moran arccanetr+ 1 TIMEOUT 0:0
13526338 interacti+ moran arccanetr+ 1 COMPLETED 0:0
13526340 run.sh moran arccanetr+ 1 COMPLETED 0:0
13526341 run.sh moran arccanetr+ 1 CANCELLED+ 0:0
[]$ sacct -u <username> --format="JobID,Partition,nnodes,NodeList,NCPUS,ReqMem,State,Start,Elapsed" -X
JobID Partition NNodes NodeList NCPUS ReqMem State Start Elapsed
------------ ---------- -------- --------------- ---------- ---------- ---------- ------------------- ----------
13526337 moran 1 m233 1 1000M TIMEOUT 2024-03-22T09:35:25 00:01:28
13526338 moran 1 m233 1 1000M COMPLETED 2024-03-22T09:37:41 00:00:06
13526340 moran 1 m233 1 1000M COMPLETED 2024-03-22T09:38:35 00:01:01
13526341 moran 1 m233 1 1000M CANCELLED+ 2024-03-22T09:40:08 00:00:09 |
Info |
---|
For more information see the main Slurm sacct page and use sacct --help / man sacct. |
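Info |
---|
Since sacct only lists jobs from midnight of the current day by default, the standard --starttime option lets you look further back. A sketch with a placeholder date: |
Code Block |
---|
# List your jobs since the 1st of March 2024.
[]$ sacct -u <username> -X --starttime=2024-03-01 |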
...
Submit Jobs: sbatch: Options
Info |
---|
Here are some of the common options available: |
Code Block |
---|
[]$ sbatch --help
#SBATCH --account=<project-name> # Required: account/time
#SBATCH --time=72:00:00
#SBATCH --job-name=workshop # Job name: Help to identify when using squeue.
#SBATCH --nodes=1 # Options will typically have defaults.
#SBATCH --tasks-per-node=1 # Request resources in accordance to how you want
#SBATCH --cpus-per-task=1 # to parallelize your job, type of hardware partition
#SBATCH --partition=mb # and if you require a GPU.
#SBATCH --gres=gpu:1
#SBATCH --mem=100G # Request specific memory needs.
#SBATCH --mem-per-cpu=10G
#SBATCH --mail-type=ALL # Get email notifications of the state of the job.
#SBATCH --mail-user=<email-address>
#SBATCH --output=<prefix>_%A.out # Define a named output file postfixed with the job id. |
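Info |
---|
Options can also be given on the sbatch command line, where they override the matching #SBATCH lines within the script - standard Slurm behavior, shown here with placeholder values: |
Code Block |
---|
# Override the job name and wall time defined inside run.sh.
[]$ sbatch --job-name=test02 --time=30:00 run.sh |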
Info |
---|
...
Submit Jobs: sbatch: Options: Applied to Example
Info |
---|
Let’s take the previous example, and add some of the additional options: |
Code Block |
---|
#!/bin/bash
#SBATCH --account=<project-name>
#SBATCH --time=10:00
#SBATCH --reservation=<reservation-name>
#SBATCH --job-name=pytest
#SBATCH --nodes=1
#SBATCH --cpus-per-task=1
#SBATCH --mail-type=ALL
#SBATCH --mail-user=<email-address>
#SBATCH --output=slurms/pyresults_%A.out
echo "SLURM_JOB_ID:" $SLURM_JOB_ID   # Can access Slurm related Environment variables.
start=$(date +'%D %T')               # Can call bash commands.
echo "Start:" $start
module purge
module load gcc/14.2.0 python/3.10.6 # Load the modules you require for your environment.
python python01.py                   # Call your scripts/commands.
sleep 1m
end=$(date +'%D %T')
echo "End:" $end |
Same Thing With Images
...
Two-column tables are a nice way to separate content/background info from an image example on the same “slide”. Please notice the table width; this should stop scroll bars from appearing.
Bullets are nice to include for distinct points:
- yep
- they
- sure
- are
This is 14 lines
Alternatively No Table
Finally The End
This is the end
...
Link to Previous sub-module or Home Module
...
Info |
---|
Notice:
|
...
Extended Example: What Does the Run Look Like?
Info |
---|
With the above settings (written into a file called run.sh), the run proceeds as follows: |
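Info |
---|
As a hedged sketch of what such a run might look like - the job id, node and file names below are made up for illustration: |
Code Block |
---|
[]$ sbatch run.sh
Submitted batch job 13526350
[]$ squeue -u <username>
JOBID PARTITION     NAME       USER ST  TIME NODES NODELIST(REASON)
13526350  moran   pytest <username>  R  0:08     1 m233
# Output goes into the slurms folder, as defined by --output.
# The slurms folder must already exist for the output file to be written.
[]$ ls slurms/
pyresults_13526350.out |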
Info |
---|
In my inbox, I also received two emails with the subjects:
|
...
Exercise: sbatch: Give It A Go
Using the script examples (adjust where appropriate), try submitting some jobs.
- Once submitted, monitor the jobs (from a different session) using the squeue command.
- Track the job ids, and try changing the job name to distinguish jobs when viewing the pending/running list.
- Cancel some of the jobs.
- Maybe try increasing the sleep value to be longer than the requested wall time, to trigger a timeout (a sketch follows below).
- Once they’ve completed, run sacct to view the finished jobs, and look at their state.
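Info |
---|
A possible sketch for triggering a timeout - the job name is hypothetical and the sleep deliberately exceeds the wall time: |
Code Block |
---|
#!/bin/bash
#SBATCH --account=<project-name>
#SBATCH --time=2:00           # Two-minute wall time.
#SBATCH --job-name=timeout01  # Hypothetical name, easy to spot in squeue.
echo "SLURM_JOB_ID:" $SLURM_JOB_ID
sleep 5m                      # Longer than the wall time, forcing a TIMEOUT state. |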
...
| Workshop Home | Next |