Goal: An introduction to Slurm: how to start interactive sessions, submit jobs, and monitor them.
...
Interactive Session: salloc
Info |
---|
You’re there doing the work. Suitable for developing and testing over a few hours. |
Code Block |
---|
[]$ salloc --help
[]$ man salloc                              # Lots of options.

# The bare minimum.
# This will provide the defaults of one node, one core and 1G of memory.
[]$ salloc -A <project-name> -t <wall-time>
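If the defaults are not enough, resources can be requested explicitly. A minimal sketch using standard salloc options (the values shown are illustrative, not workshop defaults):

Code Block |
---|
# Request 4 cores and 8G of memory for two hours.
[]$ salloc -A <project-name> -t 2:00:00 --cpus-per-task=4 --mem=8G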
...
Interactive Session: salloc
: workshop
Info |
---|
You’ll only use the --reservation option during the workshop. Once you have an account you typically do not need it. But there are use cases where we can create a specific reservation for you, which itself might require a specific partition. |
Code Block |
---|
# CPU only compute node.
[]$ salloc -A <project-name> -t 1:00 --reservation=<reservation-name>

# GPU partition/compute node.
[]$ salloc -A <project-name> -t 1:00 --reservation=<reservation-name> --partition=<partition-name>
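If you are unsure which partitions exist on the system, sinfo will list them. A quick sketch (partition names and states are site specific):

Code Block |
---|
# Summarise the available partitions and their node counts/states.
[]$ sinfo -s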
...
Info |
---|
Closing the session will also release the job. |
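For example, leaving the shell ends the allocation (the exact message Slurm prints may vary by version):

Code Block |
---|
[]$ exit
salloc: Relinquishing job allocation <job-id>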
...
Exercise: salloc
: Give It A Go
- From a login node, create some interactive sessions using salloc (see the sketch after this list).
- Try different wall times:
  - Short times to experience an automatic timeout.
  - Longer times so you can call squeue and see your job in the queue.
- Notice how the command-line prompt changes.
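A possible way to work through the exercise (a sketch only; the project name is a placeholder and the wall times are illustrative):

Code Block |
---|
# A short wall time (one minute) to experience the automatic timeout.
[]$ salloc -A <project-name> -t 1:00

# A longer wall time; from within the session, check the queue.
[]$ salloc -A <project-name> -t 30:00
[]$ squeue -u <username>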
...
Submit Jobs: sbatch
Info |
---|
Use sbatch to submit a batch script to the queue. The job runs unattended once resources become available, making it suitable for longer running work. |
...
Submit Jobs: sbatch
: Example
...
Code Block |
---|
#!/bin/bash
# Shebang indicating this is a bash script.
# Do NOT put a comment after the shebang, this will cause an error.

#SBATCH --account=<project-name>            # Use #SBATCH to define Slurm related values.
#SBATCH --time=10:00                        # Must define an account and wall-time.
#SBATCH --reservation=<reservation-name>

echo "SLURM_JOB_ID:" $SLURM_JOB_ID          # Can access Slurm related environment variables.
start=$(date +'%D %T')                      # Can call bash commands.
echo "Start:" $start

module purge
module load gcc/13.2.0 python/3.10.6        # Load the modules you require for your environment.

python python01.py                          # Call your scripts/commands.
sleep 1m

end=$(date +'%D %T')
echo "End:" $end
...
Code Block |
---|
[]$ sbatch run.sh
Submitted batch job 13526340

[]$ squeue -u <username>
     JOBID PARTITION     NAME       USER ST  TIME  NODES NODELIST(REASON)
  13526340     moran   run.sh <username>  R  0:05      1 m233

[]$ ls
python01.py  run.sh  slurm-13526340.out

[]$ cat slurm-13526340.out
SLURM_JOB_ID: 13526340
Start: 03/22/24 09:38:36
Python version: 3.10.6 (main, Oct 17 2022, 16:47:32) [GCC 12.2.0]
Version info: sys.version_info(major=3, minor=10, micro=6, releaselevel='final', serial=0)

[]$ squeue -u <username>
     JOBID PARTITION     NAME       USER ST  TIME  NODES NODELIST(REASON)
  13526340     moran   run.sh <username>  R  0:17      1 m233
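Not part of the workshop output above, but while a job is pending or running you can also inspect its full record with the standard scontrol command; a sketch, reusing the job id from the example:

Code Block |
---|
# Show everything Slurm knows about a job (state, nodes, limits, paths, ...).
[]$ scontrol show job 13526340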
...
Submit Jobs: squeue
: What’s happening? Continued
Info |
---|
The squeue command only lists jobs that are still in the queue (e.g. pending or running). If a job is no longer in the queue then it has finished. |
Code Block |
---|
[]$ squeue -u <username>
     JOBID PARTITION     NAME       USER ST  TIME  NODES NODELIST(REASON)
  13526340     moran   run.sh <username>  R  0:29      1 m233

[]$ squeue -u <username>
     JOBID PARTITION     NAME       USER ST  TIME  NODES NODELIST(REASON)

[]$ cat slurm-13526340.out
SLURM_JOB_ID: 13526340
Start: 03/22/24 09:38:36
Python version: 3.10.6 (main, Oct 17 2022, 16:47:32) [GCC 12.2.0]
Version info: sys.version_info(major=3, minor=10, micro=6, releaselevel='final', serial=0)
End: 03/22/24 09:39:36
...
Code Block |
---|
[]$ sbatch run.sh
Submitted batch job 13526341

[]$ squeue -u <username>
     JOBID PARTITION     NAME       USER ST  TIME  NODES NODELIST(REASON)
  13526341     moran   run.sh <username>  R  0:03      1 m233

[]$ scancel 13526341

[]$ squeue -u <username>
     JOBID PARTITION     NAME       USER ST  TIME  NODES NODELIST(REASON)

[]$ cat slurm-13526341.out
SLURM_JOB_ID: 13526341
Start: 03/22/24 09:40:09
Python version: 3.10.6 (main, Oct 17 2022, 16:47:32) [GCC 12.2.0]
Version info: sys.version_info(major=3, minor=10, micro=6, releaselevel='final', serial=0)
slurmstepd: error: *** JOB 13526341 ON m233 CANCELLED AT 2024-03-22T09:40:17 ***
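Jobs can also be cancelled in bulk rather than one id at a time. A sketch using standard scancel options (the job name is a placeholder):

Code Block |
---|
# Cancel every job you own.
[]$ scancel -u <username>

# Cancel only the jobs with a given name.
[]$ scancel --name=<job-name>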
...
Code Block |
---|
#!/bin/bash
#SBATCH --account=<project-name>
#SBATCH --time=10:00
#SBATCH --reservation=<reservation-name>
#SBATCH --job-name=pytest
#SBATCH --nodes=1
#SBATCH --cpus-per-task=1
#SBATCH --mail-type=ALL
#SBATCH --mail-user=<email-address>
#SBATCH --output=slurms/pyresults_%A.out

echo "SLURM_JOB_ID:" $SLURM_JOB_ID          # Can access Slurm related environment variables.
start=$(date +'%D %T')                      # Can call bash commands.
echo "Start:" $start

module purge
module load gcc/13.2.0 python/3.10.6        # Load the modules you require for your environment.

python python01.py                          # Call your scripts/commands.
sleep 1m

end=$(date +'%D %T')
echo "End:" $end
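Note that --output points into a slurms/ subdirectory. Slurm does not normally create missing output directories, so the job output will be lost (or the job will fail) if the directory does not exist. A small precaution before submitting:

Code Block |
---|
# Make sure the output directory exists before submitting.
[]$ mkdir -p slurms
[]$ sbatch run.sh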
...
Info |
---|
In my inbox, I also received two emails: one when the job began and one when it ended. |
...
Exercise: sbatch
: Give It A Go
- Using the script examples (adjust where appropriate), try submitting some jobs.
- Once submitted, monitor the jobs from a different session using the squeue command.
- Track the job ids, and try changing the job name to distinguish jobs when viewing the pending/running queue.
- Cancel some of the jobs.
- Maybe try increasing the sleep value to be longer than the requested wall time to trigger a timeout.
- Once they've completed, run sacct to view the finished jobs and look at their state (see the sketch after this list).
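A possible way to check the finished jobs (a sketch; the format fields are standard sacct column names, and the job id is a placeholder):

Code Block |
---|
# List your recent jobs with their final state and elapsed time.
[]$ sacct --format=JobID,JobName,State,Elapsed,ExitCode

# Or look at a single job by id.
[]$ sacct -j <job-id> --format=JobID,JobName,State,Elapsed,ExitCode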
...
...