Goal: An introduction to Slurm: how to start interactive sessions, submit jobs, and monitor them.
...
Info |
---|
|
Code Block |
---|
[]$ salloc -A <project-name> -t 1:00 --reservation=<reservation-name> |
...
Interactive Session: squeue
: What’s happening?
Code Block |
---|
[]$ salloc -A <project-name> -t 1:00 --reservation=<reservation-name>
salloc: Granted job allocation 13526337
salloc: Nodes m233 are ready for job
# Make a note of the job id.
# Notice the server/node name has changed.
[arcc-t05@m233 intro_to_hpc]$ squeue -u arcc-t05
  JOBID PARTITION     NAME     USER ST   TIME  NODES NODELIST(REASON)
13526337     moran interact arcc-t05  R   0:19      1 m233
# For an interactive session: Name = interact
# You have the command-line interactively available to you.
[]$ ...
[]$ squeue -u arcc-t05
  JOBID PARTITION     NAME     USER ST   TIME  NODES NODELIST(REASON)
13526337     moran interact arcc-t05  R   1:03      1 m233
# Session will automatically time out
[]$ salloc: Job 13526337 has exceeded its time limit and its allocation has been revoked.
slurmstepd: error: *** STEP 13526337.interactive ON m233 CANCELLED AT 2024-03-22T09:36:53 DUE TO TIME LIMIT ***
exit
srun: Job step aborted: Waiting up to 32 seconds for job step to finish. |
...
Interactive Session: salloc
: Finished Early?
Code Block |
---|
[]$ salloc -A <project-name> -t 1:00 --reservation=<reservation-name>
salloc: Granted job allocation 13526338
salloc: Nodes m233 are ready for job
[arcc-t05@m233 ...]$ # Do stuff…
[]$ exit
exit
salloc: Relinquishing job allocation 13526338 |
...
Info |
---|
|
...
Submit Jobs: sbatch
...
: Example
Info |
---|
The following is an example script that we will use to submit a job to the cluster. It uses a short test python file defined here: python script. |
Code Block |
---|
#!/bin/bash                          # Shebang indicating this is a bash script.
                                     # Do NOT put a comment after the shebang, this will cause an error.
#SBATCH --account=<project-name>     # Use #SBATCH to define Slurm related values.
#SBATCH --time=10:00                 # Must define an account and wall-time.
#SBATCH --reservation=<reservation-name>

echo "SLURM_JOB_ID:" $SLURM_JOB_ID   # Can access Slurm related environment variables.
start=$(date +'%D %T')               # Can call bash commands.
echo "Start:" $start

module purge
module load gcc/13.2.0 python/3.10.6 # Load the modules you require for your environment.

python python01.py                   # Call your scripts/commands.
sleep 1m

end=$(date +'%D %T')
echo "End:" $end |
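As a sketch (not part of the original script), the same idea can be extended to report elapsed wall time in seconds by capturing epoch seconds with `date +%s` instead of a formatted timestamp:

```shell
# Sketch: measure elapsed wall time in seconds (hypothetical addition to the script).
start_s=$(date +%s)        # seconds since the Unix epoch at start
sleep 2                    # stand-in for the real work (e.g. python python01.py)
end_s=$(date +%s)          # epoch seconds at end
echo "Elapsed: $((end_s - start_s)) seconds"
```

This is handy when comparing runs, since you don't have to subtract two `%D %T` timestamps by hand.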
...
Code Block |
---|
[]$ sbatch run.sh
Submitted batch job 13526340
[]$ squeue -u <username>
  JOBID PARTITION     NAME     USER ST   TIME  NODES NODELIST(REASON)
13526340     moran   run.sh arcc-t05  R   0:05      1 m233
[]$ ls
python01.py  run.sh  slurm-13526340.out
[]$ cat slurm-13526340.out
SLURM_JOB_ID: 13526340
Start: 03/22/24 09:38:36
Python version: 3.10.6 (main, Oct 17 2022, 16:47:32) [GCC 12.2.0]
Version info: sys.version_info(major=3, minor=10, micro=6, releaselevel='final', serial=0)
[]$ squeue -u <username>
  JOBID PARTITION     NAME     USER ST   TIME  NODES NODELIST(REASON)
13526340     moran   run.sh arcc-t05  R   0:17      1 m233 |
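While a job is running you can keep re-running `cat`, or follow the output file live with `tail -f slurm-<jobid>.out` (Ctrl-C stops following). A minimal offline sketch of inspecting just the end of an output file (the `slurm-demo.out` file below is a stand-in created for the demo, not a real job's output):

```shell
# Sketch: look at only the last line of a job's output file.
# slurm-demo.out is a made-up stand-in; on the cluster you would use
# the real slurm-<jobid>.out (e.g. tail -f slurm-13526340.out to follow it live).
printf 'SLURM_JOB_ID: 13526340\nStart: 03/22/24 09:38:36\n' > slurm-demo.out
tail -n 1 slurm-demo.out   # → Start: 03/22/24 09:38:36
```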
Info |
---|
|
...
Submit Jobs: squeue
: What’s happening? Continued
...
Code Block |
---|
[]$ squeue -u <username>
  JOBID PARTITION     NAME     USER ST   TIME  NODES NODELIST(REASON)
13526340     moran   run.sh arcc-t05  R   0:29      1 m233
[]$ squeue -u <username>
  JOBID PARTITION     NAME     USER ST   TIME  NODES NODELIST(REASON)
# An empty queue means the job has finished.
[]$ cat slurm-13526340.out
SLURM_JOB_ID: 13526340
Start: 03/22/24 09:38:36
Python version: 3.10.6 (main, Oct 17 2022, 16:47:32) [GCC 12.2.0]
Version info: sys.version_info(major=3, minor=10, micro=6, releaselevel='final', serial=0)
End: 03/22/24 09:39:36 |
...
Code Block |
---|
# Lots more information:
[]$ squeue --help
[]$ man squeue
# Display more columns:
# For example, how much of your requested wall time remains: TimeLeft
[]$ squeue -u <username> --Format="Account,UserName,JobID,SubmitTime,StartTime,TimeLeft"
ACCOUNT             USER                JOBID               SUBMIT_TIME         START_TIME          TIME_LEFT
arccanetrain        arcc-t05            1795458             2024-08-14T10:31:07 2024-08-14T10:31:09 6-04:42:51
arccanetrain        arcc-t05            1795453             2024-08-14T10:31:06 2024-08-14T10:31:07 6-04:42:49
arccanetrain        arcc-t05            1795454             2024-08-14T10:31:06 2024-08-14T10:31:07 6-04:42:49
... |
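The `TIME_LEFT` column uses Slurm's `days-hours:minutes:seconds` layout (shorter durations drop the leading fields). As a sketch, a value in that layout can be converted to plain seconds with awk; the `slurm_to_seconds` helper below is hypothetical, not a Slurm command:

```shell
# Sketch: convert a Slurm duration such as 6-04:42:51 (D-HH:MM:SS) to seconds.
# slurm_to_seconds is a made-up helper name, not part of Slurm.
slurm_to_seconds() {
    echo "$1" | awk -F'[-:]' '{
        if (NF == 4)      print ($1 * 86400) + ($2 * 3600) + ($3 * 60) + $4  # D-HH:MM:SS
        else if (NF == 3) print ($1 * 3600) + ($2 * 60) + $3                 # HH:MM:SS
        else              print ($1 * 60) + $2                               # MM:SS
    }'
}

slurm_to_seconds "6-04:42:51"   # → 535371
slurm_to_seconds "0:19"         # → 19
```

This can be useful when scripting around the queue, e.g. deciding whether enough wall time remains to start another work unit.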
...
Info |
---|
Remember from Linux that your current location is your Current Working Directory, abbreviated to CWD. By default, Slurm looks for files, and writes output, relative to the folder you submitted your script from, i.e. your CWD. In the example above, the job was submitted from the folder containing `python01.py`, so the script could be called by name and the `slurm-<jobid>.out` file was written alongside it. Within the submission script you can define paths (absolute/relative) to other locations. |
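The CWD behaviour can be illustrated off the cluster with plain bash (the temporary folder below is made up for the demo): a relative path is always resolved against whatever the CWD is at the moment the command runs.

```shell
# Sketch: relative paths resolve against the current working directory (CWD).
cd "$(mktemp -d)"            # move into a fresh, throwaway working folder
echo "result" > output.txt   # relative path: the file lands in the CWD
realpath output.txt          # prints the absolute location of the file,
pwd                          # ...which sits under the folder we moved into
```

Inside a submission script, the same rule applies: `python python01.py` only works because the job's working directory contains that file.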
Info |
---|
You can submit a script from any of your allowed locations, but you need to manage and describe the paths to scripts, data, and output appropriately. |
...
Submit Jobs: scancel
: Cancel?
Code Block |
---|
[]$ sbatch run.sh
Submitted batch job 13526341
[]$ squeue -u <username>
  JOBID PARTITION     NAME     USER ST   TIME  NODES NODELIST(REASON)
13526341     moran   run.sh arcc-t05  R   0:03      1 m233
[]$ scancel 13526341
[]$ squeue -u <username>
  JOBID PARTITION     NAME     USER ST   TIME  NODES NODELIST(REASON)
[]$ cat slurm-13526341.out
SLURM_JOB_ID: 13526341
Start: 03/22/24 09:40:09
Python version: 3.10.6 (main, Oct 17 2022, 16:47:32) [GCC 12.2.0]
Version info: sys.version_info(major=3, minor=10, micro=6, releaselevel='final', serial=0)
slurmstepd: error: *** JOB 13526341 ON m233 CANCELLED AT 2024-03-22T09:40:17 *** |
...
The main Slurm sacct page.
Code Block |
---|
[]$ sacct -u <username> -X
JobID           JobName  Partition    Account  AllocCPUS      State ExitCode
------------ ---------- ---------- ---------- ---------- ---------- --------
13526337     interacti+      moran arccanetr+          1    TIMEOUT      0:0
13526338     interacti+      moran arccanetr+          1  COMPLETED      0:0
13526340         run.sh      moran arccanetr+          1  COMPLETED      0:0
13526341         run.sh      moran arccanetr+          1 CANCELLED+      0:0
# Lots more information:
[]$ sacct --help
[]$ man sacct
# Display more columns:
[]$ sacct -u <username> --format="JobID,Partition,nnodes,NodeList,NCPUS,ReqMem,State,Start,Elapsed" -X
JobID         Partition   NNodes        NodeList      NCPUS     ReqMem      State               Start    Elapsed
------------ ---------- -------- --------------- ---------- ---------- ---------- ------------------- ----------
13526337          moran        1            m233          1      1000M    TIMEOUT 2024-03-22T09:35:25   00:01:28
13526338          moran        1            m233          1      1000M  COMPLETED 2024-03-22T09:37:41   00:00:06
13526340          moran        1            m233          1      1000M  COMPLETED 2024-03-22T09:38:35   00:01:01
13526341          moran        1            m233          1      1000M CANCELLED+ 2024-03-22T09:40:08   00:00:09 |
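Because sacct prints plain columns, its output is easy to post-process. As a sketch, the `Elapsed` (HH:MM:SS) column from output like the above can be totalled with awk; the here-document replays the sample rows as a stand-in for piping a live `sacct ... -X -n` call (note that jobs longer than a day would show `D-HH:MM:SS` and need extra handling):

```shell
# Sketch: total the Elapsed (last, HH:MM:SS) column from saved sacct output.
# The here-document stands in for:  sacct -u <username> --format=... -X -n | awk ...
awk '{ split($NF, t, ":"); total += t[1] * 3600 + t[2] * 60 + t[3] }
     END { printf "Total elapsed: %d seconds\n", total }' <<'EOF'
13526337          moran        1            m233          1      1000M    TIMEOUT 2024-03-22T09:35:25   00:01:28
13526338          moran        1            m233          1      1000M  COMPLETED 2024-03-22T09:37:41   00:00:06
13526340          moran        1            m233          1      1000M  COMPLETED 2024-03-22T09:38:35   00:01:01
13526341          moran        1            m233          1      1000M CANCELLED+ 2024-03-22T09:40:08   00:00:09
EOF
# → Total elapsed: 164 seconds
```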
...
Code Block |
---|
[]$ sbatch --help

#SBATCH --account=<project-name>     # Required: account/time.
#SBATCH --time=72:00:00
#SBATCH --job-name=workshop          # Job name: helps to identify the job when using squeue.
#SBATCH --nodes=1                    # Options will typically have defaults.
#SBATCH --tasks-per-node=1           # Request resources in accordance with how you want
#SBATCH --cpus-per-task=1            # to parallelize your job, the type of hardware partition,
#SBATCH --partition=teton-gpumb      # and whether you require a GPU.
#SBATCH --gres=gpu:1
#SBATCH --mem=100G                   # Request specific memory needs.
#SBATCH --mem-per-cpu=10G            # (Use --mem or --mem-per-cpu, not both: they are mutually exclusive.)
#SBATCH --mail-type=ALL              # Get email notifications of the state of the job.
#SBATCH --mail-user=<email-address>
#SBATCH --output=<prefix>_%A.out     # Define a named output file postfixed with the job id. |
Info |
---|
...
Submit Jobs: sbatch
: Options: Applied to Example
Info |
---|
Let’s take the previous example, and add some of the additional options: |
Code Block |
---|
#!/bin/bash
#SBATCH --account=<project-name>
#SBATCH --time=10:00
#SBATCH --reservation=<reservation-name>
#SBATCH --job-name=pytest
#SBATCH --nodes=1
#SBATCH --cpus-per-task=1
#SBATCH --mail-type=ALL
#SBATCH --mail-user=<email-address>
#SBATCH --output=slurms/pyresults_%A.out
echo "SLURM_JOB_ID:" $SLURM_JOB_ID # Can access Slurm related Environment variables.
start=$(date +'%D %T') # Can call bash commands.
echo "Start:" $start
module purge
module load gcc/13.2.0 python/3.10.6 # Load the modules you require for your environment.
python python01.py # Call your scripts/commands.
sleep 1m
end=$(date +'%D %T')
echo "End:" $end |
Info |
---|
Notice:
|
...
Extended Example: What Does the Run Look Like?
Info |
---|
With the above settings, a submission will look something like the following: |
Info |
---|
In my inbox, I also received two emails with the subjects:
|
...
...