Goal: Introduce Slurm and show how to start interactive sessions, submit jobs, and monitor them.
...
...
Submit Jobs: sbatch: Template
...
...
...
More squeue Information
The main Slurm squeue page.
Code Block |
---|
# Lots more information
[]$ squeue --help
[]$ man squeue
# Display more columns:
# For example, how much time is left of your requested wall time: TimeLeft
[]$ squeue -u arcc-t05 --Format="Account,UserName,JobID,SubmitTime,StartTime,TimeLeft"
ACCOUNT USER JOBID SUBMIT_TIME START_TIME TIME_LEFT
arccantrain arcc-t05 1795458 2024-08-14T10:31:07 2024-08-14T10:31:09 6-04:42:51
arccantrain arcc-t05 1795453 2024-08-14T10:31:06 2024-08-14T10:31:07 6-04:42:49
arccantrain arcc-t05 1795454 2024-08-14T10:31:06 2024-08-14T10:31:07 6-04:42:49
... |
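The TIME_LEFT column is printed as days-hours:minutes:seconds, which is awkward to compare in a script. As a minimal sketch, the helper below converts such a value into seconds. The function name timeleft_to_seconds is hypothetical (it is not part of Slurm), and it assumes the value has the [days-][hours:]minutes:seconds shape shown above. Anything else, such as UNLIMITED, is rejected.

```shell
# Hypothetical helper (not a Slurm command): convert a squeue time value
# such as 6-04:42:51 or 0:03 into a number of seconds.
timeleft_to_seconds() {
    printf '%s\n' "$1" | awk '
        # Accept only [days-][hours:]minutes:seconds made of digits.
        !/^([0-9]+-)?([0-9]+:)?[0-9]+:[0-9]+$/ { exit 1 }
        {
            days = 0
            rest = $0
            # Split off the optional leading "days-" part.
            if (index($0, "-") > 0) {
                split($0, d, "-")
                days = d[1]
                rest = d[2]
            }
            # What remains is either hours:minutes:seconds or minutes:seconds.
            n = split(rest, f, ":")
            if (n == 3) secs = f[1] * 3600 + f[2] * 60 + f[3]
            else        secs = f[1] * 60 + f[2]
            print days * 86400 + secs
        }'
}
```

For example, `timeleft_to_seconds 6-04:42:51` prints 535371, and the function returns a non-zero status for values it cannot parse.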
...
Submit Jobs: scancel: Cancel?
Code Block |
---|
[]$ sbatch run.sh
Submitted batch job 13526341
[]$ squeue -u arcc-t05
JOBID PARTITION NAME USER ST TIME NODES NODELIST(REASON)
13526341 moran run.sh arcc-t05 R 0:03 1 m233
[]$ scancel 13526341
[]$ squeue -u arcc-t05
JOBID PARTITION NAME USER ST TIME NODES NODELIST(REASON)
[]$ cat slurm-13526341.out
SLURM_JOB_ID: 13526341
Start: 03/22/24 09:40:09
Python version: 3.10.6 (main, Oct 17 2022, 16:47:32) [GCC 12.2.0]
Version info: sys.version_info(major=3, minor=10, micro=6, releaselevel='final', serial=0)
slurmstepd: error: *** JOB 13526341 ON m233 CANCELLED AT 2024-03-22T09:40:17 *** |
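scancel needs the job ID that sbatch prints when the job is submitted. As a minimal sketch, the helper below pulls that ID out of the confirmation line so it can be stored in a variable. The name job_id_from_sbatch is hypothetical, and the sketch assumes the message has the "Submitted batch job <id>" form shown above. The sbatch and scancel calls appear only in comments so the snippet can be tried without a cluster.

```shell
# Hypothetical helper (not a Slurm command): extract the job ID from
# sbatch's "Submitted batch job <id>" confirmation line.
job_id_from_sbatch() {
    printf '%s\n' "$1" | awk '
        /^Submitted batch job [0-9]+/ { print $4; found = 1; exit }
        END { exit !found }   # non-zero status if no job ID was seen
    '
}

# Possible use on a cluster (requires Slurm, so left as comments):
#   jobid=$(job_id_from_sbatch "$(sbatch run.sh)")
#   scancel "$jobid"
```

An alternative that avoids parsing is `sbatch --parsable run.sh`, which prints only the job ID (followed by a semicolon and the cluster name on multi-cluster setups).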
...
Submit Jobs: sacct: What happened?
The main Slurm sacct page.
Code Block |
---|
[]$ sacct -u arcc-t05 -X
JobID JobName Partition Account AllocCPUS State ExitCode
------------ ---------- ---------- ---------- ---------- ---------- --------
13526337 interacti+ moran arccanetr+ 1 TIMEOUT 0:0
13526338 interacti+ moran arccanetr+ 1 COMPLETED 0:0
13526340 run.sh moran arccanetr+ 1 COMPLETED 0:0
13526341 run.sh moran arccanetr+ 1 CANCELLED+ 0:0
# Lots more information
[]$ sacct --help
[]$ man sacct
# Display more columns:
[]$ sacct -u arcc-t05 --format="JobID,Partition,nnodes,NodeList,NCPUS,ReqMem,State,Start,Elapsed" -X
JobID Partition NNodes NodeList NCPUS ReqMem State Start Elapsed
------------ ---------- -------- --------------- ---------- ---------- ---------- ------------------- ----------
13526337 moran 1 m233 1 1000M TIMEOUT 2024-03-22T09:35:25 00:01:28
13526338 moran 1 m233 1 1000M COMPLETED 2024-03-22T09:37:41 00:00:06
13526340 moran 1 m233 1 1000M COMPLETED 2024-03-22T09:38:35 00:01:01
13526341 moran 1 m233 1 1000M CANCELLED+ 2024-03-22T09:40:08 00:00:09 |
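When the sacct listing is long, it can help to show only the jobs that did not finish normally. The sketch below assumes pipe-delimited input with the job ID in the first field and the state in the second, which is what `sacct -P --format=JobID,State` produces. The function name unfinished_jobs is hypothetical.

```shell
# Hypothetical helper (not a Slurm command): read "JobID|State" lines on
# stdin and print the jobs whose state is anything other than COMPLETED.
unfinished_jobs() {
    awk -F'|' '$2 != "COMPLETED" { print $1, $2 }'
}

# Possible use on a cluster (requires Slurm, so left as a comment):
#   sacct -u arcc-t05 -X -n -P --format=JobID,State | unfinished_jobs
# -X limits output to job allocations, -n drops the header line,
# and -P selects pipe-delimited output.
```

Applied to the four jobs listed above, this would keep 13526337 (TIMEOUT) and 13526341 (cancelled) and drop the two COMPLETED jobs.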
...