Versions Compared

Key

  • This line was added.
  • This line was removed.
  • Formatting was changed.

Goal: Introduction to Slurm and how to start interactive sessions, submit jobs and monitor.

...

Info
  • You submit a job to the queue and walk away.

  • Monitor its progress/state using command-line and/or email notifications.

  • Once complete, come back and analyze results.

...

Submit Jobs: sbatch: Template

...

Info
  • The squeue command only shows pending and running jobs.

  • If a job is no longer in the queue then it has finished.

  • Finished can mean success, failure, timeout... It’s just no longer running.

...

...

More squeue Information

The main Slurm squeue page.

Code Block
[]$# sbatchLots run.sh
Submitted batch job 13526341more information
[]$ squeue -u arcc-t05help
[]$ man squeue

# Display more columns:
# For example how much time is left of your requested wall time: TimeLeft
squeue -u arcc-t05 --Format="Account,UserName,JobID,SubmitTime,StartTime,TimeLeft"
[salexan5@mblog1 ~]$ squeue -u vvarenth --Format="Account,UserName,JobID,SubmitTime,StartTime,TimeLeft"
ACCOUNT             USER                JOBID               SUBMIT_TIME         START_TIME          TIME_LEFT
arccantrain         arcc-t05            1795458             2024-08-14T10:31:07 2024-08-14T10:31:09 6-04:42:51
arccantrain         arcc-t05            1795453             2024-08-14T10:31:06 2024-08-14T10:31:07 6-04:42:49
arccantrain         arcc-t05            1795454             2024-08-14T10:31:06 2024-08-14T10:31:07 6-04:42:49
...

...

Submit Jobs: scancel: Cancel?

Code Block
[]$ sbatch run.sh
Submitted batch job 13526341
[]$ squeue -u arcc-t05
             JOBID PARTITION     NAME     USER ST       TIME  NODES NODELIST(REASON)
          13526341     moran   run.sh arcc-t05  R       0:03      1 m233
[]$ scancel 13526341
[]$ squeue -u arcc-t05
             JOBID PARTITION     NAME     USER ST       TIME  NODES NODELIST(REASON)
[]$ cat slurm-13526341.out
SLURM_JOB_ID: 13526341
Start: 03/22/24 09:40:09
Python version: 3.10.6 (main, Oct 17 2022, 16:47:32) [GCC 12.2.0]
Version info: sys.version_info(major=3, minor=10, micro=6, releaselevel='final', serial=0)
slurmstepd: error: *** JOB 13526341 ON m233 CANCELLED AT 2024-03-22T09:40:17 ***

...

Submit Jobs: sacct: What happened?

The main Slurm sacct page.

Code Block
[]$ sacct -u arcc-t05 -X
JobID           JobName  Partition    Account  AllocCPUS      State ExitCode
------------ ---------- ---------- ---------- ---------- ---------- --------
13526337     interacti+      moran arccanetr+          1    TIMEOUT      0:0
13526338     interacti+      moran arccanetr+          1  COMPLETED      0:0
13526340         run.sh      moran arccanetr+          1  COMPLETED      0:0
13526341         run.sh      moran arccanetr+          1 CANCELLED+      0:0

# Lots more information
[]$ sacct --help
[]$ man sacct

# Display more columns:
[]$ sacct -u arcc-t05 --format="JobID,Partition,nnodes,NodeList,NCPUS,ReqMem,State,Start,Elapsed" -X
JobID         Partition   NNodes        NodeList      NCPUS     ReqMem      State               Start    Elapsed
------------ ---------- -------- --------------- ---------- ---------- ---------- ------------------- ----------
13526337          moran        1            m233          1      1000M    TIMEOUT 2024-03-22T09:35:25   00:01:28
13526338          moran        1            m233          1      1000M  COMPLETED 2024-03-22T09:37:41   00:00:06
13526340          moran        1            m233          1      1000M  COMPLETED 2024-03-22T09:38:35   00:01:01
13526341          moran        1            m233          1      1000M CANCELLED+ 2024-03-22T09:40:08   00:00:09

...