HPC System and Job Queries

Overview: HPC Information and Compute Job Information

Querying the system helps you understand what is happening on it: which compute jobs are running, how much of your storage quota is used, your job history, and so on. This page contains commands and examples showing how to find that information.

Common SLURM Commands

The following describes common SLURM commands and common flags you may want to include when running them. SLURM commands are often run with flags (appended to the command as -f or --flag) to stipulate specific information that should be included in the output.
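For example, most flags come in an equivalent short and long form. A quick illustration using the squeue command covered below (the username jsmith is a placeholder):

[jsmith@mblog1 ~]$ squeue -u jsmith
[jsmith@mblog1 ~]$ squeue --user=jsmith

Both commands produce the same output; long forms that take a value are typically written with an equals sign (--user=jsmith), while short forms take the value as the next argument (-u jsmith).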

SQUEUE: Get information about running and queued jobs on the cluster with squeue

This command is used to pull up information about the jobs that currently exist in the SLURM queue. Run with no flags, it prints all running and queued jobs on the cluster, listing each job's ID, partition, name, username, status (in the ST column, R means running and PD means pending), run time, number of nodes, and a node list with the names of the nodes allocated to each job (or the reason it is still pending):

squeue
  JOBID PARTITION      NAME      USER ST       TIME  NODES NODELIST(REASON)
1000001  inv-arcc  myjob_11     user5  R 2-15:39:34      1 mba30-005
1000002  inv-lab2   AIML-CE   joeblow  R 6-13:02:32      1 mba30-004
1000005  inv-lab2   AIML-CE   joeblow  R 6-17:31:53      1 mba30-004
1000012        mb  interact cowboyjoe  R 2-21:28:49      1 mbcpu-010
1000015        mb  sys/dash    jsmith  R    1:05:19      1 mbcpu-001
1000019    mb-a30  sys/dash  janesmit  R    8:45:36      1 mba30-006
1000022    mb-a30  Script.s   doctorm PD       0:00      1 (Resources)
1000025    mb-a30 Script.22   doctorz  R    7:05:44      1 mba30-001
1000028   mb-h100  sys/dash    mmajor PD       0:00      1 (Resources)
1000033   mb-h100  sys/dash    mmajor PD       0:00      1 (Priority)
1000037   mb-h100  sys/dash  kjohnson PD       0:00      1 (Priority)
1000041   mb-h100  sys/dash  kjohnson PD       0:00      1 (Priority)
1000045   mb-h100  sys/dash    mmajor  R 2-02:25:37      1 mbh100-003
1000058   mb-l40s Script.se   doctorz  R 1-00:58:25      1 mbl40s-003
1000062     teton  C1225-TT    user17  R 3-19:54:48      1 t507
1000065     teton  C1225-TT    user17  R 4-17:36:26      1 t502

Helpful flags when calling squeue to tailor your query

Flag

Use this when

Short Form

Short Form Ex.

Long Form

Useful flag info, Long Form Example & Output


me

To get a printout with just your jobs

n/a

n/a

--me

The --me flag prints squeue info only for jobs submitted by you:

[jsmith@mblog1 ~]$ squeue --me
  JOBID PARTITION      NAME      USER ST       TIME  NODES NODELIST(REASON)
1000002  inv-lab2   AIML-CE    jsmith  R 6-13:02:32      1 mba30-004
1000005  inv-lab2   AIML-CE    jsmith  R 6-17:31:53      1 mba30-004

user

To get a printout of a specific user’s jobs

-u

squeue -u joeblow

--user

The --user or -u flag (shown in the example below specifying a username) prints squeue info only for jobs submitted by the specified user:

[jsmith@mblog1 ~]$ squeue --user=joeblow
  JOBID PARTITION      NAME      USER ST       TIME  NODES NODELIST(REASON)
1000002  inv-lab2   AIML-CE   joeblow  R 6-13:02:32      1 mba30-004
1000005  inv-lab2   AIML-CE   joeblow  R 6-17:31:53      1 mba30-004

long

To get a printout of jobs including wall time

-l

squeue -l

--long

The --long flag (shown in the example below) prints the above information as well as the wall time requested for each job.

squeue --long
Mon Jan 1 12:55:23 2020
  JOBID PARTITION      NAME      USER    STATE       TIME  TIME_LIMI  NODES NODELIST(REASON)
1000001  inv-arcc  myjob_11     user5  RUNNING 2-15:39:34 3-00:00:00      1 mba30-005
1000002  inv-lab2   AIML-CE   joeblow  RUNNING 6-13:11:23 7-00:00:00      1 mba30-004
1000005  inv-lab2   AIML-CE   joeblow  RUNNING 6-17:31:53 7-00:00:00      1 mba30-004
1000012        mb  interact cowboyjoe  RUNNING 2-21:28:49 3-00:00:00      1 mbcpu-010
1000015        mb  sys/dash    jsmith  RUNNING    1:05:19    5:00:00      1 mbcpu-001
1000019    mb-a30  sys/dash  janesmit  RUNNING    8:45:36 4-09:00:00      1 mba30-006
1000022    mb-a30  Script.s   doctorm  PENDING       0:00 1-00:00:00      1 (Resources)
1000025    mb-a30 Script.22   doctorz  RUNNING    7:05:44 1-00:00:00      3 mba30-001 mba30-002 mba30-003
1000028   mb-h100  sys/dash    mmajor  PENDING       0:00 1-00:00:00      1 (Resources)
1000033   mb-h100  sys/dash    mmajor  PENDING       0:00    1:00:00      1 (Priority)
1000037   mb-h100  sys/dash  kjohnson  PENDING       0:00    5:00:00      1 (Priority)
1000041   mb-h100  sys/dash  kjohnson  PENDING       0:00    2:00:00      1 (Priority)
1000045   mb-h100  sys/dash    mmajor  RUNNING 2-02:25:37 3-00:00:00      1 mbh100-003
1000058   mb-l40s Script.se   doctorz  RUNNING 1-00:58:25 2-00:00:00      1 mbl40s-003
1000062     teton  C1225-TT    user17  RUNNING 3-19:54:48 5-00:00:00      1 t507
1000065     teton  C1225-TT    user17  RUNNING 4-17:36:26 7-00:00:00      1 t502

format

To get squeue printout with specified format & output

-O

squeue -O Account,UserName,JobID,SubmitTime,StartTime,TimeLeft

--Format

When appended with the --Format (or -O) flag, squeue output is given using the specified fields. Fields should be indicated using column names recognized by SLURM (hint: run squeue --helpformat to get a list of SLURM's recognized column names)

[user17@mblog1 ~]$ squeue --Format="Account,UserName,JobID,SubmitTime,StartTime,TimeLeft"
ACCOUNT       USER    JOBID    SUBMIT_TIME          START_TIME           TIME_LEFT
deeplearnlab  user17  1000062  2024-08-14T10:31:07  2024-08-14T10:31:09  6-04:42:51
deeplearnlab  user17  1000091  2024-08-14T10:31:06  2024-08-14T10:31:07  6-04:42:49
deeplearnlab  user17  1000099  2024-08-14T10:31:06  2024-08-14T10:31:07  6-04:42:49

** You can also run squeue --help to get a comprehensive list of flags available for the squeue command
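In addition to the field-name form above, squeue also accepts a printf-style format string through -o/--format. As a sketch (the format string below is an assumption intended to approximate squeue's default columns; see man squeue for the full list of % codes):

[jsmith@mblog1 ~]$ squeue -o "%.18i %.9P %.8j %.8u %.2t %.10M %.6D %R"

Here %i is the job ID, %P the partition, %j the job name, %u the user, %t the compact state, %M the elapsed time, %D the node count, and %R the reason or node list.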

SACCT: Get information about recent or completed jobs on the cluster with sacct

The default sacct command: this prints a list of your recent and recently completed jobs

[user17@mblog1 ~]$ sacct
JobID           JobName  Partition    Account  AllocCPUS      State ExitCode
------------ ---------- ---------- ---------- ---------- ---------- --------
1000000      sys/dashb+         mb  aiproject          4  COMPLETED      0:0
1000000.bat+      batch             aiproject          4  COMPLETED      0:0
1000000.ext+     extern             aiproject          4  COMPLETED      0:0
1000003      sys/dashb+         mb  aiproject          8    RUNNING      0:0
1000003.bat+      batch             aiproject          8    RUNNING      0:0
1000003.ext+     extern             aiproject          8    RUNNING      0:0

Helpful flags when calling sacct to tailor your query

Flag

Use this when

Short Form

Short Form Ex.

Long Form

Useful flag info, Long Form Example & Output


job

To get info about specific job#(s)

-j

sacct -j 1000013

--jobs

[user05@mblog1 ~]$ sacct --jobs=1000013,1000025
JobID           JobName  Partition    Account  AllocCPUS      State ExitCode
------------ ---------- ---------- ---------- ---------- ---------- --------
1000013      sys/dashb+         mb  mlproject          4    TIMEOUT      0:0
1000013.bat+      batch             mlproject          4  CANCELLED     0:15
1000013.ext+     extern             mlproject          4  COMPLETED      0:0
1000025      sys/dashb+         mb  mlproject          8    RUNNING      0:0
1000025.bat+      batch             mlproject          8    RUNNING      0:0
1000025.ext+     extern             mlproject          8    RUNNING      0:0

batch script

To view batch / submission script for a specific job

-B

sacct -j 1000101 -B

--batch-script

You must specify a job with the --jobs or -j flag to use the -B or --batch-script flag and see its associated batch/submission script. This will not work for interactive jobs run from an salloc command, or for jobs that were not submitted from a script.

[user05@mblog1 ~]$ sacct -j 1000101 --batch-script
Batch Script for 1000101
---------------------------------------------------------------------
#!/bin/bash
#SBATCH --account=extrememl
#SBATCH --time=1:00:00
#SBATCH --mail-user=johnsmith@uwyo.edu
#SBATCH --mail-type=all

# Clear out and then load necessary software
module purge
module load gcc/14.2.0 r/4.4.0

# Browse to my project folder
cd /project/myprojdir/johnsmith/scripts/

# Export useful connection variables
export $HOSTNAME

# Run my code
R myscript.R

user

To get a printout of a specific user’s jobs

-u

sacct -u joeblow

--user

The --user or -u flag (shown in the example below specifying a username) prints sacct info only for jobs submitted by the specified user:

[joeblow@mblog1 ~]$ sacct --user=joeblow
JobID      JobName  Partition    Account  AllocCPUS    State ExitCode
------- ---------- ---------- ---------- ---------- -------- --------
1000002    AIML-CE         mb  extremeai          4  RUNNING      0:0
1000005    AIML-CE         mb  extremeai          4  RUNNING      0:0

start

To get a printout of job(s) starting after a date/time

-S

sacct -S 2024-11-01

--start

Dates and times should be specified in the format YYYY-MM-DD[THH:MM[:SS]]

[user05@mblog1 ~]$ sacct --start=2024-11-01
JobID           JobName  Partition    Account  AllocCPUS      State ExitCode
------------ ---------- ---------- ---------- ---------- ---------- --------
1000013      sys/dashb+         mb  mlproject          4    TIMEOUT      0:0
1000013.bat+      batch             mlproject          4  CANCELLED     0:15
1000013.ext+     extern             mlproject          4  COMPLETED      0:0
1000025      sys/dashb+         mb  mlproject          8    RUNNING      0:0
1000025.bat+      batch             mlproject          8    RUNNING      0:0
1000025.ext+     extern             mlproject          8    RUNNING      0:0

end

To get a printout of job(s) ending before a given date/time

-E

sacct -E 2024-11-24T12:00:00

--end

Dates and times should be specified in the format YYYY-MM-DD[THH:MM[:SS]]

[user05@mblog1 ~]$ sacct --start=2024-11-01 --end=2024-11-24
JobID           JobName  Partition    Account  AllocCPUS      State ExitCode
------------ ---------- ---------- ---------- ---------- ---------- --------
1000013      sys/dashb+         mb  mlproject          4    TIMEOUT      0:0
1000013.bat+      batch             mlproject          4  CANCELLED     0:15
1000013.ext+     extern             mlproject          4  COMPLETED      0:0
1000025      sys/dashb+         mb  mlproject          8    RUNNING      0:0
1000025.bat+      batch             mlproject          8    RUNNING      0:0
1000025.ext+     extern             mlproject          8    RUNNING      0:0

format

To get sacct printout with specified format & output

-o

sacct -o Account,JobID

--format

When appended with the --format (or -o) flag, sacct output is given using the specified fields. Fields should be indicated using column names recognized by SLURM (hint: run sacct --helpformat to get a list of SLURM's recognized column names)

[user17@mblog1 ~]$ sacct --format=Account,JobID
     Account        JobID
------------ ------------
deeplearnlab      1000062
deeplearnlab      1000091
deeplearnlab      1000099

submit line

To view the submit command for a specified job

-o SubmitLine

sacct -o SubmitLine -j 1000101

--format=SubmitLine

This is a way of using the --format flag from above to print the command you entered to submit the job specified after the -j flag.

[user11@mblog1 ~]$ sacct --format=SubmitLine -j 1000324
          SubmitLine
--------------------
  sbatch main_job.sh

WorkDir

To view the working directory used by the job to execute commands

-o WorkDir

sacct -o WorkDir -j 1000101

--format=WorkDir

[user11@mblog1 ~]$ sacct --format=WorkDir -j 1000324
             WorkDir
--------------------
/project/deeplearnlab/
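The sacct flags above can also be combined into a single query. A sketch (the username and date are placeholders); the -X (--allocations) flag shown here suppresses the .batch and .extern job steps so each job prints on a single line:

[jsmith@mblog1 ~]$ sacct -X -u joeblow -S 2024-11-01 -o JobID,JobName,State,Elapsed,ExitCode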

My Job Failed. What Do These Exit Codes Mean?

Slurm records exit codes as numerical values that can seem rather cryptic. While we don't always know for sure what caused them without investigation, some causes are more likely than others. Exit codes usually consist of two numbers separated by a colon, or a single number: the number before the colon is the exit code the job returned, and the number after it is the signal that terminated the job, if there was one. Common exit codes and their likely causes are below, and an example of looking up a job's exit code follows the table:

Exit Code

Likely Cause


0

The job ran successfully

Any non-zero value

The job failed in some form or another

1

A general failure

2

Something was wrong with a shell command in the script

3 and above

Job error associated with software commands (check software specific exit codes)

0:9

The job was cancelled, usually by the user or by Slurm/the system; the 9 after the colon is the terminating signal (SIGKILL)

0:15

The job was cancelled, usually because the user cancelled it or it ran over its specified walltime; the 15 after the colon is the terminating signal (SIGTERM)

0:53

Some file or directory referenced in the script was not readable or writable

0:125

Job ran out of memory

Anything else

Contact arcc-help@uwyo.edu to have us investigate
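To look up the exit code for a specific finished job, the ExitCode field can be requested directly with sacct (a sketch; the job ID below is a placeholder):

[jsmith@mblog1 ~]$ sacct -j 1000101 -o JobID,JobName,State,ExitCode

The State column (e.g. FAILED, TIMEOUT, OUT_OF_MEMORY, CANCELLED) read together with the exit code usually narrows down what went wrong.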

 

** You can also run sacct --help to get a comprehensive list of flags available for the sacct command

SINFO: Get information about cluster nodes and partitions

The default sinfo command: this prints a list of all partitions, their states, availability, and associated nodes on the cluster

[user1@mblog2 ~]$ sinfo
PARTITION     AVAIL  TIMELIMIT  NODES  STATE  NODELIST
mb*              up 7-00:00:00      1    mix  mbcpu-007
mb*              up 7-00:00:00     24  alloc  mbcpu-[001-006,008-025]
mb-a30           up 7-00:00:00      1  maint  mba30-008
mb-a30           up 7-00:00:00      3    mix  mba30-[002,004,006]
mb-a30           up 7-00:00:00      1  alloc  mba30-005
mb-a30           up 7-00:00:00      3   idle  mba30-[001,003,007]
mb-l40s          up 7-00:00:00      1  maint  vl40s-002
mb-l40s          up 7-00:00:00      1   resv  mbl40s-004
mb-l40s          up 7-00:00:00      3    mix  mbl40s-[001-003]
mb-l40s          up 7-00:00:00      1   idle  mbl40s-007
mb-h100          up 7-00:00:00      1 drain$  mbh100-004
mb-h100          up 7-00:00:00      4    mix  mbh100-[001-003,005]
mb-a6000         up 7-00:00:00      1    mix  mba6000-001
wildiris         up 7-00:00:00      5   idle  wi[001-005]
teton            up 7-00:00:00      1  drain  t286
teton            up 7-00:00:00      3    mix  t[460,502,507]
teton            up 7-00:00:00     24   idle  t[285,287-296,501,503-506,508],thm[03-05],tmass[01-02],ttest[01-02]
beartooth        up 7-00:00:00      1   idle  b523
inv-arcc         up   infinite      1  alloc  mbcpu-025
inv-arcc         up   infinite      2   idle  ttest[01-02]
inv-inbre        up 7-00:00:00      1  drain  t286
inv-inbre        up 7-00:00:00      2    mix  t[502,507]
inv-inbre        up 7-00:00:00      1  alloc  mbcpu-009
inv-inbre        up 7-00:00:00     24   idle  b523,mbl40s-007,t[285,287-296,501,503-506,508],thm[03-05],tmass[01-02]
inv-ssheshap     up 7-00:00:00      1    mix  mba6000-001
inv-wysbc        up 7-00:00:00      1  alloc  mbcpu-001
inv-wysbc        up 7-00:00:00      1   idle  mba30-001
inv-soc          up 7-00:00:00      1    mix  mbl40s-001
inv-wildiris     up 7-00:00:00      5   idle  wi[001-005]
inv-klab         up 7-00:00:00      3    mix  mba30-[002,004],mbcpu-007
inv-klab         up 7-00:00:00      6  alloc  mba30-005,mbcpu-[002-006]
inv-klab         up 7-00:00:00      1   idle  mba30-003
inv-dale         up 7-00:00:00      1  alloc  mbcpu-008
inv-wsbc         up 7-00:00:00      1    mix  mba30-006
inv-wsbc         up 7-00:00:00      1  alloc  mbcpu-010
non-investor     up 7-00:00:00      1    mix  t460
non-investor     up 7-00:00:00     14  alloc  mbcpu-[011-024]

Helpful flags when calling sinfo to tailor your query

Flag

Use this when

Short Form

Short Form Ex.

Long Form

Useful flag info, Long Form Example & Output


state

Shows any nodes in state(s) specified

-t

sinfo -t reserved

--states

The --states flag prints sinfo output listing the nodes (if any) in the specified state, along with the number of nodes from each partition in that state. If no nodes in a partition are in the state, that partition's line shows 0 nodes.

[jsmith@mblog1 ~]$ sinfo --states=mixed
PARTITION     AVAIL  TIMELIMIT  NODES  STATE  NODELIST
mb*              up 7-00:00:00      0    n/a
mb-a30           up 7-00:00:00      3    mix  mba30-[002,004,006]
mb-l40s          up 7-00:00:00      3    mix  mbl40s-[001-003]
mb-h100          up 7-00:00:00      4    mix  mbh100-[001-003,005]
mb-a6000         up 7-00:00:00      1    mix  mba6000-001
wildiris         up 7-00:00:00      0    n/a
teton            up 7-00:00:00      3    mix  t[460,502,507]
beartooth        up 7-00:00:00      0    n/a
inv-arcc         up   infinite      0    n/a
inv-inbre        up 7-00:00:00      2    mix  t[502,507]
inv-ssheshap     up 7-00:00:00      1    mix  mba6000-001
inv-wysbc        up 7-00:00:00      0    n/a
inv-soc          up 7-00:00:00      1    mix  mbl40s-001
inv-wildiris     up 7-00:00:00      0    n/a
inv-klab         up 7-00:00:00      2    mix  mba30-[002,004]
inv-dale         up 7-00:00:00      0    n/a
inv-wsbc         up 7-00:00:00      1    mix  mba30-006
non-investor     up 7-00:00:00      1    mix  t460

format

To get sinfo printout with specified format & output

-O

sinfo -O NodeAddr,AllocMem,Cores

--Format

When appended with the --Format (or -O) flag, sinfo output is given using the specified fields. Fields should be indicated using column names recognized by SLURM (hint: run sinfo --helpformat to get a list of SLURM's recognized column names)

[user17@mblog1 ~]$ sinfo --Format="AllocMem,AllocNodes,Available,Cores,CPUs,CPUsLoad,Disk,Gres,Nodes,Memory"
ALLOCMEM  ALLOCNODES  AVAIL  CORES  CPUS  CPU_LOAD     TMP_DISK  GRES         NODES  MEMORY
886016    all         up     48     96    90.25        0         (null)       1      1023575
924576    all         up     48     96    96.06-96.12  0         (null)       5      1023575
511296    all         up     48     96    95.84        0         (null)       1      1023575
393216    all         up     48     96    96.45-96.56  0         (null)       2      1023575
588096    all         up     48     96    89.97        0         (null)       1      1023575
570336    all         up     48     96    96.31-96.43  0         (null)       3      1023575
629376    all         up     48     96    96.23-96.40  0         (null)       5      1023575
514912    all         up     48     96    92.31        0         (null)       1      1023575
688416    all         up     48     96    96.33        0         (null)       1      1023575
798304    all         up     48     96    93.06        0         (null)       1      1023575
857344    all         up     48     96    93.08        0         (null)       1      1023575
865536    all         up     48     96    96.10-96.25  0         (null)       2      1023575
806496    all         up     48     96    96.23        0         (null)       1      1023575
102400    all         up     48     96    42.22        0         gpu:a30:8    1      765525
208896    all         up     48     96    82.04        0         gpu:a30:8    1      765525
524288    all         up     48     96    0.02         0         gpu:a30:8    1      765525
49152     all         up     48     96    585.36       0         gpu:a30:8    1      765525
0         all         up     48     96    0.00-0.02    0         gpu:a30:8    4      765525
0         all         up     12     12    0.00         0         gpu:l40s:1   1      75469
0         all         up     48     96    0.00         0         gpu:l40s:8   1      765525
524288    all         up     48     96    4.41-5.24    0         gpu:l40s:8   2      765525
262144    all         up     48     96    2.43         0         gpu:l40s:8   1      765525
0         all         up     48     96    0.00         0         gpu:l40s:4   1      765525
0         all         up     48     96    0.35         0         gpu:h100:8   1      1281554
524288    all         up     48     96    0.26-12.20   0         gpu:h100:8   4      1281554
262144    all         up     32     64    6.03         0         gpu:a6000:4  1      1023575
0         all         up     14+    28+   0.00-0.01    0         (null)       30     119962+
0         all         up     28     56    0.00         0         gpu:a30:2    1      1020129
32768     all         up     16     32    15.17        0         (null)       1      128000
30720     all         up     20     40    2.00-2.02    0         (null)       2      184907
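sinfo can also be narrowed to a single partition with the -p/--partition flag, which is handy before submitting to a specific partition (a sketch; the partition name below is one of those shown in the examples above):

[jsmith@mblog1 ~]$ sinfo -p mb-a30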

SEFF: Analyze the efficiency of a completed job with seff

Below is a short breakdown of the seff command. Please see this page for a detailed description of how to evaluate your job's performance and efficiency.

The seff command provides information about the CPU and memory efficiency of your job when given a valid job number as its argument: seff <job#>. This information is only accurate if the job has completed successfully. Jobs that are still running, or that ended with an out-of-memory or other error, will have inaccurate seff output.

$ seff 10001001
Job ID: 10001001
Cluster: Medicinebow
User/Group: jsmith/mycoolproject
State: COMPLETED (exit code 0)
Cores: 1
CPU Utilized: 00:00:05
CPU Efficiency: 27.78% of 00:00:18 core-walltime
Job Wall-clock time: 00:00:18
Memory Utilized: 0.00 MB (estimated maximum)
Memory Efficiency: 0.00% of 8.00 GB (8.00 GB/node)
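Since seff needs a completed job's ID, one workflow is to list your recent jobs with sacct first and then pass the chosen ID to seff (a sketch; the -X flag hides the .batch/.extern job steps):

[jsmith@mblog1 ~]$ sacct -X -o JobID,JobName,State,Elapsed
[jsmith@mblog1 ~]$ seff 10001001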

 


ARCCJOBS: Get a report of jobs currently running on the cluster

arccjobs shows a summary of running and pending jobs, CPU resources, and requested/used CPU time. It doesn't take any arguments or options.

$ arccjobs
===============================================================================
Account                       Running                    Pending
  User                  jobs   cpus      cpuh     jobs   cpus       cpuh
===============================================================================
advanceddl                 1      1      1.09        0      0       0.00
  joeblow                  1      1      1.09        0      0       0.00
arcc                       2      8     22.01        0      0       0.00
  arcchelper1              1      4      5.61        0      0       0.00
  arccstaff2               1      4     16.40        0      0       0.00
llmproj                    6    192   5229.23        2     64   10752.00
  user1                    4    128   4769.78        0      0       0.00
  johnsmith                2     64    459.45        2     64   10752.00
physicsclass               2     13     16.34        0      0       0.00
  student5                 1     12     15.22        0      0       0.00
  classta2                 1      1      1.12        0      0       0.00
researchlab               14    613    882.26        2    120    9600.00
  gradresrcher1            2      9      5.82        0      0       0.00
  researcher18            12    604    876.43        2    120    9600.00
....(CONT)
===============================================================================
TOTALS:                   25    827  41597.79      320    500   22248.00
===============================================================================
Nodes        39/79 (49.37%)
Cores        2514/5492 (45.78%)
Memory (GB)  16025/60278 (26.58%)
CPU Load     2591.46 (47.19%)
===============================================================================

ARCCQUOTA: Get a report of your common HPC data storage locations and usage

arccquota shows information relating to storage quotas. By default, it displays your $HOME and $SCRATCH quotas first, followed by your associated project quotas. This is a change on Teton from Mount Moran, but the tool is much more comprehensive. The command takes arguments to show project-only information (i.e., no $HOME or $SCRATCH info displayed), to give an extensive listing of users' quotas and usage within project directories, or to summarize quotas (i.e., no user-specific usage on project spaces).

[jsmith@mblog1 ~]$ arccquota
+----------------------------------------------------------------------+
| arccquota                    | Block                                 |
+----------------------------------------------------------------------+
| Path                         |      Used        Limit        %       |
+----------------------------------------------------------------------+
| /home/jsmith                 |  31.35 GB     50.00 GB    62.71       |
| /gscratch/jsmith             | 550.44 MB     05.00 TB    00.01       |
| /project/awesomeresearchproj |  04.96 GB     05.00 TB    00.10       |
+----------------------------------------------------------------------+

[jsmith@mblog1 ~]$ arccquota -u collaboratorfriend
+----------------------------------------------------------------------+
| arccquota                    | Block                                 |
+----------------------------------------------------------------------+
| Path                         |      Used        Limit        %       |
+----------------------------------------------------------------------+
| /home/collaboratorfriend     |  49.55 GB     50.00 GB    99.20       |
| /gscratch/collaboratorfriend |    5.4 MB     05.00 TB    00.00       |
| /project/awesomeresearchproj |  04.96 GB     05.00 TB    00.10       |
+----------------------------------------------------------------------+

 

 
