Querying the system helps you understand what is happening on the cluster: which compute jobs are running, how much of your storage quota is used, what your job history looks like, and so on. This page lists the commands for finding that information, with examples.
ARCC Specific Commands & Queries
ARCCJOBS: Get a report of jobs currently running on the cluster
arccjobs shows a summary of jobs, CPU resources, and requested/used CPU time. It takes no arguments or options.
| Expand |
|---|
| title | Expand to see an example of calling arccjobs and example output |
|---|
|
| Code Block |
|---|
$ arccjobs
===============================================================================
Account Running Pending
User jobs cpus cpuh jobs cpus cpuh
===============================================================================
... |
|
Helpful flags when calling squeue to tailor your query
...
Flag
...
Use this when
...
Short Form
...
Short Form Ex.
...
Long Form
...
Useful flag info, Long Form Example & Output
...
me
...
To get a printout with just your jobs
...
n/a
...
n/a
...
--me
The --me flag prints squeue info only for jobs that you submitted:
...
| title | Expand to see an example of squeue command run with --me flag, & output |
|---|
...
...
...
...
...
...
...
...
...
...
...
...
...
...
12 15.22 0 0 0.00
classta2 |
|
...
...
...
...
...
...
...
user
...
To get a printout of a specific user’s jobs
...
-u
...
squeue -u joeblow
...
--user
The --user (or -u) flag, shown with a username in the expandable example below, prints squeue info only for jobs submitted by the specified user:
...
| title | Expand to see an example of squeue command run with --user flag, and output |
|---|
...
long
...
To get a printout of jobs including wall time
...
-l
...
squeue -l
...
--long
The --long flag (shown in the expandable example below) prints the default squeue information plus the wall time requested for each job.
...
| title | Expand to see an example of squeue command run with --long flag, and output |
|---|
...
...
...
...
...
...
...
...
...
...
...
...
...
...
...
...
...
9600.00
....(CONT)
===============================================================================
TOTALS: |
|
...
...
...
...
...
...
...
...
...
...
===============================================================================
Nodes |
|
...
...
...
...
...
...
...
...
...
...
...
...
...
...
...
...
...
...
...
...
(47.19%)
=============================================================================== |
|
ARCCQUOTA: Get a report of your common HPC data storage locations and usage
arccquota reports storage quotas and usage. By default it displays your $HOME and $SCRATCH quotas first, followed by the quotas of the projects you belong to. This is a change on Teton from Mount Moran, but the tool is much more comprehensive. The command takes arguments to show projects only (no $HOME or $SCRATCH info), to list each user's quota and usage within project directories, or to summarize quotas (no user-specific usage on project spaces).
| Expand |
|---|
| title | Expand to view the default arccquota command and example output |
|---|
|
| Code Block |
|---|
[jsmith@mblog1 ~]$ arccquota
+----------------------------------------------------------------------+
| |
|
...
...
...
...
...
...
...
...
...
...
|
+----------------------------------------------------------------------+
| Path |
|
...
...
...
...
...
...
...
...
...
|
+----------------------------------------------------------------------+
| /home/jsmith |
|
...
...
...
...
...
| 31.35 GB 50.00 GB 62.71 |
| /gscratch/jsmith |
|
...
...
...
...
...
MB 05.00 TB 00.01 |
| /project/awesomeresearchproj | 04.96 GB 05.00 TB 00.10 |
+----------------------------------------------------------------------+ |
|
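The last column of each row can be sanity-checked by hand. The sketch below assumes, because the column headers are elided in the example above, that the three numeric columns are used space, quota, and percent used. The helper name percent_used is hypothetical (it is not part of arccquota), and 1 TB is taken to be 1024 GB.

```shell
#!/bin/sh
# percent_used USED_GB QUOTA_GB
# Hypothetical helper: prints usage as a percentage of quota, two decimals.
# Both arguments must be in the same unit (GB here).
percent_used() {
    awk -v used="$1" -v quota="$2" 'BEGIN { printf "%.2f\n", 100 * used / quota }'
}

# /home/jsmith: 31.35 GB used of a 50.00 GB quota
percent_used 31.35 50

# /project/awesomeresearchproj: 4.96 GB used of 5.00 TB (assumed 5120 GB)
percent_used 4.96 5120
```

Sizes in the report are rounded to two decimals, so a percentage recomputed from the displayed numbers may differ from the reported one in the last digit.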
| Expand |
|---|
| title | Expand to view the arccquota command querying a specified user and example output |
|---|
|
| Code Block |
|---|
[jsmith@mblog1 ~]$ arccquota -u collaboratorfriend
+----------------------------------------------------------------------+
| |
|
...
...
...
...
...
...
...
...
...
|
+----------------------------------------------------------------------+
| |
|
...
...
...
...
...
...
...
...
...
...
+----------------------------------------------------------------------+
| /home/collaboratorfriend |
|
...
...
...
...
|
| /gscratch/collaboratorfriend |
|
...
...
...
...
05.00 TB 00.00 |
| /project/awesomeresearchproj | 04.96 GB 05.00 TB 00.10 |
|
...
format
...
To get squeue printout with specified format & output
...
-O
...
squeue -O Account,UserName,JobID,SubmitTime,StartTime,TimeLeft
...
--Format
If appended with the --Format flag, squeue prints only the columns you name. Columns should be given using the field names recognized by SLURM (hint: run squeue --helpFormat to get a list of SLURM’s recognized field names).
...
| title | Expand to see an example of squeue command run with --Format flag, and output |
|---|
...
|
+----------------------------------------------------------------------+
|
|
SHOWJOB: Get job parameters, and details for a job
Running showjob with a job ID prints the parameters specified when the job was requested; details about the job, including its ID, start and end times, nodes, cores, exit code, state, and working directory; and fairshare information for the user and their associated projects.
| Expand |
|---|
| title | Expand to view the showjob command querying specific details about a job. |
|---|
|
| Code Block |
|---|
[userA@mblog1 ~]$ showjob 1234567
Job 1234567 is not in the current Slurm queue
Accounting information from the Slurm database:
Job parameters for jobid 1234567:
|
|
...
...
...
...
...
...
...
---------------- ----------- ------- ------------ --------- ----------
|
|
...
...
...
...
...
...
SACCT: Get information about recent or completed jobs on the cluster with sacct
The default sacct command prints a list of your recent or recently completed jobs.
| Expand |
|---|
| title | Expand to see an example of running sacct as default |
|---|
|
| Code Block |
|---|
[user17@mblog1 ~]$ sacct
JobID JobName Partition Account AllocCPUS State ExitCode
------------ ---------- ---------- ---------- ---------- ---------- --------
1000000 sys/dashb+ mb aiproject 4 COMPLETED 0:0
1000000.bat+ batch aiproject 4 COMPLETED 0:0
1000000.ext+ extern aiproject 4 COMPLETED 0:0
1000003 sys/dashb+ mb aiproject 8 RUNNING 0:0
1000003.bat+ batch aiproject 8 RUNNING 0:0
1000003.ext+ extern aiproject 8 RUNNING 0:0 |
|
Helpful flags when calling sacct to tailor your query
...
Flag
...
Use this when
...
Short Form
...
Short Form Ex.
...
Long Form
...
Useful flag info, Long Form Example & Output
...
job
...
To get info about specific job#(s)
...
-j
...
sacct -j 1000013
...
--jobs
...
| title | Expand to see an example of running sacct with --jobs flag |
|---|
...
1234567.extern extern class2025A
Job details information for jobid 1234567:
JobID Submit Eligible Start Elapsed End CPUTime NNodes NCPUS ExitCode NodeList State
-------------------- --------- |
|
...
...
--------- ------------------- -------- --------- |
|
...
...
...
...
batch script
...
To view batch / submission script for a specific job
...
-B
...
sacct -j 1000101 -B
...
--batch-script
You must specify a job with the --jobs or -j flag in order to use the -B or --batch-script flag and see its associated batch / submission script. This does not work for interactive jobs started with salloc, or for jobs that were not submitted from a script.
...
| title | Expand to see an example of running sacct with --batch-script flag and output |
|---|
...
------ ----- -------- --------- ---------
1234567 2025-06-27T12:28:30 2025-06-27T12:28:30 2025-06-27T12:28:30 00:00:02 2025-06-27T12:28:32 00:00:02 1 1 0:0 mba30-004 COMPLETED
1234567.interactive 2025-06-27T12:28:30 2025-06-27T12:28:30 2025-06-27T12:28:30 00:00:02 2025-06-27T12:28:32 00:00:02 1 1 0:0 mba30-004 COMPLETED
1234567.extern 2025-06-27T12:28:30 2025-06-27T12:28:30 2025-06-27T12:28:30 00:00:02 2025-06-27T12:28:32 00:00:02 1 1 0:0 mba30-004 COMPLETED
Workdir
-------
/cluster/medbow/home/userA
User fairshare information from the sshare command:
Account User Partition RawShares NormShares RawUsage NormUsage EffectvUsage FairShare LevelFS GrpTRESMins TRESRunMins
-------------------- ---------- ------------ ---------- ----------- ----------- ----------- ------------- |
|
...
user
...
To get a printout of a specific user’s jobs
...
-u
...
sacct -u joeblow
...
--user
The --user (or -u) flag, shown with a username in the expandable example below, prints sacct info only for jobs submitted by the specified user:
...
| title | Expand to see an example of sacct command run with --user flag, and output |
|---|
...
---------- ---------- ------------------------------ ------------------------------
gr-distribstuff userA 1 0.500000 8430 0.000002 1.000000 0.534862 0.500000 |
|
...
cpu=0,mem=0,energy=0,node=0,b+
arcc-stuff |
|
...
...
...
...
...
...
...
...
...
...
start
...
To get a printout of job(s) starting after a date/time
...
-S
...
sacct -S 2024-11-01
...
--start
Dates and times should be specified in the format YYYY-MM-DD[THH:MM[:SS]], for example 2024-11-01 or 2024-11-01T08:30
...
| title | Expand to see an example of running sacct with --start and output |
|---|
...
.090909 380014 0.000091 0.229302 0.335780 0.396460 |
|
...
...
...
...
...
...
...
...
...
...
...
...
...
...
...
cpu=0,mem=0,energy=0,node=0,b+
sept24class |
|
...
end
...
To get a printout of job(s) ending before a given date/time
...
-E
...
sacct -E 2024-11-24T12:00:00
--end
...
cpu=0,mem=0,energy=0,node=0,b+ |
|
Common SLURM Commands
This section describes common SLURM commands and the flags you may want to include when running them. SLURM commands are often run with flags (appended to the command as --flag) that specify what information the output should include.
SQUEUE: Get information about running and queued jobs on the cluster with squeue
squeue reports on the jobs currently in the SLURM queue. Run with no flags, it prints every running and queued job on the cluster, listing each job’s ID, partition, name, username, state, elapsed time, number of nodes, and the list of nodes allocated to the job (or, for a pending job, the reason it is waiting):
| Expand |
|---|
| title | Expand to see an example of squeue command run and output |
|---|
|
| Code Block |
|---|
squeue
JOBID PARTITION NAME USER ST TIME NODES NODELIST(REASON)
1000001 inv-arcc myjob_11 user5 R 2-15:39:34 1 mba30-005
1000002 inv-lab2 AIML-CE joeblow R 6-13:02:32 1 mba30-004
1000005 inv-lab2 AIML-CE joeblow R 6-17:31:53 1 mba30-004
1000012 mb interact cowboyjoe R 2-21:28:49 1 mbcpu-010
1000015 mb sys/dash jsmith R 1:05:19 1 mbcpu-001
1000019 mb-a30 sys/dash janesmit R 8:45:36 1 mba30-006
1000022 mb-a30 Script.s doctorm PD 0:00 1 (Resources)
1000025 mb-a30 Script.22 doctorz R 7:05:44 1 mba30-001
1000028 mb-h100 sys/dash mmajor PD 0:00 1 (Resources)
1000033 mb-h100 sys/dash mmajor PD 0:00 1 (Priority)
1000037 mb-h100 sys/dash kjohnson PD 0:00 1 (Priority)
1000041 mb-h100 sys/dash kjohnson PD 0:00 1 (Priority)
1000045 mb-h100 sys/dash mmajor R 2-02:25:37 1 mbh100-003
1000058 mb-l40s Script.se doctorz R 1-00:58:25 1 mbl40s-003
1000062 teton C1225-TT user17 R 3-19:54:48 1 t507
1000065 teton C1225-TT user17 R 4-17:36:26 1 t502 |
|
submit line
...
To view the submit command for a specified job
...
-o SubmitLine
...
sacct -o SubmitLine -j 1000101
...
--format=SubmitLine
...
This uses the --format flag described above to print the command you entered to submit the job specified after the -j flag.
| Expand |
|---|
| title | Expand to see an example of running this command, and example output |
|---|
|
| Code Block |
|---|
[user11@mblog1 ~]$ sacct --format=SubmitLine -j 1000324
SubmitLine
--------------------
sbatch main_job.sh |
|
...
WorkDir
...
To view the working directory used by the job to execute commands
...
-o WorkDir
...
sacct -o WorkDir -j 1000101
...
--format=WorkDir
...
| Expand |
|---|
| title | Expand to see an example of running this command, and example output |
|---|
|
| Code Block |
|---|
[user11@mblog1 ~]$ sacct --format=WorkDir -j 1000324
WorkingDir
--------------------
/project/deeplearnlab/ |
|
My Job Failed. What Do these Exit Codes Mean?
Slurm records exit codes as numbers that can seem cryptic. We can’t always say what caused one without investigating, but some causes are more likely than others. An exit code is either a single number or two numbers separated by a colon (the exit status before the colon and, after it, the signal that ended the job, if any). Common exit codes and their likely causes are below:
Exit Code | Likely Cause |
|---|
0 | The job ran successfully |
Any non-zero value | The job failed in some form or another |
1 | A general failure |
2 | Something was wrong with a shell command in the script |
3 and above | Job error associated with software commands (check software specific exit codes) |
0:9 | The job was cancelled (usually by the user or by Slurm/the system) |
0:15 | The job was cancelled (usually because the user cancelled the job, or it ran over its specified walltime) |
0:53 | Some file or directory referenced in the script was not readable or writable |
0:125 | Job ran out of memory |
Anything else | Contact arcc-help@uwyo.edu to have us investigate |
SINFO: Get information about cluster nodes and partitions
The default sinfo command prints a list of all partitions, their states, availability, and associated nodes on the cluster.
| Expand |
|---|
| title | Expand to see an example of running the default sinfo command and its output, with no flags or arguments |
|---|
|
| Code Block |
|---|
[user1@mblog2 ~]$ sinfo
PARTITION AVAIL TIMELIMIT NODES STATE NODELIST
mb* up 7-00:00:00 1 mix mbcpu-007
mb* up 7-00:00:00 24 alloc mbcpu-[001-006,008-025]
mb-a30 up 7-00:00:00 1 maint mba30-008
mb-a30 up 7-00:00:00 3 mix mba30-[002,004,006]
... |
|
Helpful flags when calling squeue to tailor your query
Flag | Use this when | Short Form | Short Form Ex. | Long Form | Useful flag info, Long Form Example & Output |
|---|
me | To get a printout with just your jobs | n/a | n/a | --me
| The --me flag prints squeue info only for jobs that you submitted: | Expand |
|---|
| title | Expand to see an example of squeue command run with --me flag, & output |
|---|
| | Code Block |
|---|
[jsmith@mblog1 ~]$ squeue --me
|
|
|
...
|
user | To get a printout of a specific user’s jobs | -u
| squeue -u joeblow
| --user
| The --user (or -u) flag, shown with a username in the expandable example below, prints squeue info only for jobs submitted by the specified user: | Expand |
|---|
| title | Expand to see an example of squeue command run with --user flag, and output |
|---|
| | Code Block |
|---|
[jsmith@mblog1 ~]$ squeue --user=joeblow
|
|
|
...
...
...
...
...
...
...
...
...
...
...
...
...
...
...
...
...
...
...
|
long | To get a printout of jobs including wall time | -l | squeue -l
| --long
| The --long flag (shown in the expandable example below) prints the default squeue information plus the wall time requested for each job. | Expand |
|---|
| title | Expand to see an example of squeue command run with --long flag, and output |
|---|
| | Code Block |
|---|
squeue --long
Mon Jan 1 12:55:23 2020
|
|
|
...
...
...
...
...
...
...
...
...
TIME_LIMI NODES NODELIST(REASON)
|
|
|
...
mb interact cowboyjoe RUNNING 2-21:28:49 |
|
|
...
mb-a30 sys/dash janesmit RUNNING |
|
|
...
mb-a30 Script.22 doctorz RUNNING |
|
|
...
...
...
...
Helpful flags when calling sinfo to tailor your query
Flag | Use this when | Short Form | Short Form Ex. | Long Form | Useful flag info, Long Form Example & Output |
|---|
state | Shows any nodes in state(s) specified | -t
| sinfo -t reserved
| --states
| The --states flag prints sinfo output listing the nodes (if any) in the specified state, along with the number of nodes from each partition in that state. If no nodes in a partition are in the state, that partition’s line shows 0 nodes.
| Expand |
|---|
| title | Expand to see an example of sinfo command run with --states flag, and output |
|---|
|
| Code Block |
[jsmith@mblog1 ~]$ sinfo --states=mixed
PARTITION AVAIL TIMELIMIT NODES STATE NODELIST
mb* up 7mba30-001 mba30-002 mba30-003
1000028 mb-h100 sys/dash mmajor PENDING 0:00 1-00:00:00 1 |
|
|
0n/amb-a30up7-00:00:00 3 mixmba30-[002,004,006]mb-l40sup7-00:003mix mbl40s-[001-003]
mb-h100 up7-00:00:004mixmbh100-[001-003,005]a6000h100 sys/dash kjohnson PENDING |
|
|
up7-00:00:001mix mba6000-001
wildiris up 7-000n/atetonup 7-00:00:00 mb-h100 sys/dash kjohnson PENDING |
|
|
3mixt[460,502,507]beartoothup 7-000n/ainv-arccupinfinite 0 n/ainv-inbreup72mixt[502,507]inv-ssheshapup7-00:00:001 mb-l40s Script.se doctorz |
|
|
mixmba6000-001
inv-wysbcup 700000n/ainvsoc up7-00:00:001mix mbl40s001inv-wildirisup 7-00:00:00 0 n/a
inv-klab up 72mixmba30-[002,004]inv-daleup7-00:00:000n/ainv-wsbcup mix mba30-006
non-investor up 7-00:00:00 1 mix t460 sinfo squeue printout with specified format & output | -
|
Osinfo -O NodeAddr,AllocatedMem,Coressqueue -o Account,UserName,JobID,SubmitTime,StartTime,TimeLeft
| --
|
Formatformat
| If appended with the -- |
Format sinfo squeue info is given using specified format & output. Format should be indicated using column names recognized by SLURM (hint: run |
sinfo squeue --helpFormat to get a list of SLURM’s recognized column names) | Expand |
|---|
| title | Expand to see an example of squeue command run with --format flag, and output |
|---|
| | Code Block |
|---|
[user17@mblog1 ~]$ |
|
|
sinfoAllocMemAllocNodesAvailableCoresCPus,CPUsLoad,Disk,Gres,Nodes,MemoryALLOCMEMALLOCNODESAVAILCORESCPUSCPU_LOADTMP_DISKGRESNODESMEMORY886016allup4896 90.25 2024-08-14T10:31:07 2024-08-14T10:31:09 6-04:42:51
deeplearnlab user17 |
|
|
0(null) 2024-08-14T10:31:06 2024-08-14T10:31:07 6-04:42:49
|
|
|
11023575user17 1000099 2024-08-14T10:31:06 |
|
|
924576 all up 48 96 96.06-96.12 0 (null) 5 1023575
511296 all up 48 96 95.84 0 (null) 1 1023575
393216 all up 48 96 96.45-96.56 0 (null) 2 1023575
588096 all up 48 96 89.97 0 (null) 1 1023575
570336 all up 48 96 96.31-96.43 0 (null) 3 1023575
629376 all up 48 96 96.23-96.40 0 (null) 5 1023575
514912 all up 48 96 92.31 0 (null) 1 1023575
688416 all up 48 96 96.33 0 (null) 1 1023575
798304 all up 48 96 93.06 0 (null) 1 1023575
857344 all up 48 96 93.08 0 (null) 1 1023575
865536 all up 48 96 96.10-96.25 0 (null) 2 1023575
806496 all up 48 96 96.23 0 (null) 1 1023575
102400 all up 48 96 42.22 0 gpu:a30:8 1 765525
208896 all up 48 96 82.04 0 gpu:a30:8 1 765525
524288 all up 48 96 0.02 0 gpu:a30:8 1 765525
49152 all up 48 96 585.36 0 gpu:a30:8 1 765525
0 all up 48 96 0.00-0.02 0 gpu:a30:8 4 765525
0 all up 12 12 0.00 0 gpu:l40s:1 1 75469
0 all up 48 96 0.00 0 gpu:l40s:8 1 765525
524288 all up 48 96 4.41-5.24 0 gpu:l40s:8 2 7655252024-08-14T10:31:07 6-04:42:49
|
|
|
** You can also run squeue --help to get a comprehensive list of the flags available for the squeue command.
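Because squeue prints plain columnar text, its output can be summarized with standard tools. The following is a minimal sketch, not an ARCC-provided tool: count_by_state is a hypothetical helper that counts jobs per state from default squeue output, assuming the state is the fifth whitespace-separated column (a job name containing spaces would break that assumption). The sample rows are taken from the squeue example on this page.

```shell
#!/bin/sh
# count_by_state: hypothetical helper, reads default squeue output on stdin.
# Skips the header line, counts jobs per value of the ST column (column 5),
# and prints "STATE COUNT" pairs sorted by state.
count_by_state() {
    awk 'NR > 1 { n[$5]++ } END { for (s in n) print s, n[s] }' | sort
}

# On the cluster this could be used as:  squeue --me | count_by_state
# Here it runs on sample rows so the block works without SLURM.
count_by_state <<'EOF'
JOBID PARTITION NAME USER ST TIME NODES NODELIST(REASON)
1000001 inv-arcc myjob_11 user5 R 2-15:39:34 1 mba30-005
1000002 inv-lab2 AIML-CE joeblow R 6-13:02:32 1 mba30-004
1000019 mb-a30 sys/dash janesmit R 8:45:36 1 mba30-006
1000022 mb-a30 Script.s doctorm PD 0:00 1 (Resources)
EOF
```

For this sample the helper reports one pending (PD) job and three running (R) jobs.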
SACCT: Get information about recent or completed jobs on the cluster with sacct
The default sacct command prints a list of your recent or recently completed jobs.
| Expand |
|---|
| title | Expand to see an example of running sacct as default |
|---|
|
| Code Block |
|---|
[user17@mblog1 ~] sacct
JobID JobName Partition Account AllocCPUS State ExitCode
------------ ---------- ---------- ---------- ---------- ---------- --------
1000000 sys/dashb+ mb aiproject 4 COMPLETED 0:0
1000000.bat+ batch aiproject 4 COMPLETED 0:0
1000000.ext+ extern aiproject 4 COMPLETED 0:0
1000003 sys/dashb+ mb aiproject 8 RUNNING 0:0
1000003.bat+ batch aiproject 8 RUNNING 0:0
1000003.ext+ extern aiproject 8 RUNNING 0:0 |
|
Helpful flags when calling sacct to tailor your query
Flag | Use this when | Short Form | Short Form Ex. | Long Form | Useful flag info, Long Form Example & Output |
|---|
job | To get info about specific job#(s) | -j
| sacct -j 1000013
| --jobs
| | Expand |
|---|
| title | Expand to see an example of running sacct with --jobs flag |
|---|
| | Code Block |
|---|
[user05@mblog1 ~]$ sacct --jobs=1000013,1000025
JobID JobName Partition Account AllocCPUS State ExitCode
------------ ---------- ---------- ---------- ---------- ---------- --------
1000013 sys/dashb+ mb mlproject 4 TIMEOUT 0:0
1000013.bat+ batch mlproject 4 CANCELLED 0:15
1000013.ext+ extern mlproject 4 COMPLETED 0:0
1000025 sys/dashb+ mb mlproject 8 RUNNING 0:0
1000025.bat+ batch mlproject 8 RUNNING 0:0
1000025.ext+ extern mlproject 8 RUNNING 0:0 |
|
|
batch script | To view batch / submission script for a specific job | -B
| sacct -j 1000101 -B
| --batch-script
| You must specify a job with the --jobs or -j flag in order to use the -B or --batch-script flag and see its associated batch / submission script. This does not work for interactive jobs started with salloc, or for jobs that were not submitted from a script. | Expand |
|---|
| title | Expand to see an example of running sacct with --batch-script flag and output |
|---|
| | Code Block |
|---|
[user05@mblog1 ~] sacct -j 1000101 --batch-script
Batch Script for 1000101
---------------------------------------------------------------------
#!/bin/bash
#SBATCH --account=extrememl
#SBATCH --time=1:00:00
#SBATCH --mail-user=johnsmith@uwyo.edu
#SBATCH --mail-type=all
# Clear out and then load necessary software
module purge
module load gcc/14.2.0 r/4.4.0
# Browse to my project folder
cd /project/myprojdir/johnsmith/scripts/
# Export useful connection variables
export HOSTNAME
# Run my code
Rscript myscript.R |
|
|
user | To get a printout of a specific user’s jobs | -u
| sacct -u joeblow
| --user
| The --user (or -u) flag, shown with a username in the expandable example below, prints sacct info only for jobs submitted by the specified user: | Expand |
|---|
| title | Expand to see an example of sacct command run with --user flag, and output |
|---|
| | Code Block |
|---|
[joeblow@mblog1 ~]$ sacct --user=joeblow
JobID JobName Partition Account AllocCPUs State ExitCode
------- ------- --------- --------- --------- ------- --------
1000002 AIML-CE mb extremeai 4 RUNNING 0:0
1000005 AIML-CE mb extremeai 4 RUNNING 0:0 |
|
|
start | To get a printout of job(s) starting after a date/time | -S
| sacct -S 2024-11-01
| --start
| Dates and times should be specified in the format YYYY-MM-DD[THH:MM[:SS]], for example 2024-11-01 or 2024-11-01T08:30 | Expand |
|---|
| title | Expand to see an example of running sacct with --start and output |
|---|
| | Code Block |
|---|
[user05@mblog1 ~] sacct --start=2024-11-01
JobID JobName Partition Account AllocCPUS State ExitCode
------------ ---------- ---------- ---------- ---------- ---------- --------
1000013 sys/dashb+ mb mlproject 4 TIMEOUT 0:0
1000013.bat+ batch mlproject 4 CANCELLED 0:15
1000013.ext+ extern mlproject 4 COMPLETED 0:0
1000025 sys/dashb+ mb mlproject 8 RUNNING 0:0
1000025.bat+ batch mlproject 8 RUNNING 0:0
1000025.ext+ extern mlproject 8 RUNNING 0:0 |
|
|
end | To get a printout of job(s) ending before a given date/time | -E
| sacct -E 2024-11-24T12:00:00
| --end
| Dates and times should be specified in the format YYYY-MM-DD[THH:MM[:SS]] | Expand |
|---|
| title | Expand to see an example of running sacct with --start and --end flags and output |
|---|
| | Code Block |
|---|
[user05@mblog1 ~] sacct --start=2024-11-01 --end=2024-11-24
JobID JobName Partition Account AllocCPUS State ExitCode
------------ ---------- ---------- ---------- ---------- ---------- --------
1000013 sys/dashb+ mb mlproject 4 TIMEOUT 0:0
1000013.bat+ batch mlproject 4 CANCELLED 0:15
1000013.ext+ extern mlproject 4 COMPLETED 0:0
1000025 sys/dashb+ mb mlproject 8 RUNNING 0:0
1000025.bat+ batch mlproject 8 RUNNING 0:0
1000025.ext+ extern mlproject 8 RUNNING 0:0 |
|
|
format | To get sacct printout with specified format & output | -o
| sacct -o Account,JobID
| --format
| If appended with the --format flag, sacct prints only the columns you name. Columns should be given using the field names recognized by SLURM (hint: run sacct --helpformat to get a list of SLURM’s recognized field names) | Expand |
|---|
| title | Expand to see an example of sacct command run with --format flag, and output |
|---|
| | Code Block |
|---|
[user17@mblog1 ~]$ sacct --format="Account,JobID"
ACCOUNT JOBID
------------ -----------
deeplearnlab 1000062
deeplearnlab 1000091
deeplearnlab 1000099 |
|
|
submit line | To view the submit command for a specified job | -o SubmitLine
| sacct -o SubmitLine -j 1000101
| --format=SubmitLine
| This uses the --format flag described above to print the command you entered to submit the job specified after the -j flag. | Expand |
|---|
| title | Expand to see an example of running this command, and example output |
|---|
| | Code Block |
|---|
[user11@mblog1 ~]$ sacct --format=SubmitLine -j 1000324
SubmitLine
--------------------
sbatch main_job.sh |
|
|
WorkDir | To view the working directory used by the job to execute commands | -o WorkDir
| sacct -o WorkDir -j 1000101
| --format=WorkDir
| | Expand |
|---|
| title | Expand to see an example of running this command, and example output |
|---|
| | Code Block |
|---|
[user11@mblog1 ~]$ sacct --format=WorkDir -j 1000324
WorkingDir
--------------------
/project/deeplearnlab/ |
|
|
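The flags above can be combined, and sacct output can be filtered with standard tools. As a minimal sketch (problem_jobs is a hypothetical helper, not an ARCC or SLURM command), the function below lists jobs whose state is anything other than COMPLETED, RUNNING, or PENDING. It assumes pipe-delimited input with the columns JobID, State, and ExitCode, which sacct can produce with its --parsable2 (-P), --noheader (-n), and --allocations (-X) options. The sample lines are hand-written in that layout from job IDs and states shown in the examples above.

```shell
#!/bin/sh
# problem_jobs: hypothetical helper, reads "JobID|State|ExitCode" lines on stdin
# and prints the jobs that neither finished cleanly nor are still queued/running.
problem_jobs() {
    awk -F'|' '$2 != "COMPLETED" && $2 != "RUNNING" && $2 != "PENDING" {
        print $1, $2, $3
    }'
}

# On the cluster this could be fed from sacct, for example:
#   sacct -X -n -P --format=JobID,State,ExitCode -S 2024-11-01 | problem_jobs
# Here it runs on sample lines so the block works without SLURM.
problem_jobs <<'EOF'
1000000|COMPLETED|0:0
1000013|TIMEOUT|0:0
1000025|RUNNING|0:0
EOF
```

For this sample only job 1000013, which timed out, is printed.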
My Job Failed. What Do these Exit Codes Mean?
Slurm records exit codes as numbers that can seem cryptic. We can’t always say what caused one without investigating, but some causes are more likely than others. An exit code is either a single number or two numbers separated by a colon (the exit status before the colon and, after it, the signal that ended the job, if any). Common exit codes and their likely causes are below:
Exit Code | Likely Cause |
|---|
0 | The job ran successfully |
Any non-zero value | The job failed in some form or another |
1 | A general failure |
2 | Something was wrong with a shell command in the script |
3 and above | Job error associated with software commands (check software specific exit codes) |
0:9 | The job was cancelled (usually by the user or by Slurm/the system) |
0:15 | The job was cancelled (usually because the user cancelled the job, or it ran over its specified walltime) |
0:53 | Some file or directory referenced in the script was not readable or writable |
0:125 | Job ran out of memory |
Anything else | Contact arcc-help@uwyo.edu to have us investigate |
** You can also run sacct --help to get a comprehensive list of the flags available for the sacct command.
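As a rough aid for reading the exit-code table above, the lookup can be written as a small shell function. This is only a sketch: describe_exit_code is a hypothetical name, the messages restate the table's likely causes rather than diagnosing anything, and treating a trailing ":0" (for example "1:0") the same as the bare number is an assumption.

```shell
#!/bin/sh
# describe_exit_code: hypothetical helper, not part of SLURM or ARCC tooling.
# Takes an exit code such as "0:15", "1:0", or "2" and prints the likely cause
# from the table above.
describe_exit_code() {
    code=$1
    # Assumption: a trailing ":0" (no signal) means the same as the bare number.
    case $code in
        *:0) code=${code%:0} ;;
    esac
    case $code in
        0)     echo "The job ran successfully" ;;
        1)     echo "A general failure" ;;
        2)     echo "Something was wrong with a shell command in the script" ;;
        0:9)   echo "The job was cancelled (usually by the user or by Slurm/the system)" ;;
        0:15)  echo "The job was cancelled (by the user, or it ran over its walltime)" ;;
        0:53)  echo "A file or directory referenced in the script was not readable or writable" ;;
        0:125) echo "The job ran out of memory" ;;
        *)
            status=${code%%:*}    # the part before the colon
            case $status in
                ''|*[!0-9]*)
                    echo "Not a recognizable exit code" ;;
                *)
                    if [ "$status" -ge 3 ]; then
                        echo "Job error from a software command (check that software's exit codes)"
                    else
                        echo "Contact arcc-help@uwyo.edu to have us investigate"
                    fi ;;
            esac ;;
    esac
}

describe_exit_code 0:15
describe_exit_code 127:0
```

The ExitCode column of sacct output can be passed to the function one value at a time.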
SINFO: Get information about cluster nodes and partitions
The default sinfo command prints a list of all partitions, their states, availability, and associated nodes on the cluster.
| Expand |
|---|
| title | Expand to see an example of running the default sinfo command and its output, with no flags or arguments |
|---|
|
| Code Block |
|---|
[user1@mblog2 ~]$ sinfo
PARTITION AVAIL TIMELIMIT NODES STATE NODELIST
mb* up 7-00:00:00 1 mix mbcpu-007
mb* up 7-00:00:00 24 alloc mbcpu-[001-006,008-025]
mb-a30 up 7-00:00:00 |
|
...
...
...
...
...
mix mba30-[002,004,006]
mb-a30 |
|
...
...
...
...
...
...
mba30-[001,003,007]
mb-l40s |
|
...
...
...
...
...
1 resv mbl40s-004
mb-l40s |
|
...
...
...
...
mix mbl40s-[001-003]
mb-l40s up 7-00:00:00 |
|
...
1 idle mbl40s-007
mb-h100 |
|
...
...
1 drain$ mbh100-004
mb-h100 |
|
...
...
...
mix mbh100-[001-003,005]
mb-a6000 |
|
...
...
...
1 mix mba6000-001
wildiris |
|
...
...
...
...
...
...
...
...
...
...
...
...
...
...
idle t[285,287-296,501,503-506,508],thm[03-05],tmass[01-02],ttest[01-02]
beartooth |
|
...
...
up 7-00:00:00 1 idle b523
|
|
...
...
...
...
1 alloc mbcpu-025
inv-arcc |
|
...
...
2 idle ttest[01-02]
inv-inbre |
|
...
...
...
...
...
...
...
...
...
mbcpu-009
inv-inbre up 7-00:00:00 |
|
...
...
idle b523,mbl40s-007,t[285,287-296,501,503-506,508],thm[03-05],tmass[01-02]
inv-ssheshap |
|
...
...
...
mix mba6000-001
inv-wysbc |
|
...
up 7-00:00:00 1 alloc mbcpu-001
inv-wysbc |
|
...
...
...
...
...
...
...
idle wi[001-005]
inv-klab |
|
...
...
...
mix mba30-[002,004],mbcpu-007
inv-klab up 7-00:00:00 |
|
...
...
alloc mba30-005,mbcpu-[002-006]
inv-klab |
|
...
...
...
...
...
...
...
...
SEFF: Analyze the efficiency of a completed job with seff
This section gives a short breakdown of the seff command. Please see this page for a detailed description of how to evaluate your job’s performance and efficiency.
The seff command reports the CPU and memory efficiency of a job when given a valid job number as its argument: seff <job#>. The report is accurate only if the job completed successfully. For jobs that are still running, or that ended with an out-of-memory or other error, the seff output will be inaccurate.
| Expand |
|---|
| title | Expand to view an example of using the seff command, and its output |
|---|
|
| Code Block |
|---|
[]$ seff 10001001
Job ID: 10001001
Cluster: Medicinebow
User/Group: jsmith/mycoolproject
State: COMPLETED (exit code 0)
Cores: 1
CPU Utilized: 00:00:05
CPU Efficiency: 27.78% of 00:00:18 core-walltime
Job Wall-clock time: 00:00:18
Memory Utilized: 0.00 MB (estimated maximum)
Memory Efficiency: 0.00% of 8.00 GB (8.00 GB/node) |
|
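As a rough illustration of how the seff figures relate to each other, the sketch below parses the sample output shown above and recomputes CPU efficiency as CPU time used divided by (cores × wall-clock time). The function names (`parse_seff`, `hms_to_seconds`, `summarize`) are hypothetical helpers written for this example and are not part of seff or SLURM. It also assumes every line has the `Key: value` layout of the sample and that times use `[D-]HH:MM:SS`.

```python
# Sample text copied from the seff example above (not live output).
SAMPLE_SEFF_OUTPUT = """\
Job ID: 10001001
Cluster: Medicinebow
User/Group: jsmith/mycoolproject
State: COMPLETED (exit code 0)
Cores: 1
CPU Utilized: 00:00:05
CPU Efficiency: 27.78% of 00:00:18 core-walltime
Job Wall-clock time: 00:00:18
Memory Utilized: 0.00 MB (estimated maximum)
Memory Efficiency: 0.00% of 8.00 GB (8.00 GB/node)
"""


def parse_seff(text):
    """Split each 'Key: value' line into a dict entry (split at the first colon)."""
    fields = {}
    for line in text.splitlines():
        key, sep, value = line.partition(":")
        if sep:
            fields[key.strip()] = value.strip()
    return fields


def hms_to_seconds(hms):
    """Convert an assumed [D-]HH:MM:SS string to seconds."""
    days = 0
    if "-" in hms:
        day_part, hms = hms.split("-", 1)
        days = int(day_part)
    hours, minutes, seconds = (int(part) for part in hms.split(":"))
    return days * 86400 + hours * 3600 + minutes * 60 + seconds


def summarize(text):
    """Recompute CPU efficiency and flag whether the numbers can be trusted."""
    fields = parse_seff(text)
    cores = int(fields["Cores"])
    cpu_used = hms_to_seconds(fields["CPU Utilized"])
    core_walltime = cores * hms_to_seconds(fields["Job Wall-clock time"])
    efficiency = 100.0 * cpu_used / core_walltime if core_walltime else 0.0
    return {
        "job_id": fields["Job ID"],
        "cpu_efficiency_pct": round(efficiency, 2),
        # Only successfully completed jobs give accurate seff numbers.
        "reliable": fields["State"].startswith("COMPLETED"),
    }


if __name__ == "__main__":
    print(summarize(SAMPLE_SEFF_OUTPUT))
```

For the sample job, 5 seconds of CPU time over 1 core × 18 seconds of wall-clock time gives the same 27.78% that seff reports. The `reliable` flag simply mirrors the caveat above about jobs that did not complete successfully.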
...
| title | Links to more information on SLURM commands |
|---|
...
ARCCJOBS: Get a report of jobs currently running on the cluster
...
up 7-00:00:00 1 mix mba30-006
inv-wsbc up 7-00:00:00 1 alloc mbcpu-010
non-investor up 7-00:00:00 1 mix t460
non-investor up 7-00:00:00 14 alloc mbcpu-[011-024] |
|
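The NODELIST column in the sinfo output above uses a compressed bracket notation such as mbcpu-[001-006,008-025], and the same node can appear under more than one partition. The sketch below is one possible way to expand that notation and count distinct nodes per state from a few rows copied from the output above. The helper names are hypothetical. It assumes whitespace-separated rows with six columns and only handles a single bracket group at the end of each name. On the cluster itself, `scontrol show hostnames <nodelist>` performs the expansion for you.

```python
import re

# Rows copied from the sinfo example output above (not live data).
SAMPLE_SINFO_ROWS = """\
mb* up 7-00:00:00 24 alloc mbcpu-[001-006,008-025]
inv-wsbc up 7-00:00:00 1 alloc mbcpu-010
non-investor up 7-00:00:00 1 mix t460
non-investor up 7-00:00:00 14 alloc mbcpu-[011-024]
"""


def split_top_level(nodelist):
    """Split a node list on commas that are not inside [ ] brackets."""
    parts, current, depth = [], [], 0
    for ch in nodelist:
        if ch == "[":
            depth += 1
        elif ch == "]":
            depth -= 1
        if ch == "," and depth == 0:
            parts.append("".join(current))
            current = []
        else:
            current.append(ch)
    if current:
        parts.append("".join(current))
    return parts


def expand_nodelist(nodelist):
    """Expand e.g. 'mbcpu-[001-003]' into ['mbcpu-001', 'mbcpu-002', 'mbcpu-003']."""
    hosts = []
    for part in split_top_level(nodelist):
        match = re.fullmatch(r"([^\[\]]*)\[([^\[\]]+)\]", part)
        if not match:
            hosts.append(part)  # plain hostname, or a form this sketch does not handle
            continue
        prefix, ranges = match.groups()
        for item in ranges.split(","):
            if "-" in item:
                start, end = item.split("-")
                width = len(start)  # keep zero padding, e.g. 001
                hosts.extend(f"{prefix}{n:0{width}d}" for n in range(int(start), int(end) + 1))
            else:
                hosts.append(prefix + item)
    return hosts


def parse_sinfo_rows(text):
    """Turn default-format sinfo rows into dicts, skipping headers and partial lines."""
    rows = []
    for line in text.splitlines():
        cols = line.split()
        if len(cols) != 6 or cols[0] == "PARTITION":
            continue
        keys = ("partition", "avail", "timelimit", "nodes", "state", "nodelist")
        rows.append(dict(zip(keys, cols)))
    return rows


def hosts_by_state(rows):
    """Collect the distinct hostnames seen in each state across all partitions."""
    states = {}
    for row in rows:
        states.setdefault(row["state"], set()).update(expand_nodelist(row["nodelist"]))
    return states


if __name__ == "__main__":
    for state, hosts in sorted(hosts_by_state(parse_sinfo_rows(SAMPLE_SINFO_ROWS)).items()):
        print(state, len(hosts))
```

Using sets avoids counting a node twice when it is listed under several partitions, which a simple sum of the NODES column would do.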
Helpful flags when calling sinfo to tailor your query
Flag | Use this when | Short Form | Short Form Ex. | Long Form | Useful flag info, Long Form Example & Output |
|---|
state | Shows any nodes in state(s) specified | -t
| sinfo -t reserved
| --states
| The --states flag prints the sinfo output for the specified state(s), listing the nodes (if any) in that state and the number of such nodes in each partition. If no nodes in a partition are in the state, that partition’s line shows 0 nodes. | Expand |
|---|
| title | Expand to see an example of |
|---|
|
|
...
| sinfo command run with --states flag, and output |
| | Code Block |
|---|
[jsmith@mblog1 ~]$ |
|
|
...
sinfo --states=mixed
PARTITION AVAIL TIMELIMIT NODES STATE NODELIST
mb* up 7-00:00:00 0 |
|
|
...
n/a
mb-a30 up 7-00:00:00 3 |
|
|
...
mix mba30-[002,004,006]
mb-l40s up 7-00:00:00 |
|
|
...
...
mix mbl40s-[001-003]
mb-h100 up 7-00:00:00 4 |
|
|
...
mix mbh100-[001-003,005]
mb-a6000 up 7-00:00:00 1 |
|
|
...
...
...
...
...
...
mix t[460,502,507]
beartooth |
|
|
...
...
...
...
...
...
n/a
inv-inbre up 7-00:00:00 2 |
|
|
...
mix t[502,507]
inv-ssheshap |
|
|
...
...
...
mix mba6000-001
inv-wysbc |
|
|
...
...
...
...
n/a
inv-soc up 7-00:00:00 1 |
|
|
...
mix mbl40s-001
inv-wildiris |
|
|
...
...
...
...
...
...
mba30-[002,004]
inv-dale up 7-00:00:00 |
|
|
...
...
...
...
mix mba30-006
non-investor |
|
|
...
...
...
...
|
format | To get sinfo printout with specified format & output | -O
| sinfo -O NodeAddr,AllocMem,Cores
| --Format
| With the --Format flag, sinfo prints only the specified fields, in the order given. Specify the fields using names recognized by SLURM (hint: run sinfo --helpFormat to get a list of SLURM’s recognized field names). | Expand |
|---|
| title | Expand to see an example of sinfo command run with --Format flag, and output |
|---|
| | Code Block |
|---|
[user17@mblog1 ~]$ sinfo --Format="AllocMem,AllocNodes,Available,Cores,CPus,CPUsLoad,Disk,Gres,Nodes,Memory"
ALLOCMEM ALLOCNODES AVAIL CORES CPUS CPU_LOAD TMP_DISK GRES NODES |
|
|
...
...
...
...
...
...
...
...
...
...
...
...
...
...
...
...
...
...
...
...
...
...
...
...
...
...
...
...
...
...
...
...
...
...
...
...
...
...
...
...
...
...
...
...
...
...
...
...
...
...
...
...
...
...
...
...
...
...
...
...
...
...
...
...
...
...
...
...
...
...
...
...
...
...
...
...
1023575
688416 all up 48 96 96.33 0 (null) 1 1023575
798304 all |
|
|
...
...
...
...
...
...
...
ARCCQUOTA: Get a report of your common HPC data storage locations and usage
arccquota shows information about storage quotas. By default, it displays $HOME and $SCRATCH quotas first, followed by the user's associated project quotas. This is a change on Teton from Mount Moran, but the tool is much more comprehensive. The command takes arguments to show project-only quotas (i.e., no $HOME or $SCRATCH info displayed), to give an extensive listing of users' quotas and usage within project directories, or to summarize quotas (i.e., no user-specific usage on project spaces).
...
| title | Expand to view the default arccquota command and example output |
|---|
...
0 (null) 1 1023575
857344 all up 48 96 93.08 0 (null) 1 1023575
865536 all up 48 96 96.10-96.25 0 (null) 2 1023575
806496 all up 48 96 96.23 0 (null) 1 1023575
102400 all up 48 96 42.22 0 gpu:a30:8 1 765525
208896 all up 48 96 82.04 0 gpu:a30:8 1 765525
524288 all up 48 96 0.02 0 gpu:a30:8 1 765525
49152 all up 48 96 585.36 0 gpu:a30:8 1 765525
0 all up 48 96 0.00-0.02 0 gpu:a30:8 4 765525
0 all up 12 12 0.00 0 gpu:l40s:1 1 75469
0 all up 48 96 0.00 0 gpu:l40s:8 1 765525
524288 all up 48 96 4.41-5.24 0 gpu:l40s:8 2 765525
262144 all up 48 96 2.43 0 gpu:l40s:8 1 765525
0 all up 48 96 0.00 0 gpu:l40s:4 1 765525
0 all up 48 96 0.35 0 gpu:h100:8 1 1281554
524288 all up 48 96 0.26-12.20 0 gpu:h100:8 4 1281554
|
|
|
...
...
...
...
...
...
...
...
...
14+ 28+ 0.00-0.01 0 (null) |
|
|
...
...
...
...
...
...
...
...
...
...
...
...
...
...
...
...
...
...
...
...
...
...
...
...
...
| title | Expand to view the arccquota command querying a specified user and example output |
|---|
...
gpu:a30:2 1 1020129
32768 all up 16 32 15.17 |
|
|
...
...
...
...
...
...
...
...
...
...
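The sinfo --Format output shown above can also be post-processed. As a hedged sketch, the snippet below takes a few rows copied from that output and computes how much of each node's memory is currently allocated (AllocMem divided by Memory). The column order is taken from the field list passed to --Format in the example. The helper names are hypothetical. Splitting on whitespace is an assumption that holds for these sample rows but may not hold if sinfo pads or truncates wide values.

```python
# Field order as requested in the sinfo --Format example above.
FIELDS = ["AllocMem", "AllocNodes", "Available", "Cores", "CPUs",
          "CPUsLoad", "Disk", "Gres", "Nodes", "Memory"]

# Rows copied from the example output above (not live data).
SAMPLE_FORMAT_ROWS = """\
102400 all up 48 96 42.22 0 gpu:a30:8 1 765525
208896 all up 48 96 82.04 0 gpu:a30:8 1 765525
524288 all up 48 96 4.41-5.24 0 gpu:l40s:8 2 765525
0 all up 48 96 0.35 0 gpu:h100:8 1 1281554
"""


def parse_format_rows(text, fields=FIELDS):
    """Map each whitespace-separated row onto the requested field names."""
    rows = []
    for line in text.splitlines():
        cols = line.split()
        # Skip header lines, elided lines, and anything with the wrong column count.
        if len(cols) != len(fields) or not cols[0].isdigit():
            continue
        rows.append(dict(zip(fields, cols)))
    return rows


def memory_allocated_pct(row):
    """Percentage of memory allocated, from the AllocMem and Memory columns."""
    total = int(row["Memory"])
    return 100.0 * int(row["AllocMem"]) / total if total else 0.0


if __name__ == "__main__":
    for row in parse_format_rows(SAMPLE_FORMAT_ROWS):
        print(f'{row["Gres"]:<12} nodes={row["Nodes"]:<3} '
              f'mem allocated={memory_allocated_pct(row):.1f}%')
```

Values such as CPUsLoad are left as strings because sinfo prints a range (for example 4.41-5.24) when a row groups several nodes.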
| Expand |
|---|
| title | Links to more information on SLURM commands |
|---|
|
| Insert excerpt |
|---|
| Slurm Workload Manager |
|---|
| Slurm Workload Manager |
|---|
| name | Link to Slurm info |
|---|
|
|