
Overview: HPC Information and Compute Job Information

System querying helps you understand what is happening on the system: which compute jobs are running, your storage quotas, job history, and so on. This page contains commands and examples showing how to find that information.

...

ARCC Specific Commands


ARCCJOBS: Get a report of jobs currently running on the cluster

arccjobs prints a summary of running and pending jobs, CPU resources, and requested/used CPU hours, grouped by account and user. It doesn't take any arguments or options.

Expand
titleExpand to see an example of calling arccjobs and example output
Code Block
$ arccjobs
===============================================================================
Account                     Running                      Pending
  User               jobs    cpus         cpuh    jobs    cpus         cpuh
===============================================================================
advanceddl              1       1         1.09       0       0         0.00
  joeblow               1       1         1.09       0       0         0.00
arcc                    2       8        22.01       0       0         0.00
  arcchelper1           1       4         5.61       0       0         0.00
  arccstaff2            1       4        16.40       0       0         0.00
llmproj                 6     192      5229.23       2      64     10752.00
  user1                 4     128      4769.78       0       0         0.00
  johnsmith             2      64       459.45       2      64     10752.00
physicsclass            2      13        16.34       0       0         0.00
  student5              1      12        15.22       0       0         0.00
  classta2              1       1         1.12       0       0         0.00
researchlab            14     613       882.26       2     120      9600.00
  gradresrcher1         2       9         5.82       0       0         0.00
  researcher18         12     604       876.43       2     120      9600.00
....(CONT)
===============================================================================
TOTALS:                25     827     41597.79     320     500     22248.00
===============================================================================
Nodes                   39/79      (49.37%)
Cores                 2514/5492    (45.78%)
Memory (GB)          16025/60278   (26.58%)
CPU Load              2591.46      (47.19%)
===============================================================================

ARCCQUOTA: Get a report of your common HPC data storage locations and usage

arccquota shows information relating to storage quotas. By default, it displays $HOME and $SCRATCH quotas first, followed by the user's associated project quotas. This is a change on Teton from Mount Moran, but the tool is much more comprehensive. The command accepts arguments to show project quotas only (i.e., no $HOME or $SCRATCH info displayed), to give an extensive listing of users' quotas and usage within project directories, and to summarize quotas (i.e., no user-specific usage on project spaces).

Expand
titleExpand to view the default arccquota command and example output
Code Block
[jsmith@mblog1 ~]$ arccquota
+----------------------------------------------------------------------+
|              arccquota              |             Block              |
+----------------------------------------------------------------------+
|                Path                 | Used       Limit     %         |
+----------------------------------------------------------------------+
| /home/jsmith                        | 31.35 GB   50.00 GB  62.71     |
| /gscratch/jsmith                    | 550.44 MB  05.00 TB  00.01     |
| /project/awesomeresearchproj        | 04.96 GB   05.00 TB  00.10     |
+----------------------------------------------------------------------+
Expand
titleExpand to view the arccquota command querying a specified user and example output
Code Block
[jsmith@mblog1 ~]$ arccquota -u collaboratorfriend
+----------------------------------------------------------------------+
|              arccquota              |             Block              |
+----------------------------------------------------------------------+
|                Path                 | Used       Limit     %         |
+----------------------------------------------------------------------+
| /home/collaboratorfriend            | 49.55 GB   50.00 GB  99.20     |
| /gscratch/collaboratorfriend        | 5.4 MB     05.00 TB  00.00     |
| /project/awesomeresearchproj        | 04.96 GB   05.00 TB  00.10     |
+----------------------------------------------------------------------+
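If you want to spot storage locations that are close to their limit, the table above can be filtered with a small helper. The following is only a sketch: the function name quota_over and the default threshold of 90 are made up for illustration, and it assumes the output layout matches the examples on this page (path in the first cell, percentage as the last value in the second cell).

```shell
# Hypothetical helper (not part of arccquota): print the paths whose usage
# percentage is at or above a threshold. Reads arccquota-style table text
# on stdin and assumes the column layout shown in the examples above.
quota_over() {
    awk -F'|' -v limit="${1:-90}" '
        # Data rows look like: | /path | used limit percent |
        $2 ~ /^ *\// {
            path = $2
            gsub(/^ +| +$/, "", path)      # trim the path cell
            n = split($3, cell, " ")       # e.g. "31.35 GB 50.00 GB 62.71"
            pct = cell[n] + 0              # last value is the percentage
            if (pct >= limit) printf "%s %.2f\n", path, pct
        }'
}
```

A typical call would be arccquota | quota_over 80, which lists only the locations that are at least 80% full.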
 

SHOWJOB: Get job parameters, and details for a job

Running showjob with a job ID reports the parameters specified when the job was requested, details about the job (ID, submit, start and end times, nodes, cores, exit code, and state), the working directory, and fairshare information for the user and their associated projects.

Expand
titleExpand to view the showjob command querying specific details about a job.
Code Block
[userA@mblog1 ~]$ showjob 1234567

Job 1234567 is not in the current Slurm queue

Accounting information from the Slurm database:

Job parameters for jobid 1234567:
               JobID     JobName    User      Account  Partition  Timelimit
-------------------- ----------- ------- ------------ ---------- ----------
             1234567 interactive   userA   class2025A     mb-a30 1-00:00:00
 1234567.interactive interactive           class2025A
      1234567.extern      extern           class2025A

Job details information for jobid 1234567:
               JobID              Submit            Eligible               Start  Elapsed                 End  CPUTime NNodes NCPUS ExitCode  NodeList     State
-------------------- ------------------- ------------------- ------------------- -------- ------------------- -------- ------ ----- -------- --------- ---------
             1234567 2025-06-27T12:28:30 2025-06-27T12:28:30 2025-06-27T12:28:30 00:00:02 2025-06-27T12:28:32 00:00:02      1     1      0:0 mba30-004 COMPLETED
 1234567.interactive 2025-06-27T12:28:30 2025-06-27T12:28:30 2025-06-27T12:28:30 00:00:02 2025-06-27T12:28:32 00:00:02      1     1      0:0 mba30-004 COMPLETED
      1234567.extern 2025-06-27T12:28:30 2025-06-27T12:28:30 2025-06-27T12:28:30 00:00:02 2025-06-27T12:28:32 00:00:02      1     1      0:0 mba30-004 COMPLETED

Workdir
-------
/cluster/medbow/home/userA

User fairshare information from the sshare command:
Account                    User    Partition  RawShares  NormShares    RawUsage   NormUsage  EffectvUsage  FairShare    LevelFS                    GrpTRESMins                    TRESRunMins
-------------------- ---------- ------------ ---------- ----------- ----------- ----------- ------------- ---------- ---------- ------------------------------ ------------------------------
gr-distribstuff           userA                       1    0.500000        8430    0.000002      1.000000   0.534862   0.500000                                cpu=0,mem=0,energy=0,node=0,b+
arcc-stuff                userA                       1    0.090909      380014    0.000091      0.229302   0.335780   0.396460                                cpu=0,mem=0,energy=0,node=0,b+
sept24class               userA                       1    0.028571           0    0.000000      0.011517   0.733945   2.480788                                cpu=0,mem=0,energy=0,node=0,b+

Common SLURM Commands

The following describes common SLURM commands and common flags you may want to include when running them. SLURM commands are often run with flags (appended to the command with --flag) to stipulate specific information that should be included in output.

SQUEUE: Get information about running and queued jobs on the cluster with squeue

squeue reports on the jobs that currently exist in the SLURM queue. Run with no flags, it prints all running and queued jobs on the cluster, listing each job’s ID, partition, name, username, state, elapsed time, number of nodes, and a node list with the names of the nodes allocated to the job (or, for a pending job, the reason it is waiting):

Expand
titleExpand to see an example of squeue command run and output
Code Block
squeue
             JOBID PARTITION     NAME      USER ST       TIME  NODES NODELIST(REASON)
           1000001  inv-arcc myjob_11     user5  R 2-15:39:34      1 mba30-005
           1000002  inv-lab2  AIML-CE   joeblow  R 6-13:02:32      1 mba30-004
           1000005  inv-lab2  AIML-CE   joeblow  R 6-17:31:53      1 mba30-004
           1000012        mb interact cowboyjoe  R 2-21:28:49      1 mbcpu-010
           1000015        mb sys/dash    jsmith  R    1:05:19      1 mbcpu-001
           1000019    mb-a30 sys/dash  janesmit  R    8:45:36      1 mba30-006
           1000022    mb-a30 Script.s   doctorm PD       0:00      1 (Resources)
           1000025    mb-a30 Script.22  doctorz  R    7:05:44      1 mba30-001
           1000028   mb-h100 sys/dash    mmajor PD       0:00      1 (Resources)
           1000033   mb-h100 sys/dash    mmajor PD       0:00      1 (Priority)
           1000037   mb-h100 sys/dash  kjohnson PD       0:00      1 (Priority)
           1000041   mb-h100 sys/dash  kjohnson PD       0:00      1 (Priority)
           1000045   mb-h100 sys/dash    mmajor  R 2-02:25:37      1 mbh100-003
           1000058   mb-l40s Script.se  doctorz  R 1-00:58:25      1 mbl40s-003
           1000062     teton C1225-TT    user17  R 3-19:54:48      1 t507
           1000065     teton C1225-TT    user17  R 4-17:36:26      1 t502

format

To get sacct printout with specified format & output

-o

sacct -o Account,JobID

--format

If appended with the --format flag, sacct info is given using the specified format & output. Format should be indicated using column names recognized by SLURM (hint: run sacct --helpformat to get a list of SLURM’s recognized column names).

titleExpand to see an example of sacct command run with --format flag, and output

...

submit line

To view the submit command for a specified job

-o SubmitLine

sacct -o SubmitLine -j 1000101

--format=SubmitLine

This is a way of using the --format flag from above to print the command you entered to submit the job specified after the -j flag.

Expand
titleExpand to see an example of running this command, and example output
Code Block
[user11@mblog1 ~]$ sacct --format=SubmitLine -j 1000324
          SubmitLine 
-------------------- 
  sbatch main_job.sh 

WorkDir

To view the working directory used by the job to execute commands

-o WorkDir

sacct -o WorkDir -j 1000101

--format=WorkDir

Expand
titleExpand to see an example of running this command, and example output
Code Block
[user11@mblog1 ~]$ sacct --format=WorkDir -j 1000324
             WorkDir 
-------------------- 
  /project/deeplearnlab/ 

My Job Failed. What Do these Exit Codes Mean?

Slurm records exit codes as numerical values that can seem rather cryptic. We can't always know what caused a given code without investigating, but some causes are more likely than others. An exit code is either a single number or two numbers separated by a colon. Common exit codes and their likely causes are below:

Exit Code | Likely Cause
0 | The job ran successfully
Any non-zero value | The job failed in some form or another
1 | A general failure
2 | Something was wrong with a shell command in the script
3 and above | Job error associated with software commands (check software specific exit codes)
0:9 | The job was cancelled (usually by the user or by Slurm/the system)
0:15 | The job was cancelled (usually because the user cancelled the job, or it ran over its specified walltime)
0:53 | Some file or directory referenced in the script was not readable or writable
0:125 | Job ran out of memory
Anything else | Contact arcc-help@uwyo.edu to have us investigate
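The table above can be turned into a quick lookup for use in your own scripts. This is a minimal sketch, not an ARCC tool: the function name explain_exit_code is made up, and treating a value such as 1:0 the same as a bare 1 is an assumption made for illustration. The messages are paraphrased from the table.

```shell
# Hypothetical lookup (not an ARCC or Slurm tool): map an exit code string,
# as shown in the ExitCode column, to the likely cause from the table above.
explain_exit_code() {
    code="$1"
    # Exact matches from the table come first.
    case "$code" in
        0|0:0) echo "The job ran successfully"; return 0 ;;
        0:9)   echo "The job was cancelled (usually by the user or by Slurm/the system)"; return 0 ;;
        0:15)  echo "The job was cancelled (by the user, or it ran over its walltime)"; return 0 ;;
        0:53)  echo "Some file or directory referenced in the script was not readable or writable"; return 0 ;;
        0:125) echo "Job ran out of memory"; return 0 ;;
    esac
    # Assumption: otherwise classify by the number before the colon.
    status="${code%%:*}"
    case "$status" in
        ''|*[!0-9]*) echo "Contact arcc-help@uwyo.edu to have us investigate"; return 0 ;;
    esac
    if [ "$status" -eq 1 ]; then
        echo "A general failure"
    elif [ "$status" -eq 2 ]; then
        echo "Something was wrong with a shell command in the script"
    elif [ "$status" -ge 3 ]; then
        echo "Job error associated with software commands (check software specific exit codes)"
    else
        echo "Contact arcc-help@uwyo.edu to have us investigate"
    fi
}
```

For example, explain_exit_code 0:125 prints the out-of-memory message, while an unlisted value such as 0:7 falls through to the "contact us" message.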

SINFO: Get information about cluster nodes and partitions

The default sinfo command: This prints a list of all partitions, their states, availability, and associated nodes on the cluster

Expand
titleExpand to see an example of running the default sinfo command and its output, with no flags or arguments
Code Block
[user1@mblog2 ~]$ sinfo
PARTITION    AVAIL  TIMELIMIT  NODES  STATE NODELIST
mb*             up 7-00:00:00      1    mix mbcpu-007
mb*             up 7-00:00:00     24  alloc mbcpu-[001-006,008-025]
mb-a30          up 7-00:00:00      1  maint mba30-008
mb-a30          up 7-00:00:00      3    mix mba30-[002,004,006]
....(CONT)
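Because sinfo prints one line per partition and state, it can be handy to total the node counts for a single partition. The sketch below assumes the default sinfo column order shown above (partition first, node count fourth, state fifth); the function name nodes_in_partition is hypothetical.

```shell
# Hypothetical helper: total the NODES column per STATE for one partition.
# Reads default sinfo output (including its header line) on stdin.
nodes_in_partition() {
    awk -v part="$1" '
        NR == 1 { next }                    # skip the header row
        {
            name = $1
            sub(/\*$/, "", name)            # drop the default-partition marker
            if (name == part) count[$5] += $4
        }
        END { for (state in count) print state, count[state] }' | sort
}
```

For example, sinfo | nodes_in_partition mb would print one line per state, such as "alloc 24" and "mix 1" for the sample output above.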

Helpful flags when calling squeue to tailor your query

Flag

Use this when

Short Form

Short Form Ex.

Long Form

Useful flag info, Long Form Example & Output

me

To get a printout with just your jobs

n/a

n/a

--me

The --me flag prints the squeue info only for jobs submitted by you:

Expand
titleExpand to see an example of squeue command run with --me flag, & output
Code Block
[jsmith@mblog1 ~]$ squeue --me
             JOBID PARTITION     NAME      USER ST       TIME  NODES NODELIST(REASON)
           1000002  inv-lab2  AIML-CE    jsmith  R 6-13:02:32      1 mba30-004
           1000005  inv-lab2  AIML-CE    jsmith  R 6-17:31:53      1 mba30-004

user

To get a printout of a specific user’s jobs

-u

squeue -u joeblow

--user

The --user or -u flag (shown in the expandable example below with a username specified) prints the squeue info only for jobs submitted by the specified user:

Expand
titleExpand to see an example of squeue command run with --user flag, and output
Code Block
[jsmith@mblog1 ~]$ squeue --user=joeblow
             JOBID PARTITION     NAME      USER ST       TIME  NODES NODELIST(REASON)
           1000002  inv-lab2  AIML-CE   joeblow  R 6-13:02:32      1 mba30-004
           1000005  inv-lab2  AIML-CE   joeblow  R 6-17:31:53      1 mba30-004

long

To get a printout of jobs including wall time

-l

squeue -l

--long

The --long flag (shown in the expandable example below) will print the above information as well as the wall time requested for the job.

Expand
titleExpand to see an example of squeue command run with --long flag, and output
Code Block
squeue --long

Mon Jan 1 12:55:23 2020
             JOBID PARTITION     NAME      USER    STATE       TIME  TIME_LIMI  NODES NODELIST(REASON)
           1000001  inv-arcc myjob_11     user5  RUNNING 2-15:39:34 3-00:00:00      1 mba30-005
           1000002  inv-lab2  AIML-CE   joeblow  RUNNING 6-13:11:23 7-00:00:00      1 mba30-004
           1000005  inv-lab2  AIML-CE   joeblow  RUNNING 6-17:31:53 7-00:00:00      1 mba30-004
           1000012        mb interact cowboyjoe  RUNNING 2-21:28:49 3-00:00:00      1 mbcpu-010
           1000015        mb sys/dash    jsmith  RUNNING    1:05:19    5:00:00      1 mbcpu-001
           1000019    mb-a30 sys/dash  janesmit  RUNNING    8:45:36 4-09:00:00      1 mba30-006
           1000022    mb-a30 Script.s   doctorm  PENDING       0:00 1-00:00:00      1 (Resources)
           1000025    mb-a30 Script.22  doctorz  RUNNING    7:05:44 1-00:00:00      3 mba30-001 mba30-002 mba30-003
           1000028   mb-h100 sys/dash    mmajor  PENDING       0:00 1-00:00:00      1 (Resources)
           1000033   mb-h100 sys/dash    mmajor  PENDING       0:00    1:00:00      1 (Priority)
           1000037   mb-h100 sys/dash  kjohnson  PENDING       0:00    5:00:00      1 (Priority)
           1000041   mb-h100 sys/dash  kjohnson  PENDING       0:00    2:00:00      1 (Priority)
           1000045   mb-h100 sys/dash    mmajor  RUNNING 2-02:25:37 3-00:00:00      1 mbh100-003
           1000058   mb-l40s Script.se  doctorz  RUNNING 1-00:58:25 2-00:00:00      1 mbl40s-003
           1000062     teton C1225-TT    user17  RUNNING 3-19:54:48 5-00:00:00      1 t507
           1000065     teton C1225-TT    user17  RUNNING 4-17:36:26 7-00:00:00      1 t502
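With the --long output you can work out how much wall time a job has left by subtracting TIME from TIME_LIMIT. The sketch below converts the time strings printed by squeue (days-hours:minutes:seconds, hours:minutes:seconds, or minutes:seconds) into seconds. The function name slurm_time_to_seconds is made up for illustration, and values such as UNLIMITED are not handled.

```shell
# Hypothetical helper: convert a squeue time string such as 2-15:39:34,
# 1:05:19 or 0:00 into a number of seconds.
slurm_time_to_seconds() {
    printf '%s\n' "$1" | awk '{
        days = 0; clock = $0
        if (index(clock, "-") > 0) {       # optional leading "days-"
            split(clock, dc, "-")
            days = dc[1]; clock = dc[2]
        }
        n = split(clock, f, ":")
        if (n == 3)      secs = f[1] * 3600 + f[2] * 60 + f[3]
        else if (n == 2) secs = f[1] * 60 + f[2]
        else             secs = f[1]
        print days * 86400 + secs
    }'
}
```

For job 1000001 in the example above, the remaining wall time could be computed as echo $(( $(slurm_time_to_seconds 3-00:00:00) - $(slurm_time_to_seconds 2-15:39:34) )).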

Helpful flags when calling sinfo to tailor your query

Flag

Use this when

Short Form

Short Form Ex.

Long Form

Useful flag info, Long Form Example & Output

state

Shows any nodes in state(s) specified

-t

sinfo -t reserved

--states

The --states flag prints the sinfo output restricted to nodes in the specified state(s), with the number of nodes from each partition in that state. If no nodes in a partition are in the state, the number of nodes will be 0 for that partition’s line.

Expand
titleExpand to see an example of sinfo command run with --states flag, and output
Code Block
[jsmith@mblog1 ~]$ sinfo --states=mixed
PARTITION    AVAIL  TIMELIMIT  NODES  STATE NODELIST
mb*             up 7-00:00:00      0    n/a
mb-a30          up 7-00:00:00      3    mix mba30-[002,004,006]
mb-l40s         up 7-00:00:00      3    mix mbl40s-[001-003]
mb-h100         up 7-00:00:00      4    mix mbh100-[001-003,005]
mb-a6000        up 7-00:00:00      1    mix mba6000-001
wildiris        up 7-00:00:00      0    n/a
teton           up 7-00:00:00      3    mix t[460,502,507]
beartooth       up 7-00:00:00      0    n/a
inv-arcc        up   infinite      0    n/a
inv-inbre       up 7-00:00:00      2    mix t[502,507]
inv-ssheshap    up 7-00:00:00      1    mix mba6000-001
inv-wysbc       up 7-00:00:00      0    n/a
inv-soc         up 7-00:00:00      1    mix mbl40s-001
inv-wildiris    up 7-00:00:00      0    n/a
inv-klab        up 7-00:00:00      2    mix mba30-[002,004]
inv-dale        up 7-00:00:00      0    n/a
inv-wsbc        up 7-00:00:00      1    mix mba30-006
non-investor    up 7-00:00:00      1    mix t460

format

To get sinfo printout with specified format & output

-O

sinfo -O NodeAddr,AllocMem,Cores

--Format

If appended with the --Format flag, sinfo info is given using the specified format & output. Format should be indicated using column names recognized by SLURM (hint: run sinfo --helpFormat to get a list of SLURM’s recognized column names)

Expand
titleExpand to see an example of sinfo command run with --Format flag, and output
Code Block
[user17@mblog1 ~]$ sinfo --Format="AllocMem,AllocNodes,Available,Cores,CPUs,CPUsLoad,Disk,Gres,Nodes,Memory"
ALLOCMEM  ALLOCNODES  AVAIL  CORES  CPUS  CPU_LOAD     TMP_DISK  GRES        NODES  MEMORY
886016    all         up     48     96    90.25        0         (null)      1      1023575
924576    all         up     48     96    96.06-96.12  0         (null)      5      1023575
511296    all         up     48     96    95.84        0         (null)      1      1023575
393216    all         up     48     96    96.45-96.56  0         (null)      2      1023575
588096    all         up     48     96    89.97        0         (null)      1      1023575
570336    all         up     48     96    96.31-96.43  0         (null)      3      1023575
629376    all         up     48     96    96.23-96.40  0         (null)      5      1023575
514912    all         up     48     96    92.31        0         (null)      1      1023575
688416    all         up     48     96    96.33        0         (null)      1      1023575
798304    all         up     48     96    93.06        0         (null)      1      1023575
857344    all         up     48     96    93.08        0         (null)      1      1023575
865536    all         up     48     96    96.10-96.25  0         (null)      2      1023575
806496    all         up     48     96    96.23        0         (null)      1      1023575
102400    all         up     48     96    42.22        0         gpu:a30:8   1      765525
208896    all         up     48     96    82.04        0         gpu:a30:8   1      765525
524288    all         up     48     96    0.02         0         gpu:a30:8   1      765525
49152     all         up     48     96    585.36       0         gpu:a30:8   1      765525
0         all         up     48     96    0.00-0.02    0         gpu:a30:8   4      765525
0         all         up     12     12    0.00         0         gpu:l40s:1  1      75469
0         all         up     48     96    0.00         0         gpu:l40s:8  1      765525
524288    all         up     48     96    4.41-5.24    0         gpu:l40s:8  2      765525

format

To get squeue printout with specified format & output

-O

squeue -O Account,UserName,JobID,SubmitTime,StartTime,TimeLeft

--Format

If appended with the --Format flag, squeue info is given using the specified format & output. Format should be indicated using column names recognized by SLURM (hint: run squeue --helpFormat to get a list of SLURM’s recognized column names)

Expand
titleExpand to see an example of squeue command run with --Format flag, and output
Code Block
[user17@mblog1 ~]$ squeue --Format="Account,UserName,JobID,SubmitTime,StartTime,TimeLeft"
Mon Jan 1 12:55:23 2020
ACCOUNT         USER      JOBID      SUBMIT_TIME          START_TIME           TIME_LEFT
deeplearnlab    user17    1000062    2024-08-14T10:31:07  2024-08-14T10:31:09  6-04:42:51
deeplearnlab    user17    1000091    2024-08-14T10:31:06  2024-08-14T10:31:07  6-04:42:49
deeplearnlab    user17    1000099    2024-08-14T10:31:06  2024-08-14T10:31:07  6-04:42:49

** You can also run squeue --help to get a comprehensive list of flags available to run with the squeue command.
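When the queue is long, a per-state count is often easier to read than the full listing. The sketch below tallies the ST column of default squeue output; the function name jobs_by_state is hypothetical, and it assumes job names contain no spaces so that the state is always the fifth field.

```shell
# Hypothetical helper: count jobs per state (ST column) in default squeue
# output read from stdin. Assumes the state is the fifth whitespace-separated
# field, which holds only when job names contain no spaces.
jobs_by_state() {
    awk 'NR > 1 && NF >= 5 { count[$5]++ }      # NR > 1 skips the header row
         END { for (state in count) print state, count[state] }' | sort
}
```

For example, squeue --me | jobs_by_state prints one line per state, such as "PD 1" and "R 2".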

SACCT: Get information about recent or completed jobs on the cluster with sacct

The default sacct command: This prints a list of your recent or recently completed jobs

Expand
titleExpand to see an example of running sacct as default
Code Block
[user17@mblog1 ~]$ sacct

JobID           JobName  Partition    Account  AllocCPUS      State      ExitCode 
------------ ---------- ----------    ---------- ---------- ----------   -------- 
1000000      sys/dashb+         mb     aiproject     4      COMPLETED      0:0 
1000000.bat+      batch                aiproject     4      COMPLETED      0:0 
1000000.ext+     extern                aiproject     4      COMPLETED      0:0 
1000003      sys/dashb+         mb     aiproject     8      RUNNING        0:0 
1000003.bat+      batch                aiproject     8      RUNNING        0:0 
1000003.ext+     extern                aiproject     8      RUNNING        0:0 
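To pick out jobs that did not finish cleanly, the default sacct listing can be filtered on its State column. The following is a sketch under stated assumptions: the function name unsuccessful_jobs is made up, and it relies on State and ExitCode being the last two columns, as in the default output shown above.

```shell
# Hypothetical helper: list job IDs whose State is anything other than
# COMPLETED, RUNNING or PENDING. Reads default sacct output on stdin and
# assumes State and ExitCode are the last two columns.
unsuccessful_jobs() {
    awk '
        /^--/      { seen_rule = 1; next }   # dashed rule under the header
        !seen_rule { next }                  # ignore everything before it
        NF < 2     { next }                  # ignore blank lines
        {
            state = $(NF - 1); exit_code = $NF
            if (state != "COMPLETED" && state != "RUNNING" && state != "PENDING")
                print $1, state, exit_code
        }'
}
```

Piping the listing through it, as in sacct | unsuccessful_jobs, leaves only lines such as "1000013 TIMEOUT 0:0", which you can then compare against the exit code table on this page.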

Helpful flags when calling sacct to tailor your query

Flag

Use this when

Short Form

Short Form Ex.

Long Form

Useful flag info, Long Form Example & Output

job

To get info about specific job#(s)

-j

sacct -j 1000013

--jobs

Expand
titleExpand to see an example of running sacct with --jobs flag
Code Block
[user05@mblog1 ~]$ sacct --jobs=1000013,1000025

JobID           JobName  Partition    Account  AllocCPUS      State      ExitCode 
------------ ---------- ----------    ---------- ---------- ----------   -------- 
1000013      sys/dashb+         mb     mlproject     4        TIMEOUT      0:0 
1000013.bat+      batch                mlproject     4      CANCELLED     0:15 
1000013.ext+     extern                mlproject     4      COMPLETED      0:0 
1000025      sys/dashb+         mb     mlproject     8        RUNNING      0:0 
1000025.bat+      batch                mlproject     8        RUNNING      0:0 
1000025.ext+     extern                mlproject     8        RUNNING      0:0 

batch script

To view batch / submission script for a specific job

-B

sacct -j 1000101 -B

--batch-script

You must specify a job with the --jobs or -j flag to use the -B or --batch-script flag and see its associated batch / submission script. This will not work on interactive jobs run from an salloc command, or on jobs that were not submitted from a script.

Expand
titleExpand to see an example of running sacct with --batch-script flag and output
Code Block
[user05@mblog1 ~]$ sacct -j 1000101 --batch-script
Batch Script for 1000101
---------------------------------------------------------------------
#!/bin/bash
#SBATCH --account=extrememl
#SBATCH --time=1:00:00
#SBATCH --mail-user=johnsmith@uwyo.edu
#SBATCH --mail-type=all

# Clear out and then load necessary software
module purge
module load gcc/14.2.0 r/4.4.0

# Browse to my project folder
cd /project/myprojdir/johnsmith/scripts/

# Export useful connection variables
export HOSTNAME

# Run my code
Rscript myscript.R

user

To get a printout of a specific user’s jobs

-u

sacct -u joeblow

--user

The --user or -u flag (shown in the expandable example below with a username specified) prints the sacct info only for jobs submitted by the specified user:

Expand
titleExpand to see an example of sacct command run with --user flag, and output
Code Block
[joeblow@mblog1 ~]$ sacct --user=joeblow
JobID     JobName Partition  Account   AllocCPUs State   ExitCode
-------   ------- ---------  --------- --------- ------- --------
1000002   AIML-CE   mb       extremeai        4  RUNNING      0:0
1000005   AIML-CE   mb       extremeai        4  RUNNING      0:0

start

To get a printout of job(s) starting after a date/time

-S

sacct -S 2024-11-01

--start

Dates and times should be specified with format YYYY-MM-DD[THH:MM[:SS]], for example 2024-11-01 or 2024-11-01T08:00

Expand
titleExpand to see an example of running sacct with --start and output
Code Block
[user05@mblog1 ~]$ sacct --start=2024-11-01

JobID           JobName  Partition    Account  AllocCPUS      State      ExitCode 
------------ ---------- ----------    ---------- ---------- ----------   -------- 
1000013      sys/dashb+         mb     mlproject     4        TIMEOUT      0:0 
1000013.bat+      batch                mlproject     4      CANCELLED     0:15 
1000013.ext+     extern                mlproject     4      COMPLETED      0:0 
1000025      sys/dashb+         mb     mlproject     8        RUNNING      0:0 
1000025.bat+      batch                mlproject     8        RUNNING      0:0 
1000025.ext+     extern                mlproject     8        RUNNING      0:0 

end

To get a printout of job(s) ending before a given date/time

-E

sacct -E 2024-11-24T12:00:00

--end

Dates and times should be specified with format YYYY-MM-DD[THH:MM[:SS]], for example 2024-11-24 or 2024-11-24T12:00:00

Expand
titleExpand to see an example of running sacct with --start and --end flags and output
Code Block
[user05@mblog1 ~]$ sacct --start=2024-11-01 --end=2024-11-24

JobID           JobName  Partition    Account  AllocCPUS      State      ExitCode 
------------ ---------- ----------    ---------- ---------- ----------   -------- 
1000013      sys/dashb+         mb     mlproject     4        TIMEOUT      0:0 
1000013.bat+      batch                mlproject     4      CANCELLED     0:15 
1000013.ext+     extern                mlproject     4      COMPLETED      0:0 
1000025      sys/dashb+         mb     mlproject     8        RUNNING      0:0 
1000025.bat+      batch                mlproject     8        RUNNING      0:0 
1000025.ext+     extern                mlproject     8        RUNNING      0:0 

format

To get sacct printout with specified format & output

-o

sacct -o Account,JobID

--format

When the --format flag is given, sacct prints only the listed columns, in the order given. Columns must be named using field names recognized by SLURM (hint: run sacct --helpformat to get a list of SLURM’s recognized field names).

Expand
titleExpand to see an example of sacct command run with --format flag, and output
Code Block
[user17@mblog1 ~]$ sacct --format="Account,JobID"
  ACCOUNT          JOBID
  ------------    -----------             
  deeplearnlab    1000062         
  deeplearnlab    1000091       
  deeplearnlab    1000099    

submit line

To view the submit command for a specified job

-o SubmitLine

sacct -o SubmitLine -j 1000101

--format=SubmitLine

This uses the --format flag from above to print the command you entered to submit the job specified after the -j flag.

Expand
titleExpand to see an example of running this command, and example output
Code Block
[user11@mblog1 ~]$ sacct --format=SubmitLine -j 1000324
          SubmitLine 
-------------------- 
  sbatch main_job.sh 

WorkDir

To view the working directory used by the job to execute commands

-o WorkDir

sacct -o WorkDir -j 1000101

--format=WorkDir

Expand
titleExpand to see an example of running this command, and example output
Code Block
[user11@mblog1 ~]$ sacct --format=WorkDir -j 1000324
             WorkDir 
-------------------- 
  /project/deeplearnlab/ 
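
The -S and -E flags above take timestamps, and those are easy to get wrong by hand. The following is a minimal sketch, assuming GNU date (with its -d option) is available where you run it. The function name sacct_window is hypothetical, and the helper only prints a sacct command for the last N days rather than running it.

```shell
#!/bin/bash
# Hypothetical helper: print a sacct command covering the last N days.
# Assumes GNU date (for the -d option); nothing is queried or submitted here.
sacct_window() {
    local days="${1:-7}"
    local start end
    # Build timestamps in the form YYYY-MM-DDTHH:MM:SS
    start="$(date -d "${days} days ago" +%Y-%m-%dT%H:%M:%S)"
    end="$(date +%Y-%m-%dT%H:%M:%S)"
    printf 'sacct -S %s -E %s\n' "$start" "$end"
}

# Print the command for the last 7 days
sacct_window 7
```

Once the printed command looks right, it can be copied to the prompt and combined with other flags from the table, such as -u or -o.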

My Job Failed. What Do these Exit Codes Mean?

Slurm records exit codes as numerical values that can seem rather cryptic. While we can’t always know what caused them without investigation, some causes are more likely than others. An exit code is shown either as a single number or as two numbers separated by a colon: the number before the colon is the exit status returned by the job, and the number after it is the signal that terminated the job, if any. Common exit codes and their likely causes are below:

Exit Code

Likely Cause

0

The job ran successfully

Any non-zero value

The job failed in some form or another

1

A general failure

2

Something was wrong with a shell command in the script

3 and above

Job error associated with software commands (check software specific exit codes)

0:9

The job was cancelled (usually by the user or by Slurm/the system)

0:15

The job was cancelled (usually because the user cancelled the job, or it ran over its specified walltime)

0:53

Some file or directory referenced in the script was not readable or writable

0:125

Job ran out of memory

Anything else

Contact arcc-help@uwyo.edu to have us investigate
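
The table above can be turned into a small lookup for use at the prompt. This is only a sketch: the function name explain_exit_code is hypothetical, and the messages simply restate the likely causes from the table, so treat the result as a hint rather than a diagnosis.

```shell
#!/bin/bash
# Hypothetical helper: map an exit code (as shown in the ExitCode column)
# to the likely cause listed in the table above.
explain_exit_code() {
    # Drop a trailing ":0" (no signal) so that "1:0" is handled like "1"
    local code="${1%:0}"
    case "$code" in
        0)     echo "The job ran successfully" ;;
        1)     echo "A general failure" ;;
        2)     echo "Something was wrong with a shell command in the script" ;;
        0:9)   echo "The job was cancelled (by the user or by Slurm/the system)" ;;
        0:15)  echo "The job was cancelled (by the user, or it ran over its walltime)" ;;
        0:53)  echo "A file or directory referenced in the script was not readable or writable" ;;
        0:125) echo "The job ran out of memory" ;;
        # Empty input or any other code:signal pair is not covered by the table
        ""|*[!0-9]*) echo "Not in the table: contact arcc-help@uwyo.edu" ;;
        # Any remaining plain number is 3 or above
        *)     echo "Job error associated with software commands (check software-specific exit codes)" ;;
    esac
}

explain_exit_code 0:15
```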

** You can also run sacct --help to get a comprehensive list of flags available to run with the sacct command.
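
For scripting, sacct output is easier to process in delimited form than in aligned columns. The sketch below assumes sacct's -P (--parsable2), -n (--noheader) and -X (--allocations) flags to produce one JobID|State record per job. The function name count_states is hypothetical, and the records fed to it here are illustrative, not real accounting data.

```shell
#!/bin/bash
# Hypothetical helper: tally job states from pipe-delimited sacct records.
# Input lines look like "JobID|State", as printed by a command such as:
#   sacct -X -n -P --format=JobID,State
count_states() {
    awk -F'|' 'NF >= 2 { n[$2]++ } END { for (s in n) print s, n[s] }' | sort
}

# Illustrative records only
count_states <<'EOF'
1000013|TIMEOUT
1000025|RUNNING
1000031|RUNNING
EOF
```

On a cluster, the here-document would be replaced by piping the sacct command shown in the comment into count_states.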

SINFO: Get information about cluster nodes and partitions

The default sinfo command prints a list of all partitions, their states, availability, and associated nodes on the cluster.

Expand
titleExpand to see an example of running the default sinfo command and its output, with no flags or arguments
Code Block
[user1@mblog2 ~]$ sinfo
PARTITION    AVAIL  TIMELIMIT  NODES  STATE NODELIST
mb*             up 7-00:00:00      1    mix mbcpu-007
mb*             up 7-00:00:00     24  alloc mbcpu-[001-006,008-025]
mb-a30          up 7-00:00:00      1  maint mba30-008
mb-a30          up 7-00:00:00      3    mix mba30-[002,004,006]
mb-a30          up 7-00:00:00      1  alloc mba30-005
mb-a30          up 7-00:00:00      3   idle mba30-[001,003,007]
mb-l40s         up 7-00:00:00      1  maint vl40s-002
mb-l40s         up 7-00:00:00      1   resv mbl40s-004
mb-l40s         up 7-00:00:00      3    mix mbl40s-[001-003]
mb-l40s         up 7-00:00:00      1   idle mbl40s-007
mb-h100         up 7-00:00:00      1 drain$ mbh100-004
mb-h100         up 7-00:00:00      4    mix mbh100-[001-003,005]
mb-a6000        up 7-00:00:00      1    mix mba6000-001
wildiris        up 7-00:00:00      5   idle wi[001-005]
teton           up 7-00:00:00      1  drain t286
teton           up 7-00:00:00      3    mix t[460,502,507]
teton           up 7-00:00:00     24   idle t[285,287-296,501,503-506,508],thm[03-05],tmass[01-02],ttest[01-02]
beartooth       up 7-00:00:00      1   idle b523
inv-arcc        up   infinite      1  alloc mbcpu-025
inv-arcc        up   infinite      2   idle ttest[01-02]
inv-inbre       up 7-00:00:00      1  drain t286
inv-inbre       up 7-00:00:00      2    mix t[502,507]
inv-inbre       up 7-00:00:00      1  alloc mbcpu-009
inv-inbre       up 7-00:00:00     24   idle b523,mbl40s-007,t[285,287-296,501,503-506,508],thm[03-05],tmass[01-02]
inv-ssheshap    up 7-00:00:00      1    mix mba6000-001
inv-wysbc       up 7-00:00:00      1  alloc mbcpu-001
inv-wysbc       up 7-00:00:00      1   idle mba30-001
inv-soc         up 7-00:00:00      1    mix mbl40s-001
inv-wildiris    up 7-00:00:00      5   idle wi[001-005]
inv-klab        up 7-00:00:00      3    mix mba30-[002,004],mbcpu-007
inv-klab        up 7-00:00:00      6  alloc mba30-005,mbcpu-[002-006]
inv-klab        up 7-00:00:00      1   idle mba30-003
inv-dale        up 7-00:00:00      1  alloc mbcpu-008
inv-wsbc        up 7-00:00:00      1    mix mba30-006
inv-wsbc        up 7-00:00:00      1  alloc mbcpu-010
non-investor    up 7-00:00:00      1    mix t460
non-investor    up 7-00:00:00     14  alloc mbcpu-[011-024]
Helpful flags when calling sinfo to tailor your query

Flag

Use this when

Short Form

Short Form Ex.

Long Form

Useful flag info, Long Form Example & Output

state

Shows any nodes in state(s) specified

-t

sinfo -t reserved

--states

The --states flag prints sinfo output limited to nodes in the specified state(s), along with the number of nodes from each partition in that state. If no nodes in a partition are in the state, the number of nodes will be 0 for that partition’s line.

Expand
titleExpand to see an example of sinfo command run with --states flag, and output
Code Block
[jsmith@mblog1 ~]$ sinfo --states=mixed
PARTITION    AVAIL  TIMELIMIT  NODES  STATE NODELIST
mb*             up 7-00:00:00      0    n/a
mb-a30          up 7-00:00:00      3    mix mba30-[002,004,006]
mb-l40s         up 7-00:00:00      3    mix mbl40s-[001-003]
mb-h100         up 7-00:00:00      4    mix mbh100-[001-003,005]
mb-a6000        up 7-00:00:00      1    mix mba6000-001
wildiris        up 7-00:00:00      0    n/a
teton           up 7-00:00:00      3    mix t[460,502,507]
beartooth       up 7-00:00:00      0    n/a
inv-arcc        up   infinite      0    n/a
inv-inbre       up 7-00:00:00      2    mix t[502,507]
inv-ssheshap    up 7-00:00:00      1    mix mba6000-001
inv-wysbc       up 7-00:00:00      0    n/a
inv-soc         up 7-00:00:00      1    mix mbl40s-001
inv-wildiris    up 7-00:00:00      0    n/a
inv-klab        up 7-00:00:00      2    mix mba30-[002,004]
inv-dale        up 7-00:00:00      0    n/a
inv-wsbc        up 7-00:00:00      1    mix mba30-006
non-investor    up 7-00:00:00      1    mix t460

format

To get sinfo printout with specified format & output

-O

sinfo -O NodeAddr,AllocMem,Cores

--Format

When the --Format flag is given, sinfo prints only the listed fields. Fields must be named using field names recognized by SLURM (hint: run sinfo --helpFormat to get a list of SLURM’s recognized field names).

Expand
titleExpand to see an example of sinfo command run with --Format flag, and output
Code Block
[user17@mblog1 ~]$ sinfo --Format="AllocMem,AllocNodes,Available,Cores,CPUs,CPUsLoad,Disk,Gres,Nodes,Memory"
ALLOCMEM ALLOCNODES AVAIL   CORES   CPUS     CPU_LOAD    TMP_DISK  GRES       NODES    MEMORY
886016     all       up       48      96      90.25          0     (null)        1     1023575
924576     all       up       48      96      96.06-96.12    0     (null)        5     1023575
511296     all       up       48      96      95.84          0     (null)        1     1023575
393216     all       up       48      96      96.45-96.56    0     (null)        2     1023575
588096     all       up       48      96      89.97          0     (null)        1     1023575
570336     all       up       48      96      96.31-96.43    0     (null)        3     1023575
629376     all       up       48      96      96.23-96.40    0     (null)        5     1023575
514912     all       up       48      96      92.31          0     (null)        1     1023575
688416     all       up       48      96      96.33          0     (null)        1     1023575
798304     all       up       48      96      93.06          0     (null)        1     1023575
857344     all       up       48      96      93.08          0     (null)        1     1023575
865536     all       up       48      96      96.10-96.25    0     (null)        2     1023575
806496     all       up       48      96      96.23          0     (null)        1     1023575
102400     all       up       48      96      42.22          0     gpu:a30:8     1      765525
208896     all       up       48      96      82.04          0     gpu:a30:8     1      765525
524288     all       up       48      96      0.02           0     gpu:a30:8     1      765525
49152      all       up       48      96      585.36         0     gpu:a30:8     1      765525
0          all       up       48      96      0.00-0.02      0     gpu:a30:8     4      765525
0          all       up       12      12      0.00           0     gpu:l40s:1    1       75469
0          all       up       48      96      0.00           0     gpu:l40s:8    1      765525
524288     all       up       48      96      4.41-5.24      0     gpu:l40s:8    2      765525
262144     all       up       48      96      2.43           0     gpu:l40s:8    1      765525
0          all       up       48      96      0.00           0     gpu:l40s:4    1      765525
0          all       up       48      96      0.35           0     gpu:h100:8    1     1281554
524288     all       up       48      96      0.26-12.20     0     gpu:h100:8    4     1281554
262144     all       up       32      64      6.03           0     gpu:a6000:4   1     1023575
0          all       up       14+     28+     0.00-0.01      0     (null)        30     119962+
0          all       up       28      56      0.00           0     gpu:a30:2     1     1020129
32768      all       up       16      32      15.17          0     (null)        1      128000
30720      all       up       20      40      2.00-2.02      0     (null)        2      184907

ARCCQUOTA: Get a report of your common HPC data storage locations and usage

arccquota shows information relating to storage quotas. By default, it displays $HOME and $SCRATCH quotas first, followed by the user's associated project quotas. This is a change on Teton from Mount Moran, but the tool is much more comprehensive. The command takes arguments to show project quotas only (i.e., no $HOME or $SCRATCH info displayed), to give an extensive listing of users' quotas and usage within project directories, or to summarize quotas (i.e., no user-specific usage on project spaces).

...

titleExpand to view the default arccquota command and example output

...

titleExpand to view the arccquota command querying a specified user and example output

...


SEFF: Analyze the efficiency of a completed job with seff

The following is a short breakdown of how to use the seff command. Please see this page for a detailed description of how to evaluate your job’s performance and efficiency.

The seff command reports the CPU and memory efficiency of a job when given a valid job number as its argument: seff <job#>. This information is only accurate if the job has completed successfully. For jobs that are still running, or that ended with an out-of-memory error or another error, seff output will be inaccurate.

Expand
titleExpand to view an example of using the seff command, and its output
Code Block
[]$ seff 10001001
Job ID: 10001001
Cluster: Medicinebow
User/Group: jsmith/mycoolproject
State: COMPLETED (exit code 0)
Cores: 1
CPU Utilized: 00:00:05
CPU Efficiency: 27.78% of 00:00:18 core-walltime
Job Wall-clock time: 00:00:18
Memory Utilized: 0.00 MB (estimated maximum)
Memory Efficiency: 0.00% of 8.00 GB (8.00 GB/node)
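
When checking many jobs, it can help to pull a single number out of the seff report. The sketch below assumes the report contains a "CPU Efficiency:" line formatted as in the example above. The function name cpu_efficiency is hypothetical, and the input here is the sample text from this page rather than live output.

```shell
#!/bin/bash
# Hypothetical helper: extract the CPU efficiency percentage from seff output.
cpu_efficiency() {
    # Split "CPU Efficiency: 27.78% of ..." on ": ", then keep the text before "%"
    awk -F': ' '/^CPU Efficiency:/ { split($2, parts, "%"); print parts[1] }'
}

# Sample lines copied from the seff example on this page; on a cluster you
# would pipe the output of seff for your own job number into cpu_efficiency.
cpu_efficiency <<'EOF'
CPU Utilized: 00:00:05
CPU Efficiency: 27.78% of 00:00:18 core-walltime
Job Wall-clock time: 00:00:18
EOF
```

As noted above, the value is only meaningful for jobs that completed successfully.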
Expand
titleLinks to more information on SLURM commands
Insert excerpt
Slurm Workload Manager
nameLink to Slurm info