HPC System and Job Queries
- 1 Overview: HPC Information and Compute Job Information
- 2 Common SLURM Commands
- 3 ARCCJOBS: Get a report of jobs currently running on the cluster
- 4 ARCCQUOTA: Get a report of your common HPC data storage locations and usage
Overview: HPC Information and Compute Job Information
System queries help you understand what is happening on the cluster: which compute jobs are running, how much of your storage quota is in use, your job history, and so on. This page contains commands and examples showing how to find that information.
Common SLURM Commands
The following describes common SLURM commands and the flags you may want to include when running them. SLURM commands are often run with flags (appended to the command as --flag) to specify what information should be included in the output.
SQUEUE: Get information about running and queued jobs on the cluster with squeue
This command displays information about the jobs that currently exist in the SLURM queue. Run with no flags, it prints all running and queued jobs on the cluster, listing each job's job ID, partition, name, username, job status, run time, number of nodes, and the list of nodes allocated to the job.
Helpful flags when calling squeue to tailor your query
Flag | Use this when | Short Form | Short Form Ex. | Long Form | Useful flag info, Long Form Example & Output |
---|---|---|---|---|---|
me | To get a printout with just your jobs | n/a | n/a | --me | The --me flag limits the output to your own jobs: [jsmith@mblog1 ~]$ squeue --me
JOBID PARTITION NAME USER ST TIME NODES NODELIST(REASON)
1000002 inv-lab2 AIML-CE jsmith R 6-13:02:32 1 mba30-004
1000005 inv-lab2 AIML-CE jsmith R 6-17:31:53 1 mba30-004 |
user | To get a printout of a specific user's jobs | -u | squeue -u joeblow | --user | The --user flag limits the output to the named user's jobs: [jsmith@mblog1 ~]$ squeue --user=joeblow
JOBID PARTITION NAME USER ST TIME NODES NODELIST(REASON)
1000002 inv-lab2 AIML-CE joeblow R 6-13:02:32 1 mba30-004
1000005 inv-lab2 AIML-CE joeblow R 6-17:31:53 1 mba30-004 |
long | To get a printout of jobs including wall time | -l | squeue -l | --long | The --long flag adds each job's time limit to the output. |
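format | To get squeue printout with specified format & output | -o | squeue -o "%i %P %u %T" | --format | If appended with the --format flag, squeue prints only the fields named in the format string. For example, squeue --format="%i %P %u %T" prints each job's ID, partition, user, and state. |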
** You can also run squeue --help to get a comprehensive list of the flags available for the squeue command.
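When the queue listing is long, a quick summary can be easier to read than the raw table. The following is a minimal sketch, not an ARCC-provided tool: the function name count_jobs_by_state is hypothetical, and it assumes the default squeue column layout shown in the examples above (job state in the fifth column, no spaces in job names). It is demonstrated on the sample output from this page so it can be tried without access to the cluster.

```shell
#!/usr/bin/env bash
# Count jobs per state (the ST column) from squeue's default output on stdin.
# Assumption: default column order JOBID PARTITION NAME USER ST ..., and job
# names without spaces (a space in a name would shift the columns).
count_jobs_by_state() {
    awk 'NR > 1 && NF >= 5 { count[$5]++ }
         END { for (s in count) print s, count[s] }' | sort
}

# Sample input copied from the squeue --me example on this page.
sample='JOBID PARTITION NAME USER ST TIME NODES NODELIST(REASON)
1000002 inv-lab2 AIML-CE jsmith R 6-13:02:32 1 mba30-004
1000005 inv-lab2 AIML-CE jsmith R 6-17:31:53 1 mba30-004'

printf '%s\n' "$sample" | count_jobs_by_state

# On a login node the same function can read live data:
#   squeue --me | count_jobs_by_state
```

For the sample above this prints one line per state, here only R (running) with a count of 2.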
SACCT: Get information about recent or completed jobs on the cluster with sacct
The default sacct command prints a list of your recent or recently completed jobs.
Helpful flags when calling sacct to tailor your query
Flag | Use this when | Short Form | Short Form Ex. | Long Form | Useful flag info, Long Form Example & Output |
---|---|---|---|---|---|
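job | To get info about specific job#(s) | -j | sacct -j 1000002 | --jobs | Accepts one job ID or a comma-separated list of job IDs, for example sacct --jobs=1000002,1000005 |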
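batch script | To view batch / submission script for a specific job | -B | sacct -j 1000002 -B | --batch-script | You must specify a job with the --jobs (-j) flag, for example sacct --jobs=1000002 --batch-script |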
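user | To get a printout of a specific user's jobs | -u | sacct -u joeblow | --user | The --user flag limits the output to the named user's jobs, for example sacct --user=joeblow |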
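start | To get a printout of job(s) starting after a date/time | -S | sacct -S 2024-01-01 | --starttime | Dates and times should be specified with format YYYY-MM-DD[THH:MM[:SS]], for example sacct --starttime=2024-01-01T08:00:00 (the date shown is only an illustration) |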
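end | To get a printout of job(s) ending before a given date/time | -E | sacct -E 2024-01-31 | --endtime | Dates and times should be specified with format YYYY-MM-DD[THH:MM[:SS]], for example sacct --endtime=2024-01-31T17:00:00 (the date shown is only an illustration) |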
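format | To get sacct printout with specified format & output | -o | sacct -o JobID,State | --format | If appended with the --format flag, sacct prints only the comma-separated fields you list, for example sacct --format=JobID,JobName,State,ExitCode |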
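submit line | To view the submit command for a specified job | -o SubmitLine | sacct -j 1000002 -o SubmitLine | --format=SubmitLine | This is a way of using the --format flag, for example sacct --jobs=1000002 --format=SubmitLine |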
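WorkDir | To view the working directory used by the job to execute commands | -o WorkDir | sacct -j 1000002 -o WorkDir | --format=WorkDir | This is another way of using the --format flag, for example sacct --jobs=1000002 --format=WorkDir |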
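The flags above can be combined to answer a common question: which of my recent jobs did not finish cleanly? The sketch below is one possible approach, not an official ARCC script. The helper name list_unsuccessful_jobs is hypothetical, the sample records are made up for illustration, and the input is assumed to come from sacct run with --parsable2 --noheader --allocations --format=JobID,JobName,State,ExitCode, which produces one pipe-delimited line per job.

```shell
#!/usr/bin/env bash
# Print JobID, State and ExitCode for every job that is neither finished
# successfully nor still active. Reads pipe-delimited records
# (JobID|JobName|State|ExitCode) from stdin.
list_unsuccessful_jobs() {
    awk -F'|' '$3 != "COMPLETED" && $3 != "RUNNING" && $3 != "PENDING" {
        print $1, $3, $4
    }'
}

# Hypothetical sample records in the assumed sacct output shape.
sample='1000002|AIML-CE|COMPLETED|0:0
1000005|AIML-CE|FAILED|1:0
1000009|AIML-CE|OUT_OF_MEMORY|0:125'

printf '%s\n' "$sample" | list_unsuccessful_jobs

# On a login node, with an illustrative start date:
#   sacct --starttime=2024-01-01 --parsable2 --noheader --allocations \
#         --format=JobID,JobName,State,ExitCode | list_unsuccessful_jobs
```

Filtering on the State field rather than on ExitCode keeps the check simple, because a cancelled or timed-out job does not always carry a non-zero exit code.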
My Job Failed. What Do These Exit Codes Mean?
Slurm records exit codes as numerical values that can seem rather cryptic. We cannot always know what caused them without investigation, but some causes are more likely than others. An exit code is either a single number or a pair of numbers separated by a colon. Common exit codes and their likely causes are listed below:
Exit Code | Likely Cause |
---|---|
0 | The job ran successfully |
Any non-zero value | The job failed in some form or another |
1 | A general failure |
2 | Something was wrong with a shell command in the script |
3 and above | Job error associated with software commands (check software-specific exit codes) |
0:9 | The job was cancelled (usually by the user or by Slurm/the system) |
0:15 | The job was cancelled (usually because the user cancelled the job, or it ran over the specified walltime) |
0:53 | Some file or directory referenced in the script was not readable or writable |
0:125 | Job ran out of memory |
Anything else | Contact arcc-help@uwyo.edu to have us investigate |
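The table above can be turned into a small lookup helper for use in your own scripts. This is a sketch under stated assumptions rather than an ARCC tool: the function name explain_exit_code is hypothetical, the messages simply restate the table, and a pair ending in :0 (such as 1:0) is assumed to mean the same as the single number before the colon.

```shell
#!/usr/bin/env bash
# Map a Slurm exit code string to the likely cause listed in the table above.
explain_exit_code() {
    local code="$1"
    # Assumption: "N:0" is treated like the single number N (e.g. 1:0 -> 1).
    case "$code" in
        *:0) code="${code%:0}" ;;
    esac
    case "$code" in
        0)     echo "The job ran successfully" ;;
        1)     echo "A general failure" ;;
        2)     echo "Something was wrong with a shell command in the script" ;;
        0:9)   echo "The job was cancelled (usually by the user or by Slurm/the system)" ;;
        0:15)  echo "The job was cancelled (usually by the user, or it ran over the specified walltime)" ;;
        0:53)  echo "Some file or directory referenced in the script was not readable or writable" ;;
        0:125) echo "Job ran out of memory" ;;
        # Any other pair, or a non-numeric or empty value, is not in the table.
        *[!0-9]*|"") echo "Contact arcc-help@uwyo.edu to have us investigate" ;;
        # What remains is a plain number of 3 or above.
        *)     echo "Job error associated with software commands (check software-specific exit codes)" ;;
    esac
}

explain_exit_code 0:125
explain_exit_code 1:0
explain_exit_code 7:9
```

The three calls print the out-of-memory message, the general-failure message, and the contact message respectively.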
** You can also run sacct --help to get a comprehensive list of the flags available for the sacct command.
SINFO: Get information about cluster nodes and partitions
The default sinfo command prints a list of all partitions, their states, availability, and associated nodes on the cluster.
Helpful flags when calling sinfo to tailor your query
Flag | < |
---|