Goal: Introduce further Slurm features for checking job efficiency and cluster utilization.

Job Efficiency

Run the seff command with a job ID to see how efficiently a job used its allocated CPU and memory.

[]$ seff 13515489
Job ID: 13515489
Cluster: <cluster-name>
User/Group: <username>/<username>
State: COMPLETED (exit code 0)
Cores: 1
CPU Utilized: 00:00:05
CPU Efficiency: 27.78% of 00:00:18 core-walltime
Job Wall-clock time: 00:00:18
Memory Utilized: 0.00 MB (estimated maximum)
Memory Efficiency: 0.00% of 8.00 GB (8.00 GB/node)

Note:

The seff report is only accurate if the job completed successfully.
If the job fails, for example with an Out-Of-Memory (OOM) error, the reported details will be inaccurate.
This report is also emailed to you if you have Slurm email notifications turned on.
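
The CPU Efficiency figure in the seff output above can be reproduced by hand: it is the CPU time used divided by the core-walltime (cores multiplied by wall-clock time). The following is a minimal sketch of that arithmetic. The cpu_efficiency function name is hypothetical and is not part of seff, and it assumes plain HH:MM:SS durations with no day prefix.

```shell
# Hypothetical helper: recompute seff's CPU efficiency percentage.
# Arguments: CPU time used (HH:MM:SS), wall-clock time (HH:MM:SS), cores.
cpu_efficiency() {
    awk -v used="$1" -v wall="$2" -v cores="$3" '
        # Convert HH:MM:SS to seconds (assumes no D- day prefix).
        function secs(t,   p) { split(t, p, ":"); return p[1] * 3600 + p[2] * 60 + p[3] }
        BEGIN { printf "%.2f\n", 100 * secs(used) / (secs(wall) * cores) }'
}

# Values taken from the sample job above: 5 s of CPU over 18 s on 1 core.
cpu_efficiency 00:00:05 00:00:18 1   # prints 27.78
```

A low percentage on a multi-core job often means that fewer cores could have been requested.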

What’s the Current Cluster Utilization?

There are a number of ways to see the current status of the cluster:

arccjobs: Prints a table showing active projects and jobs.
pestat: Prints a list of nodes with their allocated jobs; individual nodes can also be queried.
sinfo: Shows the status of Slurm partitions and nodes. Use the -R flag to see which nodes are drained and why.
OnDemand’s MedicineBow System Status page.

arccjobs example

[]$ arccjobs
===============================================================================
Account                         Running                      Pending
  User                   jobs    cpus         cpuh    jobs    cpus         cpuh
===============================================================================
eap-amadson               500     500        30.42       3       3         2.00
  amadson                 500     500        30.42       3       3         2.00

eap-larsko                  1      32      2262.31       0       0         0.00
  fghorban                  1      32      2262.31       0       0         0.00

pcg-llps                    2      64      1794.41       0       0         0.00
  hbalantr                  1      32       587.68       0       0         0.00
  vvarenth                  1      32      1206.73       0       0         0.00

===============================================================================
TOTALS:                   503     596      4087.14       3       3         2.00
===============================================================================
Nodes                       9/51      (17.65%)
Cores                     596/4632    (12.87%)
Memory (GB)              2626/46952   ( 5.59%)
CPU Load               803.43         (17.35%)
===============================================================================
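
The percentages in the Nodes, Cores, and Memory rows of the summary are each the amount in use divided by the total. As a sanity check, this sketch recomputes them from the sample output above. percent_used is a hypothetical helper, not an arccjobs option.

```shell
# Hypothetical helper: percentage of a resource in use, to two decimals.
percent_used() {
    awk -v used="$1" -v total="$2" 'BEGIN { printf "%.2f%%\n", 100 * used / total }'
}

# Figures copied from the sample arccjobs summary.
percent_used 9 51         # nodes:       prints 17.65%
percent_used 596 4632     # cores:       prints 12.87%
percent_used 2626 46952   # memory (GB): prints 5.59%
```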

pestat example

[]$ pestat
Hostname          Partition     Node Num_CPU  CPUload  Memsize  Freemem  Joblist
                               State Use/Tot  (15min)     (MB)     (MB)  JobID(JobArrayID) User ...
mba30-001            mb-a30    idle    0  96    0.00    765525   749441
mba30-002            mb-a30    idle    0  96    0.00    765525   761311
mba30-003            mb-a30    idle    0  96    0.00    765525   761189
...
mbl40s-004          mb-l40s    idle    0  96    0.00    765525   761030
mbl40s-005          mb-l40s    idle    0  96    0.00    765525   760728
mbl40s-007          mb-l40s    idle    0  96    0.00    765525   761452
wi001          inv-wildiris    idle    0  48    0.00    506997   505745
wi002          inv-wildiris    idle    0  48    0.00    506997   505726
wi003          inv-wildiris    idle    0  48    0.00    506997   505746
wi004          inv-wildiris    idle    0  48    0.00    506997   505729
wi005          inv-wildiris    idle    0  56    0.00   1031000  1020610
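
Because pestat prints plain whitespace-separated text, its output can be filtered with standard tools. The sketch below lists idle nodes in one partition that have at least a given amount of free memory. idle_nodes is a hypothetical helper, and it assumes the column layout of the sample output above (partition in column 2, state in column 3, Freemem in MB in column 8).

```shell
# Hypothetical helper: read pestat-style rows on stdin and print the
# hostname and free memory (MB) of idle nodes in the given partition.
# Arguments: partition name, minimum free memory in MB.
idle_nodes() {
    awk -v part="$1" -v min_mb="$2" \
        '$2 == part && $3 == "idle" && $8 + 0 >= min_mb + 0 { print $1, $8 }'
}

# Demonstration on two rows copied from the sample output.
# On the cluster this would be: pestat | idle_nodes inv-wildiris 1000000
idle_nodes inv-wildiris 1000000 <<'EOF'
wi004          inv-wildiris    idle    0  48    0.00    506997   505729
wi005          inv-wildiris    idle    0  56    0.00   1031000  1020610
EOF
# prints: wi005 1020610
```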

sinfo examples

# View overall cluster:
[]$ sinfo -eO "CPUs:8,Memory:9,Gres:14,NodeAIOT:16,NodeList:50"
CPUS    MEMORY   GRES          NODES(A/I/O/T)  NODELIST
96      1023575  (null)        6/19/0/25       mbcpu-[001-025]
96      765525   gpu:a30:8     0/8/0/8         mba30-[001-008]
96      765525   gpu:l40s:8    1/4/0/5         mbl40s-[001-005]
96      765525   gpu:l40s:4    0/1/0/1         mbl40s-007
64      1023575  gpu:a6000:4   0/1/0/1         mba6000-001
48      506997   (null)        0/4/0/4         wi[001-004]
56      1031000  gpu:a30:2     0/1/0/1         wi005
96      1281554  gpu:h100:8    1/3/2/6         mbh100-[001-006]

# View a particular (investment) partition:
[]$ sinfo -p inv-wildiris
PARTITION    AVAIL  TIMELIMIT  NODES  STATE NODELIST
inv-wildiris    up   infinite      5   idle wi[001-005]

# View compute nodes currently drained:
[]$ sinfo -R
REASON               USER      TIMESTAMP           NODELIST
HW Status: Unknown - slurm     2024-07-19T12:02:04 mbh100-001
Not responding       slurm     2024-07-30T13:49:06 mbh100-006
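
In the overall cluster view, the NODES(A/I/O/T) column gives the number of allocated, idle, other, and total nodes for each row. The sketch below adds those counts up across all rows for a cluster-wide summary. node_totals is a hypothetical helper, and it assumes the five-column format string used in the first sinfo example, where the A/I/O/T counts are the fourth field.

```shell
# Hypothetical helper: sum the A/I/O/T counts (4th field) over all rows.
# Lines whose 4th field is not four slash-separated numbers, such as the
# header, are skipped.
node_totals() {
    awk '$4 ~ /^[0-9]+\/[0-9]+\/[0-9]+\/[0-9]+$/ {
             split($4, n, "/"); a += n[1]; i += n[2]; o += n[3]; t += n[4]
         }
         END { printf "allocated=%d idle=%d other=%d total=%d\n", a, i, o, t }'
}

# Demonstration on the sample output from the first sinfo example.
# On the cluster, pipe that sinfo command into node_totals instead.
node_totals <<'EOF'
CPUS    MEMORY   GRES          NODES(A/I/O/T)  NODELIST
96      1023575  (null)        6/19/0/25       mbcpu-[001-025]
96      765525   gpu:a30:8     0/8/0/8         mba30-[001-008]
96      765525   gpu:l40s:8    1/4/0/5         mbl40s-[001-005]
96      765525   gpu:l40s:4    0/1/0/1         mbl40s-007
64      1023575  gpu:a6000:4   0/1/0/1         mba6000-001
48      506997   (null)        0/4/0/4         wi[001-004]
56      1031000  gpu:a30:2     0/1/0/1         wi005
96      1281554  gpu:h100:8    1/3/2/6         mbh100-[001-006]
EOF
# prints: allocated=8 idle=41 other=2 total=51
```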

