Goal: Introduce some further features, such as job efficiency and cluster utilization.

Table of Contents

...

Job Efficiency

Info

You can view the CPU and memory efficiency of a job using the seff command, providing a <job-id>.

Code Block
[]$ seff 13515489
Job ID: 13515489
Cluster: <cluster-name>
User/Group: <username>/<username>
State: COMPLETED (exit code 0)
Cores: 1
CPU Utilized: 00:00:05
CPU Efficiency: 27.78% of 00:00:18 core-walltime
Job Wall-clock time: 00:00:18
Memory Utilized: 0.00 MB (estimated maximum)
Memory Efficiency: 0.00% of 8.00 GB (8.00 GB/node)
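
In this example, the CPU efficiency is the CPU time actually used divided by the core-walltime: 00:00:05 over 1 core × 00:00:18, i.e. roughly 27.78%. Memory efficiency is likewise the maximum memory used divided by the memory allocated to the job (8.00 GB in this case).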
Info

Note:

  • Only accurate if the job is successful.

  • If the job fails, for example with an OOM (Out-Of-Memory) error, the details will be inaccurate.

  • This is emailed out if you have Slurm email notifications turned on (see the sketch below).
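
Info

To receive this summary by email, turn on Slurm email notifications in your batch script. The sketch below uses the standard --mail-type and --mail-user sbatch directives; the account, time limit, and email address are placeholder values, and whether the efficiency summary is included in the email depends on how the cluster is configured.

Code Block
#!/bin/bash
#SBATCH --account=<project-name>   # placeholder values for illustration only
#SBATCH --time=00:01:00
#SBATCH --mail-type=END,FAIL       # email when the job completes or fails
#SBATCH --mail-user=<your-email>   # address the notification (and efficiency summary) is sent to

srun hostname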

...

What’s the Current Cluster Utilization?

Info

There are a number of ways to see the current status of the cluster:

  • arccjobs: Prints a table showing active projects and jobs.

  • pestat: Prints a node list with allocated jobs; individual nodes can also be queried.

  • sinfo: View the status of Slurm partitions or nodes. Drained nodes, with the reason, can be listed using the -R flag.

  • OnDemand’s MedicineBow System Status page.

Expand
title: arccjobs example
Code Block
[]$ arccjobs
===============================================================================
Account                         Running                      Pending
  User                   jobs    cpus         cpuh    jobs    cpus         cpuh
===============================================================================
eap-amadson               500     500        30.42       3       3         2.00
  amadson                 500     500        30.42       3       3         2.00

eap-larsko                  1      32      2262.31       0       0         0.00
  fghorban                  1      32      2262.31       0       0         0.00

pcg-llps                    2      64      1794.41       0       0         0.00
  hbalantr                  1      32       587.68       0       0         0.00
  vvarenth                  1      32      1206.73       0       0         0.00

===============================================================================
TOTALS:                   503     596      4087.14       3       3         2.00
===============================================================================
Nodes                       9/51      (17.65%)
Cores                     596/4632    (12.87%)
Memory (GB)              2626/46952   ( 5.59%)
CPU Load               803.43         (17.35%)
===============================================================================
Expand
title: pestat example
Code Block
[]$ pestat
Hostname          Partition     Node Num_CPU  CPUload  Memsize  Freemem  Joblist
                               State Use/Tot  (15min)     (MB)     (MB)  JobID(JobArrayID) User ...
mba30-001            mb-a30    idle    0  96    0.00    765525   749441
mba30-002            mb-a30    idle    0  96    0.00    765525   761311
mba30-003            mb-a30    idle    0  96    0.00    765525   761189
...
mbl40s-004          mb-l40s    idle    0  96    0.00    765525   761030
mbl40s-005          mb-l40s    idle    0  96    0.00    765525   760728
mbl40s-007          mb-l40s    idle    0  96    0.00    765525   761452
wi001          inv-wildiris    idle    0  48    0.00    506997   505745
wi002          inv-wildiris    idle    0  48    0.00    506997   505726
wi003          inv-wildiris    idle    0  48    0.00    506997   505746
wi004          inv-wildiris    idle    0  48    0.00    506997   505729
wi005          inv-wildiris    idle    0  56    0.00   1031000  1020610
Expand
title: sinfo examples
Code Block
# View overall cluster:
[]$ sinfo -eO "CPUs:8,Memory:9,Gres:14,NodeAIOT:16,NodeList:50"
CPUS    MEMORY   GRES          NODES(A/I/O/T)  NODELIST
96      1023575  (null)        6/19/0/25       mbcpu-[001-025]
96      765525   gpu:a30:8     0/8/0/8         mba30-[001-008]
96      765525   gpu:l40s:8    1/4/0/5         mbl40s-[001-005]
96      765525   gpu:l40s:4    0/1/0/1         mbl40s-007
64      1023575  gpu:a6000:4   0/1/0/1         mba6000-001
48      506997   (null)        0/4/0/4         wi[001-004]
56      1031000  gpu:a30:2     0/1/0/1         wi005
96      1281554  gpu:h100:8    1/3/2/6         mbh100-[001-006]

# View a particular (investment) partition:
[]$ sinfo -p inv-wildiris
PARTITION    AVAIL  TIMELIMIT  NODES  STATE NODELIST
inv-wildiris    up   infinite      5   idle wi[001-005]

# View compute nodes currently drained:
[]$ sinfo -R
REASON               USER      TIMESTAMP           NODELIST
HW Status: Unknown - slurm     2024-07-19T12:02:04 mbh100-001
Not responding       slurm     2024-07-30T13:49:06 mbh100-006

...