Goal: Introduce some further features, such as checking job efficiency and current cluster utilization.
...
Job Efficiency
You can view the CPU and memory efficiency of a job using the seff command and providing a <job-id>:
[]$ seff 13515489
Job ID: 13515489
Cluster: <cluster-name>
User/Group: <username>/<username>
State: COMPLETED (exit code 0)
Cores: 1
CPU Utilized: 00:00:05
CPU Efficiency: 27.78% of 00:00:18 core-walltime
Job Wall-clock time: 00:00:18
Memory Utilized: 0.00 MB (estimated maximum)
Memory Efficiency: 0.00% of 8.00 GB (8.00 GB/node)
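In this example the job was allocated 1 core for 18 seconds of wall-clock time but only used 5 seconds of CPU time, so CPU Efficiency = 00:00:05 / (1 core × 00:00:18) ≈ 27.78%. Memory Efficiency is calculated the same way: the estimated maximum memory used divided by the 8.00 GB requested.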
...
The reported details are accurate if the job completes successfully. If the job fails, for example with an OOM (Out-Of-Memory) error, the details will be inaccurate. This summary is also emailed out if you have Slurm email notifications turned on.
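As a minimal sketch, email notifications can be requested in your batch script with the standard sbatch mail options (the address below is only a placeholder):
# Email notifications: send mail when the job ends or fails
#SBATCH --mail-user=<your-email-address>
#SBATCH --mail-type=END,FAIL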
...
What’s the Current Cluster Utilization?
There are a number of ways to see the current status of the cluster:
arccjobs: Prints a table showing active projects and jobs.
pestat: Prints a node list with allocated jobs; it can also query individual nodes (see the example after the pestat output below).
sinfo: View the status of Slurm partitions and nodes. Nodes that are drained can be listed using the -R flag.
OnDemand's MedicineBow System Status page.
Example output from arccjobs:
[]$ arccjobs
===============================================================================
Account Running Pending
User jobs cpus cpuh jobs cpus cpuh
===============================================================================
eap-amadson 500 500 30.42 3 3 2.00
amadson 500 500 30.42 3 3 2.00
eap-larsko 1 32 2262.31 0 0 0.00
fghorban 1 32 2262.31 0 0 0.00
pcg-llps 2 64 1794.41 0 0 0.00
hbalantr 1 32 587.68 0 0 0.00
vvarenth 1 32 1206.73 0 0 0.00
===============================================================================
TOTALS: 503 596 4087.14 3 3 2.00
===============================================================================
Nodes 9/51 (17.65%)
Cores 596/4632 (12.87%)
Memory (GB) 2626/46952 ( 5.59%)
CPU Load 803.43 (17.35%)
===============================================================================
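In the arccjobs output each project (account) row is followed by rows for its individual users; the jobs, cpus, and cpuh columns show the number of jobs, the cores allocated, and (presumably) the accumulated core-hours for running and pending work, while the summary at the bottom shows overall cluster utilization (allocated/total nodes, cores, and memory).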
Example output from pestat:
[]$ pestat
Hostname Partition Node Num_CPU CPUload Memsize Freemem Joblist
State Use/Tot (15min) (MB) (MB) JobID(JobArrayID) User ...
mba30-001 mb-a30 idle 0 96 0.00 765525 749441
mba30-002 mb-a30 idle 0 96 0.00 765525 761311
mba30-003 mb-a30 idle 0 96 0.00 765525 761189
...
mbl40s-004 mb-l40s idle 0 96 0.00 765525 761030
mbl40s-005 mb-l40s idle 0 96 0.00 765525 760728
mbl40s-007 mb-l40s idle 0 96 0.00 765525 761452
wi001 inv-wildiris idle 0 48 0.00 506997 505745
wi002 inv-wildiris idle 0 48 0.00 506997 505726
wi003 inv-wildiris idle 0 48 0.00 506997 505746
wi004 inv-wildiris idle 0 48 0.00 506997 505729
wi005 inv-wildiris idle 0 48 0.00 1031000 1020610
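To query individual nodes or a single partition, pestat accepts host and partition filters. A sketch, assuming your installed version supports the usual -w <hostlist> and -p <partition> options (check pestat -h on your system):
[]$ pestat -w mba30-001
[]$ pestat -p mb-l40s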
Example sinfo usage:
# View overall cluster:
[]$ sinfo -eO "CPUs:8,Memory:9,Gres:14,NodeAIOT:16,NodeList:50"
CPUS MEMORY GRES NODES(A/I/O/T) NODELIST
96 1023575 (null) 6/19/0/25 mbcpu-[001-025]
96 765525 gpu:a30:8 0/8/0/8 mba30-[001-008]
96 765525 gpu:l40s:8 1/4/0/5 mbl40s-[001-005]
96 765525 gpu:l40s:4 0/1/0/1 mbl40s-007
64 1023575 gpu:a6000:4 0/1/0/1 mba6000-001
48 506997 (null) 0/4/0/4 wi[001-004]
56 1031000 gpu:a30:2 0/1/0/1 wi005
96 1281554 gpu:h100:8 1/3/2/6 mbh100-[001-006]
# View a particular (investment) partition:
[]$ sinfo -p inv-wildiris
PARTITION AVAIL TIMELIMIT NODES STATE NODELIST
inv-wildiris up infinite 5 idle wi[001-005]
# View compute nodes currently drained:
[]$ sinfo -R
REASON USER TIMESTAMP NODELIST
HW Status: Unknown - slurm 2024-07-19T12:02:04 mbh100-001
Not responding slurm 2024-07-30T13:49:06 mbh100-006
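sinfo can also be restricted to specific nodes. For example, the following uses the standard -N, -n, and -O options (the node name is just an example taken from the listing above) to show the state and resources of a single node:
# View a particular node:
[]$ sinfo -N -n mbl40s-007 -O "NodeList:16,StateLong:12,CPUsState:15,Memory:9,Gres:14"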
...