Slurm: More Features

Goal: Introduce some further features, such as job efficiency and cluster utilization.


Job Efficiency

You can view the CPU and memory efficiency of a job using the seff command, providing a <job-id>.

[]$ seff 13515489
Job ID: 13515489
Cluster: <cluster-name>
User/Group: <username>/<username>
State: COMPLETED (exit code 0)
Cores: 1
CPU Utilized: 00:00:05
CPU Efficiency: 27.78% of 00:00:18 core-walltime
Job Wall-clock time: 00:00:18
Memory Utilized: 0.00 MB (estimated maximum)
Memory Efficiency: 0.00% of 8.00 GB (8.00 GB/node)

Note:

  • Only accurate if the job is successful.

  • If the job fails, for example with an OOM (Out-Of-Memory) error, the details will be inaccurate; an sacct-based alternative is sketched after this list.

  • The seff summary is also emailed to you if you have Slurm email notifications turned on.
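If you need to inspect a job that did not complete successfully, one option (a general Slurm sketch, not an ARCC-specific tool) is to query the accounting records directly with sacct. The job ID below reuses the example above, and the chosen fields are just one reasonable selection:

# Per-step elapsed time, CPU time, requested memory, and peak memory (MaxRSS):
[]$ sacct -j 13515489 --format=JobID,State,Elapsed,TotalCPU,ReqMem,MaxRSS

Comparing MaxRSS against ReqMem gives a rough memory-efficiency figure even when seff cannot.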


What’s the Current Cluster Utilization?

There are a number of ways to see the current status of the cluster:

  • arccjobs: Prints a table showing active projects and jobs.

  • pestat: Prints a node list with allocated jobs - can query individual nodes.

  • sinfo: View the status of the Slurm partitions or nodes. The reason nodes are drained (or down) can be seen using the -R flag; see the sinfo examples after the outputs below.

  • OnDemand’s MedicineBow System Status page.

[]$ arccjobs
===============================================================================
Account                     Running                    Pending
  User                jobs  cpus      cpuh       jobs  cpus     cpuh
===============================================================================
eap-amadson            500   500     30.42          3     3     2.00
  amadson              500   500     30.42          3     3     2.00
eap-larsko               1    32   2262.31          0     0     0.00
  fghorban               1    32   2262.31          0     0     0.00
pcg-llps                 2    64   1794.41          0     0     0.00
  hbalantr               1    32    587.68          0     0     0.00
  vvarenth               1    32   1206.73          0     0     0.00
===============================================================================
TOTALS:                503   596   4087.14          3     3     2.00
===============================================================================
Nodes            9/51       (17.65%)
Cores          596/4632     (12.87%)
Memory (GB)   2626/46952    ( 5.59%)
CPU Load       803.43       (17.35%)
===============================================================================
[]$ pestat
Hostname       Partition     Node Num_CPU  CPUload  Memsize  Freemem  Joblist
                            State Use/Tot  (15min)     (MB)     (MB)  JobID(JobArrayID) User ...
...
mba30-001         mb-a30     idle   0  96     0.00   765525   749441
mba30-002         mb-a30     idle   0  96     0.00   765525   761311
mba30-003         mb-a30     idle   0  96     0.00   765525   761189
...
mbl40s-004       mb-l40s     idle   0  96     0.00   765525   761030
mbl40s-005       mb-l40s     idle   0  96     0.00   765525   760728
mbl40s-007       mb-l40s     idle   0  96     0.00   765525   761452
wi001       inv-wildiris     idle   0  48     0.00   506997   505745
wi002       inv-wildiris     idle   0  48     0.00   506997   505726
wi003       inv-wildiris     idle   0  48     0.00   506997   505746
wi004       inv-wildiris     idle   0  48     0.00   506997   505729
wi005       inv-wildiris     idle   0  56     0.00  1031000  1020610
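The sinfo command mentioned above can be used in a similar way. A few generic examples are sketched below; the partition name is taken from the pestat output above and is only an illustration:

# Summary of partitions and node states:
[]$ sinfo
# Limit the view to a single partition:
[]$ sinfo -p mb-a30
# Show why nodes are down or drained:
[]$ sinfo -R
# Cluster-wide CPU counts as allocated/idle/other/total:
[]$ sinfo -o "%C"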

ARCC Related Usage Scripts
