Slurm: More Features

Goal: Introduce some further features, such as job efficiency and cluster utilization.


Job Efficiency

You can view the CPU and memory efficiency of a job using the seff command, providing a <job-id>:

[]$ seff 13515489
Job ID: 13515489
Cluster: <cluster-name>
User/Group: <username>/<username>
State: COMPLETED (exit code 0)
Cores: 1
CPU Utilized: 00:00:05
CPU Efficiency: 27.78% of 00:00:18 core-walltime
Job Wall-clock time: 00:00:18
Memory Utilized: 0.00 MB (estimated maximum)
Memory Efficiency: 0.00% of 8.00 GB (8.00 GB/node)

Note:

  • The figures are only accurate if the job completed successfully.

  • If the job fails, for example with an OOM (Out-Of-Memory) error, the reported details will be inaccurate; see the sacct sketch after this list for one way to cross-check usage.

  • This summary is also emailed to you if you have Slurm email notifications turned on, as shown in the second sketch below.
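When seff's numbers look suspect, for example after an OOM failure, you can query the accounting database directly with sacct. A minimal sketch using the job ID from above; MaxRSS is the recorded peak memory for each job step, and the format fields shown are standard sacct fields:

[]$ sacct -j 13515489 --format=JobID,State,ExitCode,Elapsed,TotalCPU,MaxRSS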
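Email notifications themselves are requested per job with the standard --mail-type and --mail-user options in your batch script. A minimal sketch; the address is a placeholder you would replace with your own:

#SBATCH --mail-type=END,FAIL
#SBATCH --mail-user=<username>@example.edu

With --mail-type=END,FAIL a notification is sent when the job finishes or fails.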


What’s the Current Cluster Utilization?

There are a number of ways to see the current status of the cluster:

  • arccjobs: Prints a table showing active projects and jobs.

  • pestat: Prints a node list with allocated jobs; it can also query individual nodes.

  • sinfo: View the status of the Slurm partitions and nodes. The reason nodes are drained or down can be listed using the -R flag (a short sketch follows the outputs below).

  • OnDemand’s MedicineBow System Status page.

[]$ arccjobs
===============================================================================
Account                       Running                  Pending
  User                 jobs   cpus      cpuh    jobs   cpus      cpuh
===============================================================================
eap-amadson             500    500     30.42       3      3      2.00
  amadson               500    500     30.42       3      3      2.00
eap-larsko                1     32   2262.31       0      0      0.00
  fghorban                1     32   2262.31       0      0      0.00
pcg-llps                  2     64   1794.41       0      0      0.00
  hbalantr                1     32    587.68       0      0      0.00
  vvarenth                1     32   1206.73       0      0      0.00
===============================================================================
TOTALS:                 503    596   4087.14       3      3      2.00
===============================================================================
Nodes          9/51        (17.65%)
Cores        596/4632      (12.87%)
Memory (GB) 2626/46952     ( 5.59%)
CPU Load     803.43        (17.35%)
===============================================================================
[]$ pestat
Hostname       Partition     Node Num_CPU  CPUload  Memsize  Freemem  Joblist
                            State Use/Tot  (15min)     (MB)     (MB)  JobID(JobArrayID) User
...
mba30-001      mb-a30        idle    0  96    0.00   765525   749441
mba30-002      mb-a30        idle    0  96    0.00   765525   761311
mba30-003      mb-a30        idle    0  96    0.00   765525   761189
...
mbl40s-004     mb-l40s       idle    0  96    0.00   765525   761030
mbl40s-005     mb-l40s       idle    0  96    0.00   765525   760728
mbl40s-007     mb-l40s       idle    0  96    0.00   765525   761452
wi001          inv-wildiris  idle    0  48    0.00   506997   505745
wi002          inv-wildiris  idle    0  48    0.00   506997   505726
wi003          inv-wildiris  idle    0  48    0.00   506997   505746
wi004          inv-wildiris  idle    0  48    0.00   506997   505729
wi005          inv-wildiris  idle    0  56    0.00  1031000  1020610
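For the sinfo bullet above, a couple of typical invocations are sketched below; the exact output format varies with each cluster's configuration. The -R flag lists the reason recorded for any node that is down or drained. The last line assumes pestat's usual -w option for restricting output to a host list, here with a node name taken from the output above:

[]$ sinfo
[]$ sinfo -R
[]$ pestat -w mba30-001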