Overview: What is Performance

In its simplest form, there are four performance metrics any user can consider:

  1. How much were the core(s) utilized?

  2. How much memory was used?

  3. How long did the job take?

  4. How much read/write was performed? (This isn’t currently covered on this page.)

Check Overall Job Efficiency

To get a general sense of how a job has performed, you can use the following two methods:

Command: seff

From the command line use the seff command to display basic job performance:

[]$ seff 12347496
Job ID: 12347496
Cluster: teton
User/Group: salexan5/salexan5
State: COMPLETED (exit code 0)
Nodes: 1
Cores per node: 40
CPU Utilized: 3-23:17:34
CPU Efficiency: 99.25% of 4-00:00:40 core-walltime
Job Wall-clock time: 02:24:01
Memory Utilized: 102.99 MB
Memory Efficiency: 0.13% of 80.00 GB

The -d option also displays the raw data:

[]$ seff -d 12347496
Slurm data: JobID ArrayJobID User Group State Clustername Ncpus Nnodes Ntasks Reqmem PerNode Cput Walltime Mem ExitStatus
Slurm data: 12347496  salexan5 salexan5 COMPLETED teton 40 1 1 83886080 1 343054 8641 105460 0

Job ID: 12347496
Cluster: teton
User/Group: salexan5/salexan5
State: COMPLETED (exit code 0)
Nodes: 1
Cores per node: 40
CPU Utilized: 3-23:17:34
CPU Efficiency: 99.25% of 4-00:00:40 core-walltime
Job Wall-clock time: 02:24:01
Memory Utilized: 102.99 MB
Memory Efficiency: 0.13% of 80.00 GB

Note that these details are only accurate if the job completed successfully.

Email

Depending on how many jobs you have running, you can use the Slurm email options. By adding the following two lines to your batch script, a mini report will be emailed to you whenever the state of a job changes; this can include started, pre-empted, and finished. When a job has finished, the same results you can retrieve using the seff command are emailed to you. Obviously, if you are submitting hundreds or thousands of jobs, this can be impractical.

#SBATCH --mail-type=ALL
#SBATCH --mail-user=<email-address>

Command: sacct

The sacct command displays accounting data for all jobs and job steps in the Slurm job accounting log or Slurm database.

If you want more detail than that provided by the seff command, sacct provides many more fields.

Calling sacct -e will list all available fields. Please refer to the Slurm sacct documentation for descriptions.

[]$ sacct -e
Account             AdminComment        AllocCPUS           AllocGRES
...

Example

[]$ sacct --format="JobID%20,JobName,User,Partition,NodeList,Elapsed,State,ExitCode,MaxRSS,AveRSS,AveCPU" -j 12347496
               JobID    JobName      User  Partition        NodeList    Elapsed      State ExitCode     MaxRSS     AveRSS     AveCPU
-------------------- ---------- --------- ---------- --------------- ---------- ---------- -------- ---------- ---------- ----------
            12347496    vsearch  salexan5 teton-cas+            t465   02:24:01  COMPLETED      0:0
      12347496.batch      batch                                 t465   02:24:01  COMPLETED      0:0    105460K    105460K 3-23:17:32

Starting to Interpret Jobs

Computation

Here is a contrived example, but it starts to demonstrate some of the things to look out for. I know the application I’m running only invokes four threads, so in my Slurm script I only need to request cpus-per-task=4. If I submit the job and then look at seff and the equivalent sacct fields:

[]$ seff 12348706
Job ID: 12348706
Cluster: teton
User/Group: salexan5/salexan5
State: COMPLETED (exit code 0)
Nodes: 1
Cores per node: 4
CPU Utilized: 00:01:36
CPU Efficiency: 88.89% of 00:01:48 core-walltime
Job Wall-clock time: 00:00:27
Memory Utilized: 2.63 MB
Memory Efficiency: 0.07% of 3.91 GB

[]$ sacct --format="JobID%20,NCPUS,Timelimit,Elapsed,CPUTime,SystemCPU,UserCPU,TotalCPU" -j 12348706
               JobID      NCPUS  Timelimit    Elapsed    CPUTime  SystemCPU    UserCPU   TotalCPU
-------------------- ---------- ---------- ---------- ---------- ---------- ---------- ----------
            12348706          4   00:05:00   00:00:27   00:01:48  00:00.005  01:35.691  01:35.697
      12348706.batch          4              00:00:27   00:01:48  00:00.005  01:35.691  01:35.697

Using the definitions from the sacct documentation, we can read this as follows: the job ran on 4 CPUs for an Elapsed (wall-clock) time of 27 seconds, so its CPUTime (core-walltime) was 4 × 27 s = 00:01:48, of which the TotalCPU of roughly 1:36 was actually spent computing, giving the 88.89% CPU efficiency reported by seff.
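The efficiency figure can be reproduced by hand from these sacct fields. Below is a minimal sketch, with the values from job 12348706 hand-converted to seconds (CPUTime = NCPUS × Elapsed; efficiency = TotalCPU / CPUTime):

```shell
#!/bin/bash
# Values copied from job 12348706 above, converted to seconds:
#   NCPUS = 4, Elapsed = 00:00:27, TotalCPU ~ 01:36
ncpus=4
elapsed=27
totalcpu=96

# CPUTime is the core-walltime made available to the job.
core_walltime=$(( ncpus * elapsed ))    # 4 * 27 = 108s, i.e. 00:01:48

# Efficiency is the fraction of that core-walltime actually consumed.
efficiency=$(awk -v u="$totalcpu" -v c="$core_walltime" \
    'BEGIN { printf "%.2f", 100 * u / c }')

echo "Core-walltime: ${core_walltime}s"
echo "CPU efficiency: ${efficiency}%"    # 88.89%, matching seff
```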

Now, if I rerun the job, but this time set cpus-per-task=16, I get the following results:

[]$ seff 12348707
Job ID: 12348707
Cluster: teton
User/Group: salexan5/salexan5
State: COMPLETED (exit code 0)
Nodes: 1
Cores per node: 16
CPU Utilized: 00:01:27
CPU Efficiency: 24.72% of 00:05:52 core-walltime
Job Wall-clock time: 00:00:22
Memory Utilized: 2.63 MB
Memory Efficiency: 0.02% of 15.62 GB

[]$ sacct --format="JobID%20,NCPUS,Timelimit,Elapsed,CPUTime,SystemCPU,UserCPU,TotalCPU" -j 12348707
               JobID      NCPUS  Timelimit    Elapsed    CPUTime  SystemCPU    UserCPU   TotalCPU
-------------------- ---------- ---------- ---------- ---------- ---------- ---------- ----------
            12348707         16   00:05:00   00:00:22   00:05:52  00:00.008  01:26.810  01:26.819
      12348707.batch         16              00:00:22   00:05:52  00:00.008  01:26.810  01:26.819

The main thing to notice here is that the CPU efficiency has drastically dropped to 24.72%! Why?

Remember, I know that only four cores are actually being used, but I requested 16, so in this case 12 of the cores are not being used at all.

But what if you didn’t know? This should prompt you to start questioning, and thus exploring, how the application uses multiple cores. Why is this worth considering? Cores that are allocated but idle still count against your usage and are unavailable to other jobs.

Memory

Let’s look at the memory utilization of the two jobs:

[]$ seff 12348706
Memory Utilized: 2.63 MB
Memory Efficiency: 0.07% of 3.91 GB

[]$ sacct --format="JobID%20,NCPUS,ReqMEM,MaxRSS,AveRSS,AllocTRES%40" -j 12348706
               JobID      NCPUS     ReqMem     MaxRSS     AveRSS                                AllocTRES
-------------------- ---------- ---------- ---------- ---------- ----------------------------------------
            12348706          4     1000Mc                               billing=4,cpu=4,mem=4000M,node=1
      12348706.batch          4     1000Mc      2696K      2696K                   cpu=4,mem=4000M,node=1
[]$ seff 12348707
Memory Utilized: 2.63 MB
Memory Efficiency: 0.02% of 15.62 GB

[]$ sacct --format="JobID%20,NCPUS,ReqMEM,MaxRSS,AveRSS,AllocTRES%40" -j 12348707
               JobID      NCPUS     ReqMem     MaxRSS     AveRSS                                AllocTRES
-------------------- ---------- ---------- ---------- ---------- ----------------------------------------
            12348707         16     1000Mc                            billing=16,cpu=16,mem=16000M,node=1
      12348707.batch         16     1000Mc      2692K      2692K                 cpu=16,mem=16000M,node=1

Using the definitions from the sacct documentation: in our Slurm script we didn’t define any memory requirements, so we were allocated the default of 1000M per CPU (the ReqMem value of 1000Mc, where the c denotes per core). So we can calculate the total memory allocated to the jobs as 4 * 1000M = 3.91GB and 16 * 1000M = 15.62GB. The MaxRSS / AveRSS values provide an indication of what was actually used.
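The memory figures seff reports can be reproduced the same way. A sketch using job 12348706’s numbers (ReqMem of 1000M per core across 4 cores, MaxRSS of 2696K):

```shell
#!/bin/bash
# Values copied from job 12348706 above:
#   NCPUS = 4, ReqMem = 1000Mc (1000 MB per core), MaxRSS = 2696K
ncpus=4
mem_per_cpu_mb=1000
maxrss_kb=2696

alloc_kb=$(( ncpus * mem_per_cpu_mb * 1024 ))   # 4,096,000 KB allocated

awk -v used="$maxrss_kb" -v alloc="$alloc_kb" 'BEGIN {
    printf "Allocated:  %.2f GB\n", alloc / 1024 / 1024   # 3.91 GB
    printf "Utilized:   %.2f MB\n", used / 1024           # 2.63 MB
    printf "Efficiency: %.2f%%\n",  100 * used / alloc    # 0.07%
}'
```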

I know that for this example no large amounts of memory are allocated. So, in both cases, very little of the 3.91GB and 15.62GB that was allocated was actually used. Why is this worth considering? Memory that is allocated but unused is blocked from other jobs, and over-requesting can also leave your own job queued for longer.

These are two basic examples that introduce you to some of the reasons why it’s important and useful to track and analyze your job efficiency.

Time

Let’s look at the time requested, and the actual time taken, for the two jobs:

[]$ seff 12348706
Job Wall-clock time: 00:00:27

[]$ sacct --format="JobID%20,NCPUS,Timelimit,Elapsed" -j 12348706
               JobID      NCPUS  Timelimit    Elapsed
-------------------- ---------- ---------- ----------
            12348706          4   00:05:00   00:00:27
      12348706.batch          4              00:00:27
[]$ seff 12348707
Job Wall-clock time: 00:00:22

[]$ sacct --format="JobID%20,NCPUS,Timelimit,Elapsed" -j 12348707
               JobID      NCPUS  Timelimit    Elapsed
-------------------- ---------- ---------- ----------
            12348707         16   00:05:00   00:00:22
      12348707.batch         16              00:00:22

Although 5 minutes was requested for both jobs, they took about the same time: 27 and 22 seconds respectively. Why is this worth considering?

Consider the following case: I’m going to submit a job on the 24th at noon with --time=7-00:00:00 (7 days), but I’m aware that the cluster is under maintenance from midnight on the 28th. Since the cluster will not be available in three and a half days, our seven-day job cannot fit within this window, and thus will be added to the queue in a pending state, as demonstrated in the squeue results below.

[]$ squeue | grep salexan5
          12348708 moran,tet test     salexan5 PD       0:00      1 (ReqNodeNotAvail, Reserved for maintenance)
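You can check whether a requested time limit fits before a known maintenance window with a little date arithmetic. A sketch using the dates from the scenario above (the specific timestamps are assumptions for illustration, and the GNU date syntax matches the cluster’s Linux nodes):

```shell
#!/bin/bash
# Submitting on the 24th at noon; maintenance starts at midnight on the 28th.
submit_epoch=$(date -d "2021-01-24 12:00" '+%s')
maint_epoch=$(date -d "2021-01-28 00:00" '+%s')

window_hours=$(( (maint_epoch - submit_epoch) / 3600 ))   # 84 h = 3.5 days
requested_hours=$(( 7 * 24 ))                             # --time=7-00:00:00

echo "Window before maintenance: ${window_hours}h, requested: ${requested_hours}h"
if [ "$requested_hours" -gt "$window_hours" ]; then
    echo "Job cannot complete before maintenance; it will pend until afterwards."
fi
```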

Alternative: Command: time

There are alternatives to using the Slurm commands, such as adding the Linux time command to your script, and/or using bash commands, as demonstrated in the following examples:

Bash: time

Here is a bash time example:

#!/bin/bash
...
start=$(date +'%D %T')
echo "Start:" $start

time <run your application>

end=$(date +'%D %T')
echo "End:" $end

start_secs=$(date --date="$start" '+%s')
end_secs=$(date --date="$end"   '+%s')
duration=$((end_secs - start_secs))
echo "Duration:" $duration"sec"
echo "Done."

Using the above, within your output you will see something of the form:

Start: 01/27/21 21:10:39
...
real    0m21.951s
user    1m26.882s
sys     0m0.001s
End: 01/27/21 21:11:01
Duration: 22sec
Done.
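If you only need the elapsed duration, bash’s built-in SECONDS variable avoids the date parsing in the script above:

```shell
#!/bin/bash
# SECONDS counts the seconds elapsed since the shell started,
# or since it was last assigned to.
SECONDS=0

sleep 2    # stand-in for <run your application>

echo "Duration: ${SECONDS}sec"
```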

System: time

The alternative is to use the system’s time command. Have a read of the man time page for more details, and look at the -v option depending on how much detail you want when using the command:

[]$ man time
TIME(1)                       Linux User’s Manual                      TIME(1)

NAME
       time - time a simple command or give resource usage

SYNOPSIS
       time [options] command [arguments...]

DESCRIPTION
       The  time  command  runs the specified program command with the given arguments.  When command finishes, time writes a message to standard error giving
       timing statistics about this program run.  These statistics consist of (i) the elapsed real time between invocation and termination, (ii) the user  CPU
       time  (the sum of the tms_utime and tms_cutime values in a struct tms as returned by times(2)), and (iii) the system CPU time (the sum of the tms_stime
       and tms_cstime values in a struct tms as returned by times(2)).

       Note: some shells (e.g., bash(1)) have a built-in time command that provides less functionality than the command described here.  To  access  the  real
       command, you may need to specify its pathname (something like /usr/bin/time).
...

The only change in the above example is the following:

/usr/bin/time ./a.out
or
/usr/bin/time -v ./a.out

Resulting in:

95.21user 0.00system 0:23.83elapsed 399%CPU (0avgtext+0avgdata 728maxresident)k
0inputs+0outputs (0major+276minor)pagefaults 0swaps
or
        Command being timed: "./a.out"
        User time (seconds): 95.80
        System time (seconds): 0.00
        Percent of CPU this job got: 399%
        Elapsed (wall clock) time (h:mm:ss or m:ss): 0:23.99
        Average shared text size (kbytes): 0
        Average unshared data size (kbytes): 0
        Average stack size (kbytes): 0
        Average total size (kbytes): 0
        Maximum resident set size (kbytes): 740
        Average resident set size (kbytes): 0
        Major (requiring I/O) page faults: 0
        Minor (reclaiming a frame) page faults: 279
        Voluntary context switches: 6
        Involuntary context switches: 116
        Swaps: 0
        File system inputs: 0
        File system outputs: 0
        Socket messages sent: 0
        Socket messages received: 0
        Signals delivered: 0
        Page size (bytes): 4096
        Exit status: 0

Check Running Jobs

Command: sstat

Similar to sacct, the sstat command displays various status information for a job/step while it is actually running (whereas sacct is primarily for completed jobs).

It is a command-line tool, but scripts can be developed that call the command at regular intervals while a job is running, letting you monitor key metrics such as CPU frequency, maximum and average RSS, and disk reads and writes.
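As a sketch, such a monitoring script might poll sstat while the job remains in the queue. The field names below are standard sstat format options, but the function name, polling interval, and job ID are placeholders:

```shell
#!/bin/bash
# Hypothetical helper: poll sstat for a running job until it leaves the queue.
# Usage: monitor_job <jobid> [interval-seconds]
monitor_job() {
    local jobid=$1 interval=${2:-60}
    # squeue prints a line while the job is still pending or running.
    while squeue -h -j "$jobid" 2>/dev/null | grep -q .; do
        sstat -j "${jobid}.batch" \
              --format=JobID,AveCPU,AveRSS,MaxRSS,AveCPUFreq,MaxDiskRead,MaxDiskWrite
        sleep "$interval"
    done
}

# Example call (job ID is a placeholder):
# monitor_job 12348706 30
```

The loop terminates on its own once the job finishes, so it can be launched in the background alongside the job, or from a login node.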

Note: