Slurm Job Performance

Overview: What is Performance

In its simplest form, there are three performance metrics any user can consider, plus a fourth that is not yet covered on this page:

  1. How fully were the core(s) utilized?

  2. How much memory was used?

  3. How long did the job take?

  4. How much read/write (I/O) was performed?

Check Overall Job Efficiency

To get a general sense of how a job has performed, you can use the following two methods:

Command: seff

From the command line, use the seff command to display basic job performance:

[]$ seff 12347496
Job ID: 12347496
Cluster: teton
User/Group: salexan5/salexan5
State: COMPLETED (exit code 0)
Nodes: 1
Cores per node: 40
CPU Utilized: 3-23:17:34
CPU Efficiency: 99.25% of 4-00:00:40 core-walltime
Job Wall-clock time: 02:24:01
Memory Utilized: 102.99 MB
Memory Efficiency: 0.13% of 80.00 GB

There is also a -d option that displays the raw data:

[]$ seff -d 12347496
Slurm data: JobID ArrayJobID User Group State Clustername Ncpus Nnodes Ntasks Reqmem PerNode Cput Walltime Mem ExitStatus
Slurm data: 12347496 salexan5 salexan5 COMPLETED teton 40 1 1 83886080 1 343054 8641 105460 0
Job ID: 12347496
Cluster: teton
User/Group: salexan5/salexan5
State: COMPLETED (exit code 0)
Nodes: 1
Cores per node: 40
CPU Utilized: 3-23:17:34
CPU Efficiency: 99.25% of 4-00:00:40 core-walltime
Job Wall-clock time: 02:24:01
Memory Utilized: 102.99 MB
Memory Efficiency: 0.13% of 80.00 GB

These details are only accurate if the job successfully completed.

Email

Depending on how many jobs you have running, you can use the Slurm email options. Adding the following two lines to your batch script will email you a mini report whenever the state of a job changes, for example started, pre-empted, or finished. When a job has finished, the same details you can retrieve using the seff command are emailed to you. Obviously, if you are submitting hundreds or thousands of jobs then this can be impractical.

#SBATCH --mail-type=ALL
#SBATCH --mail-user=<email-address>

Command: sacct

The sacct command displays accounting data for all jobs and job steps in the Slurm job accounting log or Slurm database.

If you want more detail than is provided by the seff command, sacct offers many more fields.

Calling sacct -e will list all available fields. Please refer to the Slurm sacct documentation for descriptions.

Example
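
As a hedged sketch, the following pulls a broader set of fields for the job shown earlier (the field selection here is illustrative rather than a recommendation):

[]$ sacct -j 12347496 --format=JobID,JobName,State,NCPUS,Elapsed,CPUTime,TotalCPU,MaxRSS,ReqMem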

Starting to Interpret Jobs

Computation

Here is a contrived example, but it starts to demonstrate some of the things to look out for. I know the application I’m running only invokes four threads, so in my Slurm script I only need to request cpus-per-task=4. If I submit the job and then look at the seff and equivalent sacct field details:

 
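For reference, a minimal sketch of such a submission script, assuming a hypothetical four-threaded application named my_application:

#!/bin/bash
#SBATCH --time=00:05:00
#SBATCH --nodes=1
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=4

# Run the (assumed) four-threaded application
srun ./my_application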

Using the definitions as given in the sacct documentation:

  • NCPUS: Total number of CPUs allocated to the job.

  • Timelimit: What the timelimit was/is for the job.

  • Elapsed: The job's elapsed time.

  • CPUTime: Time used (Elapsed time * CPU count) by a job or step in HH:MM:SS format.

  • SystemCPU: The amount of system CPU time used by the job or job step.

  • UserCPU: The amount of user CPU time used by the job or job step.

  • TotalCPU: The sum of the SystemCPU and UserCPU time used by the job or job step.

So, we can read this as the job:

  • took 27 seconds to complete.

  • had a CPU time of 4 x 27 seconds = 108 seconds, i.e. 1:48.

  • had a total CPU time (how long the CPUs were actually in use) of 1:36.

  • Comparing the two: 1:36 is 88.89% of 1:48, which can be interpreted as our CPUs being busy for 88.89% of the time - which is pretty efficient.

Now, if I rerun the job, but this time set cpus-per-task=16, I get the following results:

The main thing to notice here is that the CPU efficiency has drastically dropped to 24.72%! Why?

Remember, I know that only four cores are actually being used, but I requested 16, so in this case 12 of the cores are not being used at all.

But what if you didn’t know? This should prompt you to start questioning, and thus exploring, how the application uses multiple cores. Why is this worth considering?

  • Users typically run tens, hundreds, or thousands of jobs. If you requested a standard teton node that has 32 cores, you could have only two jobs per node (when using cpus-per-task=16). But if we know only four cores are actually needed, we can define cpus-per-task=4 and potentially have eight jobs running on a single node. The first case is a wasteful request for resources, and when the cluster is being heavily used you might find your jobs pending/waiting. The latter case is significantly more efficient, and your batch of jobs will finish sooner.

  • There are still a large number of applications and commands that do not use multiple cores or threads; they are single-core, single-threaded. Even if you request 16 cores, unless the application is programmed to use them all, it will still only use one, with the other 15 sitting idle. Always read the documentation or man/help for the application/command and check whether it can use multiple cores. Typically there will be a command option that lets you set the number of threads, but by default this is often set to 1 (see the sketch after this list).

  • If you are not sure, please ask us: this is one of the services we offer, helping researchers understand their application and whether or not it is multi-node, multi-core, or multi-threaded. Once we know, we’ll share this via the appropriate page with the wider community.
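
As a minimal sketch of keeping the application's thread count in step with the request (the --threads flag here is a hypothetical application option, not a standard command; SLURM_CPUS_PER_TASK is set by Slurm to match cpus-per-task):

#SBATCH --cpus-per-task=4

# Pass the allocated core count straight to the (hypothetical) application
srun ./my_application --threads=${SLURM_CPUS_PER_TASK}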

Memory

Let’s look at the memory utilization of the two jobs:

Using the definitions as given in the sacct documentation:

  • MaxRSS: Maximum resident set size of all tasks in job.

  • AveRSS: Average resident set size of all tasks in job.

  • ReqMem: Minimum required memory for the job, in MB. A 'c' at the end of the number represents Memory Per CPU, an 'n' represents Memory Per Node.

  • AllocTRES: Trackable resources (TRES). These are the resources allocated to the job/step after the job started running.

In our Slurm script we didn’t define any memory requirements, so we were allocated the default of 1000M per CPU. We can therefore calculate the total memory allocated to the jobs as 4 x 1000M = 3.91GB and 16 x 1000M = 15.62GB. The MaxRSS / AveRSS values provide an indication of what was actually used.
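
A hedged sketch of pulling these memory fields for a finished job (the job ID is a placeholder):

[]$ sacct -j <jobid> --format=JobID,ReqMem,MaxRSS,AveRSS,AllocTRES%40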

I know that this example does not allocate any large amounts of memory. So, in both cases very little of the 3.91GB and 15.62GB that was allocated was actually used. Why is this worth considering?

  • Again, it goes back to efficient resource allocation. A teton node typically has 128GB, of which probably 124GB is actually available (remember the node needs memory to run the OS). If I’d requested four CPUs per job but asked for 60GB, then only two jobs would fit onto a teton node (2 x 60 = 120 < 124, while 3 x 60 = 180 > 124). So, even though we know we could use more cores on that 32-core node, we can’t, because we’re limited by our memory request.

  • One of the most common reasons for jobs failing is Out-Of-Memory, i.e. the application has tried to use more memory than the job was allocated. What we often see is that a researcher starts with a small data set that fits within a certain memory allocation. Their data set gets larger but they don’t request more memory. Or maybe the configuration for the simulation changes, e.g. a mesh is made finer, requiring more data points to be calculated and stored in memory. In both cases, if the researcher notes and tracks memory usage, they will know when a job needs a larger memory allocation.

  • Now, some researchers use --mem=0, which requests ALL the memory on a node. Although there are cases where this is required, please note this essentially gives the user the entire node, and if they’re only using, say, 4 cores, then there are 28 cores that are not available to anyone else. If the job only uses a fraction of the total memory requested then this is a wasteful resource request and will affect not only how quickly their batch of jobs can be allocated and run, but also everyone else! We have a fair-share policy on teton and ask users to try to request appropriate resources (a sketch of an explicit memory request follows this list).
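
As a minimal sketch (the values are placeholders, sized from an observed MaxRSS rather than a recommendation), an explicit memory request in a batch script looks like:

#SBATCH --cpus-per-task=4
#SBATCH --mem-per-cpu=2G

Alternatively, #SBATCH --mem=8G requests memory per node rather than per CPU.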

 

These are two basic examples that introduce you to some of the reasons why it’s important and useful to track and analyze your job efficiency.

Time

Let’s look at the time requested and the actual time taken for the two jobs:

Although 5 minutes was requested for both, they both took about the same time: 27 and 22 seconds respectively. Why is this worth considering?

  • First, it can again highlight the multi-core aspects of a job. Remember, the second job requested 16 cores but took about the same time as the job with only 4 cores; this should make you question whether or not this example really uses multiple cores.

  • Second, it can affect the queuing of jobs. Slurm job queuing and allocation can be a mystical black box that weighs up an array of job resources, with time being one of them. Remember, Slurm can only go on the time you have requested; it cannot predict how long the job will actually take. One of the common consequences appears when the cluster is about to undergo maintenance.

Consider the following case: I’m going to submit a job on the 24th at noon with time=7-00:00:00 (7 days), but I’m aware that the cluster is under maintenance from midnight on the 28th. Since the cluster will not be available in three and a half days, our seven-day job cannot fit within this window, and thus will be added to the queue in a pending state, as demonstrated in the squeue results below.
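
A hedged sketch of checking pending jobs and the start times Slurm currently estimates for them (--start lists the expected start times of pending jobs):

[]$ squeue -u $USER --start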

Alternative: Command: time

There are alternatives to using the Slurm commands, such as using bash’s time keyword or the Linux time command in your script, as demonstrated in the following examples:

Bash: time

Here is a bash time example:
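
A minimal sketch, assuming the same hypothetical my_application as in the earlier examples:

#!/bin/bash
#SBATCH --time=00:05:00
#SBATCH --cpus-per-task=4

# bash's time keyword wraps the command and prints real/user/sys when it finishes
time ./my_application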

Using the above, within your output you will see something of the form:
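
(The numbers below are placeholders, roughly in line with the four-core example above.)

real    0m27.045s
user    1m35.867s
sys     0m0.214s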

System: time

The alternative is to use the system’s time command. Have a read of the man time page for more details, and look at the -v option depending on how much detail you want when using the command:

The only change in the above example is the following:
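
A sketch, assuming the same hypothetical application (the -v flag asks the system time binary for its verbose report):

/usr/bin/time -v ./my_application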

Resulting in:
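
(An abridged sketch: the field names are those printed by GNU time -v, while the values are placeholders roughly in line with the four-core example.)

Command being timed: "./my_application"
User time (seconds): 95.86
System time (seconds): 0.21
Percent of CPU this job got: 355%
Elapsed (wall clock) time (h:mm:ss or m:ss): 0:27.04
Maximum resident set size (kbytes): 103424
Exit status: 0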

Check Running Jobs

Command: sstat

Similar to sacct, the sstat command displays various status information for a job/step while it is actually running (sacct is more for after completion).

It is a command-line tool, but scripts can be developed that call the command at regular intervals while a job is running, so you can monitor various key metrics such as CPU frequency, maximum and average RSS, as well as disk reads and writes.
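
A hedged sketch of a one-off query against a running job (the job ID is a placeholder; --allsteps covers all of the job's steps):

[]$ sstat -j <jobid> --allsteps --format=JobID,AveCPUFreq,MaxRSS,AveRSS,MaxDiskRead,MaxDiskWrite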

Note:

  • This command does not directly call the compute nodes a job is running on.

  • It makes a call to the central Slurm controller, and if this is bombarded by too many calls too quickly it will affect the performance of Slurm across the cluster. In fact, the documentation does state: “Do not run sstat or other Slurm client commands that send remote procedure calls to slurmctld from loops in shell scripts or other programs. Ensure that programs limit calls to sstat to the minimum necessary for the information you are trying to gather.” So, please use it appropriately and with consideration.