Introduction to Job Submission 01: Nodes, Tasks and Processors


Introduction

The Slurm page introduces the basics of creating a batch script that is used on the command line with the sbatch command to submit and request a job on the cluster. This page is an extension that goes into a little more detail focusing on the use of the following options:

  1. nodes

  2. ntasks-per-node

  3. cpus-per-task

  4. ntasks

and how they can be used and combined to request specific configurations of nodes, tasks and cores.

A complete list of options can be found on the Slurm: sbatch manual page or by typing man sbatch from the command line when logged onto teton.

Aims: The aims of this page are to:

  • Get the user to start thinking about the resources they require for a job and how to select the appropriate configuration and number of nodes, tasks and cores.

  • Extend a user's knowledge of the options available and how they function when creating batch scripts for submitting jobs.


Note:

  • This is an introduction, and not a complete overview of all the possible options available when using Slurm.

  • There are always alternatives to what is listed on this page, and how they can be used.

  • Using the options and terminology on this page, ARCC is better able to support you.

  • There are no hard and fast rules for which configuration you should use. You need to:

    • Understand how your application works with respect to running across a cluster and using parallelism (if at all). For example, is it built upon MPI or OpenMP, and can it use multiple nodes and cores?

    • Be aware that not all applications can use parallelism; in that case you might instead simply run hundreds of separate tasks.

Please share with ARCC your experiences of various configurations for whatever application you use so we can share it with the wider UW (and beyond) research community.

Prerequisites: You:

  • have an understanding of HPC systems and have taken the Intro to HPC at UW (online)

  • can create a simple bash script as detailed on the Slurm parent and child pages.

  • understand the terms: nodes, tasks, cores.

 

Resources

UW's HPC cluster is made up of over 500 nodes as described on the Teton Overview page, divided up into common hardware configurations using partitions. As this is an introduction, we will only be demonstrating examples using the moran (16 cores) and teton (32 cores) partitions. When you submit a job with no partition defined, Slurm will first try and allocate using the resources available on moran and then teton.

Diagram Key: Within the diagrams that follow:

 

Node: A node is made up of cores, and depending on the type of node it might have 16, 32, 40 or even more cores. Any allocated node will by default have a single task.

 

Task: A task runs on a single node. You can have multiple tasks running on a single node. You cannot have a single task running over multiple nodes. By default a task is allocated one core, but depending on your options it can have multiple cores. All the cores associated with a task are shown enclosed within the task's black boundary.

Core: A single core. Nodes are made up of cores. If a node is made up of 16 cores, then there will be 16 core icons within the node.

So, in the following diagram:

 

We have a single node made up of sixteen cores. There is one task running on that node, with that task using one core.

As an aside, what this also means is that there are fifteen cores not being used which can be allocated to other jobs.

Nodes

If you do not use any of the four options, by default Slurm will allocate a single node, with a single task, using a single core. This is mimicked using the following:

#SBATCH --nodes=1

If you require more nodes, for example four, then use:

#SBATCH --nodes=4

Note that although we have allocated four nodes, each node is still only running a single task using a single core.

Tasks

In the last example we can see that running four nodes, each with one task (using one core) is not the most efficient use of resources.
If you require multiple tasks, or maybe your application requires tasks to be grouped on a single node (there are lots of potential scenarios) then you can use the ntasks-per-node option.
By default, only one task is allocated per node:

#SBATCH --nodes=1
#SBATCH --ntasks-per-node=1

If you require say four tasks, use:
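A minimal sketch of such a request, assuming a single node as in the preceding example (only the task count changes):

```shell
#SBATCH --nodes=1
#SBATCH --ntasks-per-node=4
```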

Or maybe 16 tasks, use:
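Sketched the same way, again assuming one node. At one core per task, 16 tasks would fill a 16-core moran node:

```shell
#SBATCH --nodes=1
#SBATCH --ntasks-per-node=16
```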


But what if you require more than 16 tasks on an individual node? Say 17?


Although there are only 16 cores on the moran nodes, Slurm will automatically try to allocate your job on the teton nodes that have 32 cores.

 

What happens if I require more than 32 tasks on a single node? Say 33?


Without going into detail, as this is only an introduction, Slurm is aware of other hardware configurations across the cluster and will do its best to allocate your job on nodes that can accommodate your request.
But, if you get the following error message on the command line:

This means you are asking for too many cores (per node) on the partition you are using. To solve this you need to reduce the overall number of cores being allocated per individual node.
The following will cause such an error as you are trying to allocate 17 tasks (each using a single core) on a partition that only has nodes with 16 cores.
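A sketch of a request matching that description. The explicit moran partition is an assumption, taken from the 16-core figure quoted earlier on this page:

```shell
#SBATCH --partition=moran
#SBATCH --nodes=1
#SBATCH --ntasks-per-node=17
```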

Cores per Node/Task

By default a node will be allocated a single task using a single core. Depending on how your application is parallelised it might be able to use multiple cores per task. The cpus-per-task option allows you to define the number of cores that will be allocated to each task.

Remember, that by default, a single task is allocated per node. This is the same as the following:
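Written out explicitly, the defaults would presumably look like this (a sketch in which every value is simply 1):

```shell
#SBATCH --nodes=1
#SBATCH --ntasks-per-node=1
#SBATCH --cpus-per-task=1
```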

The following is an example of requesting a single node, but running two tasks, with each task using four cores. In total the node will use 2 * 4 = 8 cores.
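A minimal sketch of the directives that description corresponds to:

```shell
#SBATCH --nodes=1
#SBATCH --ntasks-per-node=2
#SBATCH --cpus-per-task=4
```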

A node has a maximum number of cores that can be allocated (moran nodes have 16, teton nodes have 32) and you cannot request more than that maximum. Within your options, the value of ntasks-per-node * cpus-per-task cannot exceed the maximum number of cores for the type of node you are requesting. If you specifically request the teton partition where each node has 32 cores, you could request:
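A sketch of such a request, using the values of five tasks and six cores per task. The single node is an assumption:

```shell
#SBATCH --partition=teton
#SBATCH --nodes=1
#SBATCH --ntasks-per-node=5
#SBATCH --cpus-per-task=6
```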

 

ntasks-per-node (=5) * cpus-per-task (=6) : total 30 cores which is less than the maximum of 32.

But, if you tried the following:
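A sketch of a request that would exceed the limit, assuming the same teton partition and a single node:

```shell
#SBATCH --partition=teton
#SBATCH --nodes=1
#SBATCH --ntasks-per-node=8
#SBATCH --cpus-per-task=6
```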

The job submission would fail with the following:

This is because ntasks-per-node (=8) * cpus-per-task (=6) gives a total of 48 cores, which is more than is available on that type of node.
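The per-node limit can be checked by hand before submitting. The following is a small bash sketch of that arithmetic; `fits_on_node` is a hypothetical helper written for this page, not a Slurm command, and the core counts are the moran/teton figures quoted above.

```shell
#!/bin/bash
# Hypothetical helper (not part of Slurm): does a per-node request fit?
# Arguments: tasks per node, cores per task, cores available on the node type.
fits_on_node() {
    local ntasks_per_node=$1 cpus_per_task=$2 cores_per_node=$3
    local requested=$(( ntasks_per_node * cpus_per_task ))
    if (( requested <= cores_per_node )); then
        echo "OK: ${requested} of ${cores_per_node} cores"
    else
        echo "TOO MANY: ${requested} requested, ${cores_per_node} available"
        return 1
    fi
}

fits_on_node 5 6 32          # 5 tasks * 6 cores on a teton node
fits_on_node 8 6 32 || true  # 8 tasks * 6 cores on a teton node
```

The function returns a non-zero status when the request is too large, so it could be used as a guard in a wrapper script before calling sbatch.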

Nodes, Tasks and Cores

Again, depending on how your application is parallelised, you can request multiple nodes, running multiple tasks, each using multiple cores.
This first example illustrates requesting two nodes, with each node running two tasks, with each task using three cores. So, a total of six cores on each node, and an overall total of twelve cores for your job.
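A minimal sketch of the directives for this first configuration:

```shell
#SBATCH --nodes=2
#SBATCH --ntasks-per-node=2
#SBATCH --cpus-per-task=3
```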

This second example illustrates requesting two nodes, with each node running three tasks, with each task using four cores. So, a total of twelve cores on each node, and an overall total of twenty-four cores for your job.
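And a matching sketch for the second configuration:

```shell
#SBATCH --nodes=2
#SBATCH --ntasks-per-node=3
#SBATCH --cpus-per-task=4
```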

 

Note: For a job, each node will have the same configuration of tasks and cores. You can not request different tasks/core configurations across nodes within a specific job.
There are always different configurations to request the same overall total number of cores for a job. For example, the following two configurations both use a total of 30 cores:

3 nodes * 2 tasks per node * 5 cores per task = 30

2 nodes * 3 tasks per node * 5 cores per task = 30


Which To Use? Well, there is no right or wrong answer:

  • Your application might explicitly require a certain number of nodes.

  • You might need a certain amount of memory per core (See mem-per-cpu option - page to come) which restricts the number of cores you can request per node.

  • You might want to efficiently pack the cores you're using on a particular node.

  • Many other scenarios...

But ARCC is here to help and work with you to come up with the best configuration.

Not Defining Nodes: ntasks

If you do not need to explicitly request a specific number of nodes, you can use the ntasks option. This will try and allocate (depending on the current resources available at that time) an appropriate number of nodes that your configuration can fit onto.
This first example, explicitly using the moran partition, requests 16 tasks (using by default one core per task). We can allocate this onto a single node.
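A sketch of such a request. Note that no nodes option is given:

```shell
#SBATCH --partition=moran
#SBATCH --ntasks=16
```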


If we require 17 tasks, that will not fit on a single (16 core) node, then these will be allocated across two nodes, the first with 16 tasks, the second with 1.

If we require 24 tasks then these again will be allocated across two nodes, the first with 16 tasks, the second with 8.


If we asked for 40 tasks then we'd get three nodes (16 + 16 + 8).
If we had instead defined the teton partition, then the above three examples of 16, 17 and 24 tasks would each be allocated on a single teton node, which is able to accommodate a total of 32 cores.
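The node counts in this section follow from a ceiling division of the task count by the cores per node. A small bash sketch, assuming one core per task and that Slurm packs nodes as tightly as described here (the actual allocation depends on cluster load); `min_nodes` is a hypothetical helper, not a Slurm command:

```shell
#!/bin/bash
# Hypothetical helper: minimum number of nodes for one-core tasks.
# Arguments: number of tasks, cores per node.
min_nodes() {
    local ntasks=$1 cores_per_node=$2
    # Ceiling division: a partly filled node still counts as a node.
    echo $(( (ntasks + cores_per_node - 1) / cores_per_node ))
}

for n in 16 17 24 40; do
    echo "${n} tasks: moran=$(min_nodes "$n" 16) teton=$(min_nodes "$n" 32)"
done
```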

ntasks and cpus

We can also combine the ntasks and cpus-per-task options together:
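A sketch of a first combined request. The values of three tasks and 16 cores per task are inferred from the explanation that follows, so treat them as an assumption:

```shell
#SBATCH --partition=moran
#SBATCH --ntasks=3
#SBATCH --cpus-per-task=16
```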
Since we are explicitly requesting the moran partition with each node having 16 cores, we know we can fit one task (using 16 cores) on a single node, so three nodes will be allocated.

Now that we only require eight cores per task, we can now fit two tasks per node, so only two nodes are required to accommodate our allocation.

Finally, we are now asking to use teton nodes (32 cores per node), which can fit all three tasks onto a single node.

Notice that in the previous three examples we did not define the nodes option. The scheduler will automatically try to allocate the appropriate number of nodes that our required configuration can fit across.
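The three outcomes above can be reproduced with a little arithmetic. This bash sketch assumes three tasks in every case, with 16 then 8 cores per task on moran and 8 cores per task on teton (values inferred from the text); `nodes_needed` is a hypothetical helper, not a Slurm command.

```shell
#!/bin/bash
# Hypothetical helper: estimate the nodes needed for a request.
# A task cannot span nodes, so only whole tasks are packed onto each node.
# Arguments: number of tasks, cores per task, cores per node.
nodes_needed() {
    local ntasks=$1 cpus_per_task=$2 cores_per_node=$3
    local tasks_per_node=$(( cores_per_node / cpus_per_task ))  # whole tasks per node
    if (( tasks_per_node == 0 )); then
        echo "a single task does not fit on this node type" >&2
        return 1
    fi
    echo $(( (ntasks + tasks_per_node - 1) / tasks_per_node ))  # ceiling division
}

nodes_needed 3 16 16   # moran, 16 cores per task
nodes_needed 3 8 16    # moran, 8 cores per task
nodes_needed 3 8 32    # teton, 8 cores per task
```

This is only an estimate of the tightest packing; the scheduler may spread tasks differently depending on what is free at the time.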

nodes and ntasks

Although there is nothing wrong with using the nodes and ntasks options together, ideally you'd use one or the other. So a common question is which option to use. Again, this depends on your requirements, but here are some final examples to illustrate the differences:
The first shows five nodes, each running a single task, with each task using four cores.
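A sketch of the directives for that first case, relying on the default of one task per node:

```shell
#SBATCH --nodes=5
#SBATCH --cpus-per-task=4
```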

The second illustrates using the ntasks option which still allocates five tasks each with four cores, but now distributed across only two nodes.
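A sketch of the second case. The moran partition is an assumption: five tasks of four cores each need 20 cores, which spills onto a second node only when nodes have 16 cores.

```shell
# Assumed partition; 20 cores would fit on a single 32-core teton node.
#SBATCH --partition=moran
#SBATCH --ntasks=5
#SBATCH --cpus-per-task=4
```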

If you use nodes and ntasks together, then you will only get the number of nodes required to fulfill the number of tasks. So, although we've asked for five nodes, we've only asked for four tasks. The job will thus only be allocated four (not five) nodes.
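A sketch of the mismatched request just described:

```shell
#SBATCH --nodes=5
#SBATCH --ntasks=4
```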

Slurm will notify you of such cases with the following warning message:


Finally, if we hadn't requested any nodes, and only asked for four tasks, then this can actually fit on a single node.

Note:
Slurm will select the best allocation across the cluster for a submitted job with respect to the resources available at the time. So, depending on the current load, some of the allocations made using ntasks might have different configurations than those represented in the previous diagrams. If for some reason you specifically need a certain number of nodes, then use the nodes option.
Do not expect an even distribution of tasks across nodes. For example:
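A sketch of a request of this kind, with values taken from the explanation that follows (ten nodes, 100 tasks):

```shell
#SBATCH --nodes=10
#SBATCH --ntasks=100
```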

This will not evenly distribute ten tasks per node. Instead, if using the moran partition you will likely get a distribution of 16, 16, 16, 16, 16, 16, 1, 1, 1, 1 (total of 100), or on the teton partition a distribution of 32, 32, 29, 1, 1, 1, 1, 1, 1, 1 (total of 100). So, although you get ten nodes allocated, the number of tasks on each node is not the same. This can significantly affect the amount of memory being used on a node.
If you require an even distribution then use:
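A sketch of the even version, which fixes the number of tasks on every node instead of the overall total:

```shell
#SBATCH --nodes=10
#SBATCH --ntasks-per-node=10
```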

Shortcuts

Many of the slurm options have shortcuts:

  • --nodes : -N

  • --ntasks : -n

  • --cpus-per-task : -c

  • --partition : -p

Here is a comparison of two requests that ask for the same allocation, the first using the standard options and the second using shortcuts. Notice that a shortcut does not use an equals sign '=' between the flag character and its value, and that it is preceded by a single dash '-' rather than two '--'.
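A sketch of such a pair, with illustrative values chosen for this example (teton, two nodes, four tasks, eight cores per task). The two headers are alternatives, so a real script would contain only one of them:

```shell
# Standard (long) options
#SBATCH --partition=teton
#SBATCH --nodes=2
#SBATCH --ntasks=4
#SBATCH --cpus-per-task=8

# The same request using shortcuts
#SBATCH -p teton
#SBATCH -N 2
#SBATCH -n 4
#SBATCH -c 8
```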

How Many Cores and/or Memory Should I Request?

  • There are no hard and fast rules on how to configure your batch files as in most cases it will depend on the size of your data and extent of analysis.

  • You will need to read and understand how to use the plugin/command as they can vary.

  • Memory is still probably going to be a major factor in how many cpus-per-task you choose.

  • In the example above we were only able to use 32 cores because we ran the job on one of the teton-hugemem partition nodes. Using a standard Teton node we were only able to use 2 cores. The latter still gave us an improvement of running for 9 hours and 45 minutes, compared to 17 hours with only a single core. But, using 32 cores on a hugemem node, the job ran in 30 minutes!

    • Remember, hugemem nodes can be popular, so you might actually end up queuing for days to run a job in half an hour when you could have jumped on a Teton node immediately and already have the longer running job finished.

    • Depending on the size of data/analysis you might be able to use more cores on a Teton node.

You will need to perform/track analysis to understand what works for your data/analysis. Do not just use a hugemem node!

Summary

In this introduction we've looked at using the four sbatch options nodes, ntasks-per-node, cpus-per-task and ntasks, and various combinations of them.

There are no hard and fast rules, but we would recommend:

  • Make sure you fully understand how your application works regarding parallelization as this will direct you to the best configuration.

  • Every application is different, and every simulation can have different requirements. We recommend keeping a record of the configurations you use and tracking what works and what doesn't. If you can share this experience with us, we'd like to share it across the entire UW research community (and beyond).

  • Try testing with 'small' configurations first before requesting 'large' configurations, and check that your application is behaving as you expect.

  • There are other sbatch options, such as mem-per-cpu, as well as other node types (some with 40 cores) and other partitions, and these also all play a part in how your job is allocated. We are creating additional pages that more specifically go into these.

  • There are always alternatives, and if you are comfortable using them then by all means use them. But please be aware that we might not know your alternative and it will take us time to interpret and understand and we will often come back to you for further details and explanations. If we can use common terminology we can more efficiently and effectively assist you.

  • For those who use salloc to create an interactive session, all of the options will apply.


Finally: We welcome feedback, and if anything isn't clear, or something is missing, or in fact you think there is a mistake, please don't hesitate to contact us.
