Run a Job: Batch Script Basics and Common Terms

SLURM

At ARCC, we use the SLURM job scheduler for cluster resource management and job scheduling. SLURM is responsible for allocating resources to users and provides a framework for starting, executing, and monitoring work on the requested and allocated resources. It also allows users to schedule work for execution at a later time.

To learn more about SLURM, check out their documentation at: https://slurm.schedmd.com/documentation.html
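SLURM's day-to-day commands are small and consistent. As a quick, hedged sketch (the job ID and script name below are placeholders, not real jobs on our clusters):

sbatch my_job.sh      # submit a batch script to the queue (my_job.sh is a placeholder name)
squeue -u $USER       # list your queued and running jobs
sinfo                 # show the state of partitions and nodes
sacct -j 12345        # show accounting details for job 12345 after it has run
scancel 12345         # cancel job 12345

Each of these commands is covered in more detail in the SLURM documentation linked above.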

Jobs

A job is an allocation of resources, such as compute nodes, GPUs, or cores, that is assigned to a user for a specific amount of time. Jobs may be interactive or submitted as a batch script for scheduled execution at a later time.

When a job is assigned a specific set of hardware (a collection of nodes, cores, GPUs, etc.), the job can run commands that initiate parallel work in the form of job steps, each using some or all of the allocated hardware.
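For instance, the relationship between a job and its job steps can be seen in an interactive allocation. The resource values below are arbitrary placeholders, and exactly where salloc opens its shell can vary by cluster configuration; salloc requests the job (the allocation), and each srun inside it launches a job step on that hardware:

salloc --account=arcc --time=00:10:00 --nodes=1 --ntasks=4   # request an allocation (the job)
srun hostname                                                # a job step using the allocated tasks
srun --ntasks=2 echo "a smaller, two-task job step"          # job steps may use part of the allocation
exit                                                         # release the allocation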

Batch Scripts

Batch scripts are used to submit jobs to the cluster with the sbatch command. Your batch script will likely contain one or more srun commands to launch parallel tasks. Examples of different batch scripts may be found here.
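As a short sketch (the script name and job ID are placeholders): a script is submitted with sbatch, and by default SLURM writes the job's output to a file named slurm-<jobid>.out in the directory the job was submitted from (the --output directive changes this, and cluster defaults can differ).

sbatch my_batch_script.sh    # prints "Submitted batch job 12345"
squeue -u $USER              # watch the job while it is pending or running
cat slurm-12345.out          # read the job's output once it has run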

The Anatomy of a Batch Script

Below is an example job script with common commands. What follows is a breakdown of each line and corresponding directives within the script:

#!/bin/bash

#SBATCH --account=arcc
#SBATCH --qos=debug
#SBATCH --time=0-00:10:00
#SBATCH --nodes=1
#SBATCH --ntasks=24
#SBATCH --cpus-per-task=4
#SBATCH --mem-per-cpu=2G
#SBATCH --mail-user=cowboyjoe@uwyo.edu
#SBATCH --mail-type=BEGIN,END,FAIL
#SBATCH --get-user-env

export OMP_PROC_BIND=true
export OMP_PLACES=threads
export OMP_NUM_THREADS=32

module load miniconda3
module load gcc/14.2.0

srun echo "Start Job Process"
srun hostname
srun sleep 30
srun --cpu-bind=cores check-hybrid.gnu.pm
srun echo "End Job Process"

 

| Directive Type | Subdirective / Command | What it Does | Corresponding Line #(s) |
| --- | --- | --- | --- |
| Shebang | #!/bin/bash | Tells the system to use the bash interpreter at the path /bin/bash. | 1 |
| Sbatch | | Request your resources with #SBATCH directives, which are always preceded by a # sign and written in uppercase. These directives provide SLURM with the specifics of your job, including the hardware you require, how long you need access to it, where job output is written, and how to alert the user of job status. | 3-12 |
| | #SBATCH --account=arcc | Tells the system which account/project you're running your job under; in the example, the arcc project. If arcc were not a valid project or account, or the submitter were not a member of it, the job would not run. Account is a mandatory directive; jobs that don't specify a valid account will not run. | 3 |
| | #SBATCH --qos=debug | Tells SLURM to submit the job to the debug QOS/queue. | 4 |
| | #SBATCH --time=0-00:10:00 | Sets the job's time limit; the example requests 0 days - 00 hours : 10 minutes : 00 seconds. | 5 |
| | #SBATCH --nodes=1 | Tells SLURM the entire job will run on a single node of the cluster. | 6 |
| | #SBATCH --ntasks=24 | Tells SLURM the job will run 24 tasks. | 7 |
| | #SBATCH --cpus-per-task=4 | Tells SLURM each task needs 4 CPUs. | 8 |
| | #SBATCH --mem-per-cpu=2G | Tells SLURM to allocate 2GB of RAM per requested CPU, so each 4-CPU task can use up to 8GB. | 9 |
| | #SBATCH --mail-user=cowboyjoe@uwyo.edu | Tells SLURM to e-mail the specified user (cowboyjoe@uwyo.edu) for all events specified under the --mail-type directive. | 10 |
| | #SBATCH --mail-type=BEGIN,END,FAIL | Tells SLURM to e-mail the above address at job start (BEGIN), job finish (END), and job failure/error (FAIL). | 11 |
| | #SBATCH --get-user-env | Tells SLURM to get the login environment variables. Environment variables already set by #SBATCH directives take precedence over the login environment; clear any variables you do not want applied to the spawned program before the #SBATCH directives are processed. | 12 |
| OpenMP | | Specifies the OpenMP threads and how they are set up and distributed within the allocated hardware. | 14-16 |
| | export OMP_PROC_BIND=true | Enables thread binding, so each OpenMP thread stays pinned to the place it is assigned. | 14 |
| | export OMP_PLACES=threads | Defines the places threads are bound to as individual hardware threads. | 15 |
| | export OMP_NUM_THREADS=32 | Sets the number of OpenMP threads for the job. The thread count should not exceed the number of cores available on any single node you've requested (see the sketch following this table for one way to tie it to the SBATCH directives). | 16 |
| Module(s) | | Load any required software. | 18-19 |
| | module load miniconda3 | Loads the miniconda3 software for use during the course of the job. | 18 |
| | module load gcc/14.2.0 | Loads the gcc compiler, specifically version 14.2.0, for use during the course of the job. | 19 |
| Slurm Run | | Each srun launches a job step that executes the command that follows it. By default srun is blocking: the next command does not run until the current job step finishes. | 21-25 |
| | srun echo "Start Job Process" | Prints "Start Job Process" to the job output. | 21 |
| | srun hostname | Prints the hostname of the node the job step runs on. | 22 |
| | srun sleep 30 | Pauses execution for 30 seconds. | 23 |
| | srun --cpu-bind=cores check-hybrid.gnu.pm | Runs the executable check-hybrid.gnu.pm with its tasks bound to the allocated cores. | 24 |
| | srun echo "End Job Process" | Prints "End Job Process" to the job output. | 25 |
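The example script sets OMP_NUM_THREADS to a fixed value. A common alternative, shown here only as a sketch and not part of the ARCC example, is to derive the OpenMP settings from the environment variables SLURM exports for the job, so the thread count always matches the --cpus-per-task request:

# inside a batch script, after the #SBATCH directives
export OMP_NUM_THREADS=${SLURM_CPUS_PER_TASK:-1}   # threads per task = CPUs per task (falls back to 1)
export OMP_PROC_BIND=true                          # keep each thread on its assigned place
export OMP_PLACES=threads                          # one place per hardware thread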

 

 

  1. Shebang:
    Tells the system to use the bash interpreter at the path /bin/bash.
    #!/bin/bash
    If we didn’t know the path to the bash interpreter, we could replace this with: #!/usr/bin/env bash

  2. Request your resources with SBATCH directives. SBATCH directives are always preceded by a # sign and written in uppercase (#SBATCH). These directives provide SLURM with the specifics of your job, including the hardware you require, how long you need access to it, where job output will be written, and how to alert the user of job status:

    1. #SBATCH --account=arcc
      This directive tells the system which account/project you're running your job under. In this example, the job runs under the arcc project. If arcc were not a valid project or account, or the submitter were not a member of the project/account, the job would not run. Account is a mandatory directive that must be included in your submission script; jobs that don't specify a valid account will not run.

    2. #SBATCH --qos=debug
      This directive tells SLURM to submit the job to the debug QOS/queue.

    3. #SBATCH --time=0-00:10:00
      This directive sets a limit on the total run time of the job. If the requested time exceeds the partition's time limit, the job is left in a PENDING state. Time should be formatted as days-hours:minutes:seconds. It is mandatory to specify either a time or a qos in your job script; jobs that specify neither will not run. The above example requests a job with a time limit of 10 minutes.

    4. #SBATCH --nodes=1
      This directive tells SLURM the entire job will run on a single node on the cluster.

    5. #SBATCH --ntasks=24
      This directive tells SLURM that we will be running our job across 24 tasks

    6. #SBATCH --cpus-per-task=4
      Tells SLURM we will be running each task on 4 CPUs

    7. #SBATCH --mem-per-cpu=2G
      Tells SLURM to allocate 2GB of RAM per requested CPU, so each task running on 4 CPUs can use up to 8GB.

    8. #SBATCH --mail-user=cowboyjoe@uwyo.edu
      Tells SLURM to e-mail the specified user (cowboyjoe@uwyo.edu) for all events specified under the --mail-type directive.

    9. #SBATCH --mail-type=BEGIN,END,FAIL
      Tells SLURM to e-mail the above address at the following events: job start (BEGIN), job finish (END), and job failure/error (FAIL).

    10. #SBATCH --get-user-env
      Tells SLURM to get the login environment variables. Be aware that any environment variables already set by #SBATCH directives take precedence over the user's login environment. Clear any environment variables you do not want applied to the spawned program before the #SBATCH directives are processed.

  3. Specifies threads and how they are set up and distributed with OpenMP (script lines 14-16):

    1. export OMP_PROC_BIND=true
      Enables thread binding, so each OpenMP thread stays pinned to the place it is assigned rather than migrating between cores.

    2. export OMP_PLACES=threads
      Defines the places threads are bound to as individual hardware threads (OMP_PLACES=cores would bind each thread to a core instead).

    3. export OMP_NUM_THREADS=32
      Sets the number of OpenMP threads for the job. The total number of threads should be less than or equal to the maximum number of cores on any single node of the hardware you've requested in your SBATCH directives (for example via #SBATCH --partition=<partition_name> or #SBATCH --qos=<qos_name>). OpenMP can also print information about thread affinity to the start of the job output file; the optional reporting variables are sketched after this list.

  4. Load any required software with module load; in the example, the miniconda3 module and the gcc/14.2.0 compiler (script lines 18-19).

  5. Launch the work itself as job steps with srun; in the example, a series of srun commands print progress messages, report the node's hostname, sleep for 30 seconds, and run the check-hybrid.gnu.pm executable bound to the allocated cores (script lines 21-25).
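As noted in the OpenMP item above, OpenMP can report how threads were actually placed. A minimal sketch of the optional reporting variables, added alongside the exports on script lines 14-16 (the format string is only an example and can be adjusted):

export OMP_DISPLAY_ENV=true          # print the OpenMP environment settings at program start
export OMP_DISPLAY_AFFINITY=true     # print one affinity line per thread to the job output
export OMP_AFFINITY_FORMAT="Thread Affinity %0.3L %.8n %.15{thread_affinity}%.12H"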

 

What’s the difference between sbatch and srun?

sbatch and srun are both SLURM commands that accept similar parameters, so it's easy to be confused about how each should be used.

The main difference is that srun is interactive and blocking: you get your results in your terminal, and you cannot execute anything else in that terminal until the srun command finishes. sbatch is batch processing and non-blocking: the job is submitted to the queue, its results are written to a file when it is pulled out of the queue and run, and you are able to submit new commands right away.

Another difference between srun and sbatch is that sbatch allows you to run job arrays while srun does not. Additionally, srun can be, and often is, run from within an sbatch script, as in the example script above.
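A short sketch of these differences in practice (the script name and resource values are placeholders):

# interactive and blocking: runs now, output comes back to this terminal
srun --account=arcc --time=00:05:00 --ntasks=1 hostname

# batch and non-blocking: the script is queued and its output is written to a file
sbatch my_batch_script.sh

# job arrays are only available through sbatch; this submits array tasks 1-5
sbatch --array=1-5 my_batch_script.sh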