Goal: Introduction to Slurm: how to start interactive sessions, submit jobs, and monitor them.

...

Info
  • You submit a job to the queue and walk away.

  • Monitor its progress/state using command-line and/or email notifications.

  • Once complete, come back and analyze results.
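
While you monitor a job, squeue reports its state as a short code in the ST column. A minimal sketch of a lookup for some common codes follows. The function name describe_job_state is hypothetical, and only a subset of Slurm's state codes is covered.

```shell
#!/bin/bash
# Hypothetical helper: translate a squeue ST code into a readable description.
# Only a subset of Slurm's job state codes is covered here.
describe_job_state() {
    case "$1" in
        PD) echo "pending: waiting in the queue for resources" ;;
        R)  echo "running" ;;
        CG) echo "completing: finishing up" ;;
        CD) echo "completed: finished with exit code 0" ;;
        F)  echo "failed: finished with a non-zero exit code" ;;
        TO) echo "timeout: reached its wall-time limit" ;;
        CA) echo "cancelled" ;;
        *)  echo "unknown state code: $1" ;;
    esac
}

describe_job_state R     # prints "running"
describe_job_state PD
```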

...

Submit Jobs: sbatch: Template

Code Block
#!/bin/bash
# Shebang indicating this is a bash script.
# Do NOT put a comment on the shebang line itself; doing so will cause an error.
#SBATCH --account=arccanetrain            # Use #SBATCH to define Slurm related values.
#SBATCH --time=10:00                      # Must define an account and wall-time.
#SBATCH --reservation=<reservation-name>
echo "SLURM_JOB_ID:" $SLURM_JOB_ID        # Can access Slurm related Environment variables.
start=$(date +'%D %T')                    # Can call bash commands.
echo "Start:" $start
module load gcc/13.2.0 python/3.10.6      # Load the modules you require for your environment.
python python01.py                        # Call your scripts/commands.
sleep 1m
end=$(date +'%D %T')
echo "End:" $end
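
The --time=10:00 value in the template is read as minutes:seconds, so it requests ten minutes of wall-time. As a rough illustration of how Slurm's time formats map to durations, here is a minimal sketch that converts a --time value to seconds. The function name slurm_time_to_seconds is hypothetical, and it assumes one of the forms minutes, minutes:seconds, hours:minutes:seconds, or days-hours[:minutes[:seconds]].

```shell
#!/bin/bash
# Hypothetical helper: convert a Slurm --time value to seconds.
slurm_time_to_seconds() {
    local spec="$1" days=0 h=0 m=0 s=0
    local -a parts
    if [[ "$spec" == *-* ]]; then
        # days-hours[:minutes[:seconds]]
        days="${spec%%-*}"
        IFS=: read -r -a parts <<< "${spec#*-}"
        h="${parts[0]:-0}"; m="${parts[1]:-0}"; s="${parts[2]:-0}"
    else
        IFS=: read -r -a parts <<< "$spec"
        case "${#parts[@]}" in
            1) m="${parts[0]}" ;;                                   # minutes
            2) m="${parts[0]}"; s="${parts[1]}" ;;                  # minutes:seconds
            3) h="${parts[0]}"; m="${parts[1]}"; s="${parts[2]}" ;; # hours:minutes:seconds
            *) echo "unrecognized time format: $spec" >&2; return 1 ;;
        esac
    fi
    # 10# forces base 10 so values such as "08" are not read as octal.
    echo $(( 10#$days * 86400 + 10#$h * 3600 + 10#$m * 60 + 10#$s ))
}

slurm_time_to_seconds 10:00      # prints 600 (ten minutes, as in the template)
slurm_time_to_seconds 72:00:00   # prints 259200 (three days)
```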

...

Code Block
[]$ sbatch --help
#SBATCH --account=arccanetrain          # Required: account/time
#SBATCH --time=72:00:00
#SBATCH --job-name=workshop             # Job name: helps identify the job in squeue output.
#SBATCH --nodes=1                       # Options will typically have defaults.
#SBATCH --tasks-per-node=1              # Request resources according to how you want
#SBATCH --cpus-per-task=1               # to parallelize your job, the type of hardware partition,
#SBATCH --partition=mb                  # and whether you require a GPU.
#SBATCH --gres=gpu:1
#SBATCH --mem=100G                      # Request specific memory needs: use either --mem (per node)
#SBATCH --mem-per-cpu=10G               # or --mem-per-cpu, not both (they are mutually exclusive).
#SBATCH --mail-type=ALL                 # Get email notifications of the state of the job.
#SBATCH --mail-user=<email-address>
#SBATCH --output=<prefix>_%A.out        # Define a named output file suffixed with the job ID.
Info
  • Both salloc and sbatch have dozens of options, in both short and long form.

  • Some options overlap in functionality; for example, -G 1 requests one GPU much as --gres=gpu:1 does.

  • Please consult each command's --help output, its man page, and/or web links to discover further options not listed here.
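
To illustrate the short and long forms, the sketch below repeats several of the options listed above using their short equivalents. The values are illustrative, reusing names from the examples on this page, and the output file name workshop_%A.out is an assumption. The final lines are only there so the script does something when run.

```shell
#!/bin/bash
#SBATCH -A arccanetrain          # Same as --account=arccanetrain
#SBATCH -t 10:00                 # Same as --time=10:00
#SBATCH -J workshop              # Same as --job-name=workshop
#SBATCH -N 1                     # Same as --nodes=1
#SBATCH -c 1                     # Same as --cpus-per-task=1
#SBATCH -o workshop_%A.out       # Same as --output=workshop_%A.out

# SLURM_JOB_NAME is set by Slurm inside a job; the fallback text is used elsewhere.
job_name="${SLURM_JOB_NAME:-not running under Slurm}"
echo "Job name: $job_name"
```

Not every option has a short form: --mail-type and --mem, for example, are only available in long form.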

...

Submit Jobs: sbatch: Options: Apply to Example

Info

Let’s take the previous example, and add some of the additional options:

Code Block
#!/bin/bash                               
#SBATCH --account=arccanetrain
#SBATCH --time=10:00
#SBATCH --reservation=<reservation-name>

#SBATCH --job-name=pytest
#SBATCH --nodes=1                       
#SBATCH --cpus-per-task=1               
#SBATCH --mail-type=ALL                 
#SBATCH --mail-user=<email-address>
#SBATCH --output=slurms/pyresults_%A.out      

echo "SLURM_JOB_ID:" $SLURM_JOB_ID        # Can access Slurm related Environment variables.
start=$(date +'%D %T')                    # Can call bash commands.
echo "Start:" $start
module load gcc/13.2.0 python/3.10.6      # Load the modules you require for your environment.
python python01.py                        # Call your scripts/commands.
sleep 1m
end=$(date +'%D %T')
echo "End:" $end
Info

Notice:

  • I’ve given the job a specific name and have requested email notifications.

  • The output is written to a subfolder slurms/ with a name of the form pyresults_<jobid>.out
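
Slurm does not normally create a missing directory for --output, so the slurms/ folder needs to exist before the job starts. A minimal sketch of a pre-submission check follows. The function name ensure_output_dir is hypothetical, and it assumes the script uses the long form #SBATCH --output=<path>.

```shell
#!/bin/bash
# Hypothetical helper: create the directory named in a script's
# "#SBATCH --output=<path>" line, if there is one.
ensure_output_dir() {
    local script="$1" path dir
    # Take the first --output directive, dropping any trailing comment or whitespace.
    path=$(sed -n 's/^#SBATCH[[:space:]]*--output=\([^[:space:]]*\).*/\1/p' "$script" | head -n 1)
    [ -n "$path" ] || return 0          # No --output line: nothing to do.
    dir=$(dirname "$path")
    mkdir -p "$dir"
    echo "$dir"                         # Report which directory was ensured.
}
```

It could be used just before submitting, for example: ensure_output_dir run.sh && sbatch run.sh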

...

Extended Example: What Does the Run Look Like?

Info

With the above settings, a submission will look something like the following:

Expand
Example Flow and Output:
Code Block
# Submit the job:
[salexan5@mblog1 intro_to_modules]$ sbatch run.sh
Submitted batch job 1817260

# Notice the NAME is now 'pytest'
[salexan5@mblog1 intro_to_modules]$ squeue -u salexan5
             JOBID PARTITION     NAME     USER ST       TIME  NODES NODELIST(REASON)
           1817260        mb   pytest salexan5  R       0:58      1 mbcpu-002

# I can view the output while the job is running.
# The output is now in the sub folder slurms/
# and uses the name 'pyresults_<job_id>.out'
[salexan5@mblog1 intro_to_modules]$ cat slurms/pyresults_1817260.out
SLURM_JOB_ID: 1817260
Start: 08/14/24 14:48:38
Python version: 3.10.6 (main, Apr 30 2024, 11:23:04) [GCC 13.2.0]
Version info: sys.version_info(major=3, minor=10, micro=6, releaselevel='final', serial=0)

# Once the job has finished, it no longer appears in the queue.
[salexan5@mblog1 intro_to_modules]$ squeue -u salexan5
             JOBID PARTITION     NAME     USER ST       TIME  NODES NODELIST(REASON)

[salexan5@mblog1 intro_to_modules]$ cat slurms/pyresults_1817260.out
SLURM_JOB_ID: 1817260
Start: 08/14/24 14:48:38
Python version: 3.10.6 (main, Apr 30 2024, 11:23:04) [GCC 13.2.0]
Version info: sys.version_info(major=3, minor=10, micro=6, releaselevel='final', serial=0)
End: 08/14/24 14:49:38
Info

In my inbox, I also received two emails with the subjects:

  1. medicinebow Slurm Job_id=1817260 Name=pytest Began, Queued time 00:00:00

  2. medicinebow Slurm Job_id=1817260 Name=pytest Ended, Run time 00:01:01, COMPLETED, ExitCode 0

    1. The body of this email contained the seff results.
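
The job ID printed by sbatch is what ties together the squeue listing, the output file name, and the email subjects. A minimal sketch for capturing it in a script follows. The function name job_id_from_message is hypothetical. (sbatch also has a --parsable option that prints the ID without the surrounding text.)

```shell
#!/bin/bash
# Hypothetical helper: pull the numeric job ID out of sbatch's
# "Submitted batch job <id>" message.
job_id_from_message() {
    local message="$1"
    if [[ "$message" =~ ^Submitted\ batch\ job\ ([0-9]+) ]]; then
        echo "${BASH_REMATCH[1]}"
    else
        echo "could not find a job ID in: $message" >&2
        return 1
    fi
}

# The message is copied from the example run above.
job_id=$(job_id_from_message "Submitted batch job 1817260")
echo "Output file: slurms/pyresults_${job_id}.out"
```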

...