Goal: Introduce Slurm and show how to start interactive sessions, submit jobs, and monitor them.

...

Interactive Session: salloc

Info

You’re there doing the work.

Suitable for developing and testing over a few hours.

Code Block
[]$ salloc --help
[]$ man salloc
# Lots of options.

# The bare minimum.
# This will provide the defaults of one node, one core and 1G of memory.
[]$ salloc -A <project-name> -t <wall-time>
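
If the defaults are not enough, you can request resources explicitly. A sketch (the values are placeholders, adjust to what your work actually needs):

Code Block
# Ask for four cores and 8G of memory on a single node for two hours.
[]$ salloc -A <project-name> -t 2:00:00 --nodes=1 --cpus-per-task=4 --mem=8G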

...

Interactive Session: salloc: workshop

Info

You’ll only use a reservation for this (and/or another) workshop.

Once you have your own account you typically do not need one.

However, there are use cases where we can create a specific reservation for you.

A reservation might itself require a partition to be specified, for example if you’re using a GPU node (more about that later).

Code Block
# CPU only compute node.
[]$ salloc -A <project-name> -t 1:00 --reservation=<reservation-name>

# GPU partition/compute node.
[]$ salloc -A <project-name> -t 1:00 --reservation=<reservation-name> --partition=<partition-name>
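
If you want to confirm that a reservation exists and which partitions are available, the standard Slurm query commands can help (the names below are placeholders):

Code Block
# Show details of a reservation (active window, nodes, allowed users).
[]$ scontrol show reservation <reservation-name>
# List the partitions on the cluster.
[]$ sinfo --summarize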

...

Info

Closing the session will also release the job.
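
For example (a sketch, the prompt names are only illustrative):

Code Block
# From inside the interactive session:
[<compute-node>]$ exit
# Back on the login node, the allocation is gone:
[]$ squeue -u <username>
             JOBID PARTITION     NAME     USER ST       TIME  NODES NODELIST(REASON)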

...

Exercise: salloc: Give It A Go

From a login node, create some interactive sessions using salloc.

Try different wall times:

  • Short times to experience an automatic timeout.

  • Longer times so you can call squeue and see your job in the queue.

Notice how the command-line prompt changes.
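
A possible sequence (the project name and times are placeholders):

Code Block
# Two minutes is short enough to watch the session time out.
[]$ salloc -A <project-name> -t 2:00
# Note how the prompt changes once you are on the compute node.
[<compute-node>]$ squeue -u <username>
# When the wall time expires, Slurm revokes the allocation and you are returned to the login node.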

...

Submit Jobs: sbatch

Info
  • You submit a job to the queue and walk away.

  • Monitor its progress/state using the command line and/or email notifications.

  • Once complete, come back and analyze the results.

...

Submit Jobs: sbatch: Example

...

Code Block
#!/bin/bash                               
# Shebang indicating this is a bash script.
# Do NOT put a comment after the shebang; this will cause an error.
#SBATCH --account=<project-name>          # Use #SBATCH to define Slurm-related values.
#SBATCH --time=10:00                      # Must define an account and wall-time.
#SBATCH --reservation=<reservation-name>
echo "SLURM_JOB_ID:" $SLURM_JOB_ID        # Can access Slurm-related environment variables.
start=$(date +'%D %T')                    # Can call bash commands.
echo "Start:" $start
module purge
module load gcc/13.2.0 python/3.10.6      # Load the modules you require for your environment.
python python01.py                        # Call your scripts/commands.
sleep 1m
end=$(date +'%D %T')
echo "End:" $end

...

Code Block
[]$ sbatch run.sh
Submitted batch job 13526340

[]$ squeue -u <username>
             JOBID PARTITION     NAME      USER  ST       TIME  NODES NODELIST(REASON)
          13526340     moran   run.sh <username>  R       0:05      1 m233
[]$ ls
python01.py  run.sh  slurm-13526340.out

[]$ cat slurm-13526340.out
SLURM_JOB_ID: 13526340
Start: 03/22/24 09:38:36
Python version: 3.10.6 (main, Oct 17 2022, 16:47:32) [GCC 12.2.0]
Version info: sys.version_info(major=3, minor=10, micro=6, releaselevel='final', serial=0)

[]$ squeue -u <username>
             JOBID PARTITION     NAME       USER ST       TIME  NODES NODELIST(REASON)
          13526340     moran   run.sh <username>  R       0:17      1 m233

...

Submit Jobs: squeue: What’s happening? Continued

Info

The squeue command only shows pending and running jobs.

If a job is no longer in the queue then it has finished (completed, failed, been cancelled, or timed out).

Code Block
[]$ squeue -u <username>
             JOBID PARTITION     NAME       USER ST       TIME  NODES NODELIST(REASON)
          13526340     moran   run.sh <username>  R       0:29      1 m233

[]$ squeue -u <username>
             JOBID PARTITION     NAME     USER ST       TIME  NODES NODELIST(REASON)

[]$ cat slurm-13526340.out
SLURM_JOB_ID: 13526340
Start: 03/22/24 09:38:36
Python version: 3.10.6 (main, Oct 17 2022, 16:47:32) [GCC 12.2.0]
Version info: sys.version_info(major=3, minor=10, micro=6, releaselevel='final', serial=0)
End: 03/22/24 09:39:36
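
To inspect a job after it has left the queue, sacct queries the accounting records (using the job id from this example):

Code Block
# State should show COMPLETED once the job has finished normally.
[]$ sacct -j 13526340 --format=JobID,JobName,State,Elapsed,ExitCode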

...

Code Block
[]$ sbatch run.sh
Submitted batch job 13526341
[]$ squeue -u <username>
             JOBID PARTITION     NAME       USER ST       TIME  NODES NODELIST(REASON)
          13526341     moran   run.sh <username>  R       0:03      1 m233

[]$ scancel 13526341

[]$ squeue -u <username>
             JOBID PARTITION     NAME     USER ST       TIME  NODES NODELIST(REASON)

[]$ cat slurm-13526341.out
SLURM_JOB_ID: 13526341
Start: 03/22/24 09:40:09
Python version: 3.10.6 (main, Oct 17 2022, 16:47:32) [GCC 12.2.0]
Version info: sys.version_info(major=3, minor=10, micro=6, releaselevel='final', serial=0)
slurmstepd: error: *** JOB 13526341 ON m233 CANCELLED AT 2024-03-22T09:40:17 ***
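
scancel can also target jobs by user or by name rather than by id, for example:

Code Block
# Cancel all of your own pending/running jobs.
[]$ scancel -u <username>
# Cancel jobs by job name.
[]$ scancel --name=<job-name>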

...

Code Block
#!/bin/bash                               
#SBATCH --account=<project-name>
#SBATCH --time=10:00
#SBATCH --reservation=<reservation-name>

#SBATCH --job-name=pytest                 # Name shown in the squeue NAME column.
#SBATCH --nodes=1                         # Request a single node.
#SBATCH --cpus-per-task=1                 # Request a single core.
#SBATCH --mail-type=ALL                   # Email notifications for all job state changes.
#SBATCH --mail-user=<email-address>
#SBATCH --output=slurms/pyresults_%A.out  # Write output to slurms/pyresults_<job-id>.out (%A is the job id).

echo "SLURM_JOB_ID:" $SLURM_JOB_ID        # Can access Slurm-related environment variables.
start=$(date +'%D %T')                    # Can call bash commands.
echo "Start:" $start
module purge
module load gcc/13.2.0 python/3.10.6      # Load the modules you require for your environment.
python python01.py                        # Call your scripts/commands.
sleep 1m
end=$(date +'%D %T')
echo "End:" $end

...

Expand
Example Flow and Output:
Code Block
# Submit the job:
[intro_to_modules]$ sbatch run.sh
Submitted batch job 1817260

# Notice the NAME is now 'pytest'
[intro_to_modules]$ squeue -u <username>
             JOBID PARTITION     NAME       USER ST       TIME  NODES NODELIST(REASON)
           1817260        mb   pytest <username>  R       0:58      1 mbcpu-002

# I can view the output while the job is running.
# The output is now in the subfolder slurms/
# It also uses the name 'pyresults_<job_id>.out'
[intro_to_modules]$ cat slurms/pyresults_1817260.out
SLURM_JOB_ID: 1817260
Start: 08/14/24 14:48:38
Python version: 3.10.6 (main, Apr 30 2024, 11:23:04) [GCC 13.2.0]
Version info: sys.version_info(major=3, minor=10, micro=6, releaselevel='final', serial=0)

[intro_to_modules]$ squeue -u <username>
             JOBID PARTITION     NAME     USER ST       TIME  NODES NODELIST(REASON)

[intro_to_modules]$ cat slurms/pyresults_1817260.out
SLURM_JOB_ID: 1817260
Start: 08/14/24 14:48:38
Python version: 3.10.6 (main, Apr 30 2024, 11:23:04) [GCC 13.2.0]
Version info: sys.version_info(major=3, minor=10, micro=6, releaselevel='final', serial=0)
End: 08/14/24 14:49:38
Info

In my inbox, I also received two emails with the subjects:

  1. medicinebow Slurm Job_id=1817260 Name=pytest Began, Queued time 00:00:00

     • This email has no text in the body.

  2. medicinebow Slurm Job_id=1817260 Name=pytest Ended, Run time 00:01:01, COMPLETED, ExitCode 0

     • The body of this email contained the seff results.
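
You can also run seff yourself once a job has ended to see the same efficiency summary (CPU and memory usage versus what was requested):

Code Block
[]$ seff 1817260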

...

Exercise: sbatch: Give It A Go

Using the script examples (adjust where appropriate), try submitting some jobs.

  • Once submitted, monitor the jobs (from within a different session) using the squeue command.

  • Track the job ids, and try changing the job name to distinguish them when viewing the pending/running jobs.

  • Cancel some of the jobs.

  • Maybe try increasing the sleep value to be longer than the requested wall time to trigger a timeout.

  • Once they’ve completed, run sacct to view the finished jobs and look at their state (a starting point is sketched below).
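
A starting point (by default sacct lists your jobs from the start of the current day):

Code Block
[]$ sacct -u <username> --format=JobID,JobName,State,Elapsed,ExitCode
# Look for states such as COMPLETED, CANCELLED or TIMEOUT.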

...

...