Goal: Introduction to Slurm and how to start interactive sessions, submit jobs, and monitor them.

...

Info
  • You’ll only use the reservation for this workshop (and/or other workshops).

  • Once you have an account you typically do not need a reservation (see the sketch after the code block below).

  • But there are use cases where we can create a specific reservation for you.

Code Block
[]$ salloc -A <project-name> -t 1:00 --reservation=<reservation-name>
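Outside of a workshop (i.e. once you have your own project and no reservation), you simply drop the --reservation option. A minimal sketch, with placeholder values:

Code Block
# Request a one-hour interactive session against your own project (no reservation):
[]$ salloc -A <project-name> -t 1:00:00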

...

Interactive Session: squeue: What’s happening?

Code Block
[]$ salloc -A <project-name> -t 1:00 --reservation=<reservation-name>
salloc: Granted job allocation 13526337
salloc: Nodes m233 are ready for job
# Make a note of the job id.

# Notice the server/node name has changed.
[arcc-t05@m233 intro_to_hpc]$ squeue -u arcc-t05
             JOBID PARTITION     NAME     USER ST       TIME  NODES NODELIST(REASON)
          13526337     moran interact arcc-t05  R       0:19      1 m233
# For an interactive session: Name = interact
# You have the command-line interactively available to you.
[]$ 
...
[]$ squeue -u arcc-t05
             JOBID PARTITION     NAME     USER ST       TIME  NODES NODELIST(REASON)
          13526337     moran interact arcc-t05  R       1:03      1 m233
# Session will automatically time out
[]$ salloc: Job 13526337 has exceeded its time limit and its allocation has been revoked.
slurmstepd: error: *** STEP 13526337.interactive ON m233 CANCELLED AT 2024-03-22T09:36:53 DUE TO TIME LIMIT ***
exit
srun: Job step aborted: Waiting up to 32 seconds for job step to finish.
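Note that -t 1:00 is interpreted by Slurm as minutes:seconds, i.e. a one-minute wall time, which is why the session above expires almost immediately. The --time option accepts several formats; a sketch with illustrative values (request what your work actually needs):

Code Block
# Some accepted --time formats (illustrative values):
[]$ salloc -A <project-name> -t 30          # 30 minutes
[]$ salloc -A <project-name> -t 2:00:00     # 2 hours
[]$ salloc -A <project-name> -t 1-12:00:00  # 1 day, 12 hours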

...

Interactive Session: salloc: Finished Early?

Code Block
[]$ salloc -A <project-name> -t 1:00 --reservation=<reservation-name>
salloc: Granted job allocation 13526338
salloc: Nodes m233 are ready for job
[arcc-t05@m233 ...]$ Do stuff…
[]$ exit
exit
salloc: Relinquishing job allocation 13526338
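Typing exit on the compute node returns you to the login node and releases the allocation. If you want to double-check, a quick sketch:

Code Block
# Back on the login node, confirm the allocation is gone:
[]$ squeue -u <username>
# (no rows are listed once the allocation has been relinquished)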

...

Info
  • You submit a job to the queue and walk away.

  • Monitor its progress/state using the command line and/or email notifications.

  • Once complete, come back and analyze the results (see the sketch below).
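In command form, the cycle looks something like the following sketch (each step is covered in detail in the sections below):

Code Block
[]$ sbatch run.sh            # Submit the job and make a note of the job id.
[]$ squeue -u <username>     # Check its state now and again (or wait for the emails).
[]$ cat slurm-<job-id>.out   # Once it has finished, inspect the output.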

...

Submit Jobs: sbatch

...

Submit Jobs: sbatch: Example

Info

The following is an example script that we will use to submit a job to the cluster.

It uses a short test python file defined here: python script.

Code Block
#!/bin/bash                               
# Shebang indicating this is a bash script.
# Do NOT put a comment after the shebang; this will cause an error.
#SBATCH --account=<project-name>          # Use #SBATCH to define Slurm related values.
#SBATCH --time=10:00                      # Must define an account and wall-time.
#SBATCH --reservation=<reservation-name>
echo "SLURM_JOB_ID:" $SLURM_JOB_ID        # Can access Slurm related Environment variables.
start=$(date +'%D %T')                    # Can call bash commands.
echo "Start:" $start
module purge
module load gcc/13.2.0 python/3.10.6      # Load the modules you require for your environment.
python python01.py                        # Call your scripts/commands.
sleep 1m
end=$(date +'%D %T')
echo "End:" $end

...

Code Block
[]$ sbatch run.sh
Submitted batch job 13526340
[]$ squeue -u <username>
             JOBID PARTITION     NAME       USER ST       TIME  NODES NODELIST(REASON)
          13526340     moran   run.sh <username>  R       0:05      1 m233
[]$ ls
python01.py  run.sh  slurm-13526340.out

[]$ cat slurm-13526340.out
SLURM_JOB_ID: 13526340
Start: 03/22/24 09:38:36
Python version: 3.10.6 (main, Oct 17 2022, 16:47:32) [GCC 12.2.0]
Version info: sys.version_info(major=3, minor=10, micro=6, releaselevel='final', serial=0)

[]$ squeue -u <username>
             JOBID PARTITION     NAME       USER ST       TIME  NODES NODELIST(REASON)
          13526340     moran   run.sh <username>  R       0:17      1 m233
Info
  • By default, an output file of the form slurm-<job-id>.out will be generated.

  • You can view this file while the job is still running (see the sketch below). Only view it, do not edit it.
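A sketch of following the output live rather than re-running cat (standard tail, nothing Slurm-specific):

Code Block
# Follow the output file as the job writes to it (Ctrl-C to stop watching):
[]$ tail -f slurm-<job-id>.out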

...

Submit Jobs: squeue: What’s happening? Continued

...

Code Block
[]$ squeue -u <username>
             JOBID PARTITION     NAME       USER ST       TIME  NODES NODELIST(REASON)
          13526340     moran   run.sh <username>  R       0:29      1 m233
[]$ squeue -u <username>
             JOBID PARTITION     NAME     USER ST       TIME  NODES NODELIST(REASON)

[]$ cat slurm-13526340.out
SLURM_JOB_ID: 13526340
Start: 03/22/24 09:38:36
Python version: 3.10.6 (main, Oct 17 2022, 16:47:32) [GCC 12.2.0]
Version info: sys.version_info(major=3, minor=10, micro=6, releaselevel='final', serial=0)
End: 03/22/24 09:39:36

...

Code Block
# Lots more information
[]$ squeue --help
[]$ man squeue

# Display more columns:
# For example how much time is left of your requested wall time: TimeLeft
[]$ squeue -u <username> --Format="Account,UserName,JobID,SubmitTime,StartTime,TimeLeft"
ACCOUNT             USER                JOBID               SUBMIT_TIME         START_TIME          TIME_LEFT
<project-name>      <username>          1795458             2024-08-14T10:31:07 2024-08-14T10:31:09 6-04:42:51
<project-name>      <username>          1795453             2024-08-14T10:31:06 2024-08-14T10:31:07 6-04:42:49
<project-name>      <username>          1795454             2024-08-14T10:31:06 2024-08-14T10:31:07 6-04:42:49
...
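A couple of other squeue forms that can be handy; a sketch (job ids are placeholders):

Code Block
# Show a specific job rather than everything you own:
[]$ squeue -j <job-id>
# For pending jobs, report the expected start time:
[]$ squeue -u <username> --start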

...

Code Block
[]$ sbatch run.sh
Submitted batch job 13526341
[]$ squeue -u <username>
             JOBID PARTITION     NAME       USER ST       TIME  NODES NODELIST(REASON)
          13526341     moran   run.sh <username>  R       0:03      1 m233
[]$ scancel 13526341
[]$ squeue -u <username>
             JOBID PARTITION     NAME     USER ST       TIME  NODES NODELIST(REASON)

[]$ cat slurm-13526341.out
SLURM_JOB_ID: 13526341
Start: 03/22/24 09:40:09
Python version: 3.10.6 (main, Oct 17 2022, 16:47:32) [GCC 12.2.0]
Version info: sys.version_info(major=3, minor=10, micro=6, releaselevel='final', serial=0)
slurmstepd: error: *** JOB 13526341 ON m233 CANCELLED AT 2024-03-22T09:40:17 ***
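scancel can target more than a single job id; a sketch of a couple of other common forms (placeholders as before):

Code Block
# Cancel all of your own jobs:
[]$ scancel -u <username>
# Cancel jobs by name (e.g. the --job-name set in the script):
[]$ scancel --name=<job-name>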

...

See the main Slurm sacct page for full documentation.

Code Block
[]$ sacct -u <username> -X
JobID           JobName  Partition    Account  AllocCPUS      State ExitCode
------------ ---------- ---------- ---------- ---------- ---------- --------
13526337     interacti+      moran arccanetr+          1    TIMEOUT      0:0
13526338     interacti+      moran arccanetr+          1  COMPLETED      0:0
13526340         run.sh      moran arccanetr+          1  COMPLETED      0:0
13526341         run.sh      moran arccanetr+          1 CANCELLED+      0:0

# Lots more information
[]$ sacct --help
[]$ man sacct

# Display more columns:
[]$ sacct -u <username> --format="JobID,Partition,nnodes,NodeList,NCPUS,ReqMem,State,Start,Elapsed" -X
JobID         Partition   NNodes        NodeList      NCPUS     ReqMem      State               Start    Elapsed
------------ ---------- -------- --------------- ---------- ---------- ---------- ------------------- ----------
13526337          moran        1            m233          1      1000M    TIMEOUT 2024-03-22T09:35:25   00:01:28
13526338          moran        1            m233          1      1000M  COMPLETED 2024-03-22T09:37:41   00:00:06
13526340          moran        1            m233          1      1000M  COMPLETED 2024-03-22T09:38:35   00:01:01
13526341          moran        1            m233          1      1000M CANCELLED+ 2024-03-22T09:40:08   00:00:09
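By default sacct only reports jobs since midnight of the current day; you can widen the window with a start time. Once a job has finished, seff (whose results also appear in the email notifications below) gives a quick efficiency summary. A sketch, with placeholder values:

Code Block
# Jobs since a given date:
[]$ sacct -u <username> -X -S 2024-03-01
# CPU/memory efficiency summary for a completed job:
[]$ seff <job-id>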

...

Code Block
[]$ sbatch --help
#SBATCH --account=<project-name>        # Required: account/time
#SBATCH --time=72:00:00
#SBATCH --job-name=workshop             # Job name: Help to identify when using squeue.
#SBATCH --nodes=1                       # Options will typically have defaults.
#SBATCH --tasks-per-node=1              # Request resources according to how you want to
#SBATCH --cpus-per-task=1               # parallelize your job, the type of hardware/partition,
#SBATCH --partition=mb                  # and whether you require a GPU.
#SBATCH --gres=gpu:1
#SBATCH --mem=100G                      # Request specific memory needs.
#SBATCH --mem-per-cpu=10G
#SBATCH --mail-type=ALL                 # Get email notifications of the state of the job.
#SBATCH --mail-user=<email-address>
#SBATCH --output=<prefix>_%A.out        # Define a named output file postfixed with the job id.
Info
  • Both salloc and sbatch have dozens of options, in both short and long forms (see the sketch below).

  • Some options overlap in functionality; for example, -G 1 requests a GPU much like --gres=gpu:1.

  • Please consult the commands’ --help output and man pages, and/or the web links, to discover further options not listed here.
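As a sketch of the short/long equivalence, the directives above could also be written with their short forms (where they exist):

Code Block
#SBATCH -A <project-name>      # --account
#SBATCH -t 72:00:00            # --time
#SBATCH -J workshop            # --job-name
#SBATCH -N 1                   # --nodes
#SBATCH -c 1                   # --cpus-per-task
#SBATCH -p mb                  # --partition
#SBATCH -o <prefix>_%A.out     # --output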

...

Submit Jobs: sbatch: Options: Applied to Example

Info

Let’s take the previous example, and add some of the additional options:

Code Block
#!/bin/bash                               
#SBATCH --account=<project-name>
#SBATCH --time=10:00
#SBATCH --reservation=<reservation-name>

#SBATCH --job-name=pytest
#SBATCH --nodes=1                       
#SBATCH --cpus-per-task=1               
#SBATCH --mail-type=ALL                 
#SBATCH --mail-user=<email-address>
#SBATCH --output=slurms/pyresults_%A.out      

echo "SLURM_JOB_ID:" $SLURM_JOB_ID        # Can access Slurm related Environment variables.
start=$(date +'%D %T')                    # Can call bash commands.
echo "Start:" $start
module purge
module load gcc/13.2.0 python/3.10.6      # Load the modules you require for your environment.
python python01.py                        # Call your scripts/commands.
sleep 1m
end=$(date +'%D %T')
echo "End:" $end
Info

Notice:

  • I’ve given the job a specific name and have requested email notifications.

  • The output is written to a sub-folder slurms/ with a name of the form pyresults_<job_id>.out (see the note below).
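One caveat worth flagging: as far as I’m aware, Slurm will not create the slurms/ folder for you, so if it doesn’t already exist the output file can’t be written. A sketch of creating it before submitting:

Code Block
# Create the output folder (if it doesn't already exist) before submitting:
[]$ mkdir -p slurms
[]$ sbatch run.sh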

...

Extended Example: What Does the Run look Like?

Info

With the above settings, a submission will look something like the following:

Expand
titleExample Flow and Output:
Code Block
# Submit the job:
[intro_to_modules]$ sbatch run.sh
Submitted batch job 1817260

# Notice the NAME is now 'pytest'
[intro_to_modules]$ squeue -u <username>
             JOBID PARTITION     NAME       USER ST       TIME  NODES NODELIST(REASON)
           1817260        mb   pytest <username>  R       0:58      1 mbcpu-002

# I can view the output while the job is running.
# The output is now in the sub-folder slurms/
# and uses the name 'pyresults_<job_id>.out'.
[intro_to_modules]$ cat slurms/pyresults_1817260.out
SLURM_JOB_ID: 1817260
Start: 08/14/24 14:48:38
Python version: 3.10.6 (main, Apr 30 2024, 11:23:04) [GCC 13.2.0]
Version info: sys.version_info(major=3, minor=10, micro=6, releaselevel='final', serial=0)

[intro_to_modules]$ squeue -u <username>
             JOBID PARTITION     NAME     USER ST       TIME  NODES NODELIST(REASON)

[intro_to_modules]$ cat slurms/pyresults_1817260.out
SLURM_JOB_ID: 1817260
Start: 08/14/24 14:48:38
Python version: 3.10.6 (main, Apr 30 2024, 11:23:04) [GCC 13.2.0]
Version info: sys.version_info(major=3, minor=10, micro=6, releaselevel='final', serial=0)
End: 08/14/24 14:49:38
Info

In my inbox, I also received two emails with the subjects:

  1. medicinebow Slurm Job_id=1817260 Name=pytest Began, Queued time 00:00:00

    1. This will have no text within the email body.

  2. medicinebow Slurm Job_id=1817260 Name=pytest Ended, Run time 00:01:01, COMPLETED, ExitCode 0

    1. The body of this email contained the seff results.
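If you’d rather not receive every notification, --mail-type also accepts specific events; a sketch of some common choices:

Code Block
#SBATCH --mail-type=END,FAIL            # Only email when the job ends or fails.
#SBATCH --mail-type=TIME_LIMIT_80       # Email when 80% of the wall time has been used.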

...

...