Goal: An introduction to Slurm and how to start interactive sessions, submit jobs, and monitor them.

...

Info
  • You’ll only use the reservation for this (and/or another) workshop.

  • Once you have an account, you typically do not need a reservation.

  • But there are use cases where we can create a specific reservation for you.

Code Block
[]$ salloc -A <project-name> -t 1:00 --reservation=<reservation-name>
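
Info

Outside of a workshop, once you have access to your own project, you would make the same request against that project and simply drop the --reservation option. A minimal sketch (the account name and wall time below are placeholders):

Code Block
# Hypothetical everyday request: one hour, charged to your own project, no reservation.
[]$ salloc -A <project-name> -t 1:00:00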

...

Interactive Session: squeue: What’s happening?

Code Block
[]$ salloc -A <project-name> -t 1:00 --reservation=<reservation-name>
salloc: Granted job allocation 13526337
salloc: Nodes m233 are ready for job
# Make a note of the job id.

# Notice the server/node name has changed.
[arcc-t05@m233 intro_to_hpc]$ squeue -u arcc-t05
             JOBID PARTITION     NAME     USER ST       TIME  NODES NODELIST(REASON)
          13526337     moran interact arcc-t05  R       0:19      1 m233
# For an interactive session: Name = interact
# You have the command-line interactively available to you.
[]$ 
...
[]$ squeue -u arcc-t05
             JOBID PARTITION     NAME     USER ST       TIME  NODES NODELIST(REASON)
          13526337     moran interact arcc-t05  R       1:03      1 m233
# Session will automatically time out
[]$ salloc: Job 13526337 has exceeded its time limit and its allocation has been revoked.
slurmstepd: error: *** STEP 13526337.interactive ON m233 CANCELLED AT 2024-03-22T09:36:53 DUE TO TIME LIMIT ***
exit
srun: Job step aborted: Waiting up to 32 seconds for job step to finish.
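
Info

The -t 1:00 requested above is interpreted as one minute, which is why the session is revoked so quickly. The --time option accepts several formats, described in the salloc/sbatch man pages; a few illustrative forms (the values are placeholders):

Code Block
# Accepted formats include: minutes, minutes:seconds, hours:minutes:seconds,
# days-hours, days-hours:minutes and days-hours:minutes:seconds.
[]$ salloc -A <project-name> -t 10            # 10 minutes
[]$ salloc -A <project-name> -t 1:00:00       # 1 hour
[]$ salloc -A <project-name> -t 2-00:00:00    # 2 days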

...

Interactive Session: salloc: Finished Early?

Code Block
[]$ salloc -A <project-name> -t 1:00 --reservation=<reservation-name>
salloc: Granted job allocation 13526338
salloc: Nodes m233 are ready for job
[arcc-t05@m233 ...]$ Do stuff…
[]$ exit
exit
salloc: Relinquishing job allocation 13526338

...

Info
  • You submit a job to the queue and walk away.

  • Monitor its progress/state using command-line and/or email notifications.

  • Once complete, come back and analyze the results (the typical cycle is sketched below).
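
Code Block
# A sketch of the typical batch cycle covered in the rest of this section
# (the job id and file name are illustrative):
[]$ sbatch run.sh                 # 1. Submit the job to the queue and walk away.
[]$ squeue -u <username>          # 2. Monitor its progress/state while it is queued or running.
[]$ cat slurm-<job-id>.out        # 3. Once complete, come back and inspect the output.
[]$ sacct -u <username> -X        # 4. Review what happened after the fact.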

...

Submit Jobs: sbatch

...

Submit Jobs: sbatch: Example

Info

The following is an example script that we will use to submit a job to the cluster.

It uses a short test Python file, defined here: python script.

Code Block
#!/bin/bash                               
# Shebang indicating this is a bash script.
# Do NOT put a comment after the shebang, this will cause an error.
#SBATCH --account=<project-name>          # Use #SBATCH to define Slurm related values.
#SBATCH --time=10:00                      # Must define an account and wall-time.
#SBATCH --reservation=<reservation-name>
echo "SLURM_JOB_ID:" $SLURM_JOB_ID        # Can access Slurm related Environment variables.
start=$(date +'%D %T')                    # Can call bash commands.
echo "Start:" $start
module purge
module load gcc/13.2.0 python/3.10.6      # Load the modules you require for your environment.
python python01.py                        # Call your scripts/commands.
sleep 1m
end=$(date +'%D %T')
echo "End:" $end

...

Code Block
[]$ sbatch run.sh
Submitted batch job 13526340
[]$ squeue -u <username>
             JOBID PARTITION     NAME       USER  ST       TIME  NODES NODELIST(REASON)
          13526340     moran   run.sh <username>  R       0:05      1 m233
[]$ ls
python01.py  run.sh  slurm-13526340.out

[]$ cat slurm-13526340.out
SLURM_JOB_ID: 13526340
Start: 03/22/24 09:38:36
Python version: 3.10.6 (main, Oct 17 2022, 16:47:32) [GCC 12.2.0]
Version info: sys.version_info(major=3, minor=10, micro=6, releaselevel='final', serial=0)

[]$ squeue -u <username>
             JOBID PARTITION     NAME       USER ST       TIME  NODES NODELIST(REASON)
          13526340     moran   run.sh <username>  R       0:17      1 m233
Info
  • By default, an output file of the form slurm-<job-id>.out will be generated.

  • You can view this file while the job is still running (see the example below). Only view, do not edit.
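
Info

One convenient way to view the growing output of a running job is the standard Linux tail command (not Slurm-specific); the job id below is just the one from this example:

Code Block
# Follow the output file as the job writes to it; Ctrl+C stops watching (the job keeps running).
[]$ tail -f slurm-13526340.out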

...

Submit Jobs: squeue: What’s happening? Continued

...

Code Block
[]$ squeue -u <username>
             JOBID PARTITION     NAME       USER ST       TIME  NODES NODELIST(REASON)
          13526340     moran   run.sh <username>  R       0:29      1 m233
[]$ squeue -u <username>
             JOBID PARTITION     NAME     USER ST       TIME  NODES NODELIST(REASON)

[]$ cat slurm-13526340.out
SLURM_JOB_ID: 13526340
Start: 03/22/24 09:38:36
Python version: 3.10.6 (main, Oct 17 2022, 16:47:32) [GCC 12.2.0]
Version info: sys.version_info(major=3, minor=10, micro=6, releaselevel='final', serial=0)
End: 03/22/24 09:39:36

...

Code Block
# Lots more information
[]$ squeue --help
[]$ man squeue

# Display more columns:
# For example how much time is left of your requested wall time: TimeLeft
[]$ squeue -u <username> --Format="Account,UserName,JobID,SubmitTime,StartTime,TimeLeft"
ACCOUNT             USER                JOBID               SUBMIT_TIME         START_TIME          TIME_LEFT
<project-name>      <username>          1795458             2024-08-14T10:31:07 2024-08-14T10:31:09 6-04:42:51
<project-name>      <username>          1795453             2024-08-14T10:31:06 2024-08-14T10:31:07 6-04:42:49
<project-name>      <username>          1795454             2024-08-14T10:31:06 2024-08-14T10:31:07 6-04:42:49
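
Info

If you want the view to refresh automatically, the standard Linux watch command can wrap squeue (again, not a Slurm feature, just a convenience):

Code Block
# Re-run squeue every 10 seconds; Ctrl+C to stop.
[]$ watch -n 10 squeue -u <username>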
...


Submission from your Current Working Directory

Info

Remember from Linux that your current location is your Current Working Directory, abbreviated to CWD.

By default, Slurm will look for files, and write output, relative to the folder you submitted your script from, i.e. your CWD.

In the example above, if I called sbatch run.sh from ~/intro_to_modules/, then the Python script should reside within this folder. Any output will be written into this folder.

Within the submission script you can define paths (absolute/relative) to other locations.

Info

You can submit a script from any of your allowed locations: /home, /project and/or /gscratch.

But you need to manage and describe the paths to scripts, data and output appropriately (a sketch follows below).
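
Code Block
# A minimal sketch of using absolute/relative paths from within a submission script.
# The locations below are hypothetical placeholders; adjust them to your own folders.
#SBATCH --output=/gscratch/<username>/logs/slurm-%A.out    # Write output somewhere other than the CWD.
python /project/<project-name>/scripts/python01.py         # Call a script by its absolute path.
python ../shared/python01.py                               # Or by a path relative to the CWD.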

...

Submit Jobs: scancel: Cancel?

Code Block
[]$ sbatch run.sh
Submitted batch job 13526341
[]$ squeue -u <username>
             JOBID PARTITION     NAME       USER ST       TIME  NODES NODELIST(REASON)
          13526341     moran   run.sh <username>  R       0:03      1 m233
[]$ scancel 13526341
[]$ squeue -u <username>
             JOBID PARTITION     NAME     USER ST       TIME  NODES NODELIST(REASON)

[]$ cat slurm-13526341.out
SLURM_JOB_ID: 13526341
Start: 03/22/24 09:40:09
Python version: 3.10.6 (main, Oct 17 2022, 16:47:32) [GCC 12.2.0]
Version info: sys.version_info(major=3, minor=10, micro=6, releaselevel='final', serial=0)
slurmstepd: error: *** JOB 13526341 ON m233 CANCELLED AT 2024-03-22T09:40:17 ***
Info

If you know your job no longer needs to be running, please cancel it to free up resources - be a good cluster citizen. (Some other common forms of scancel are sketched below.)
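
Info

scancel can also select jobs by more than their id. A couple of commonly used forms, taken from its help/man pages (the values are placeholders):

Code Block
[]$ scancel -u <username>             # Cancel all of your own jobs.
[]$ scancel --name=<job-name>         # Cancel your jobs by job name.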

...

Submit Jobs: sacct: What happened?

The main Slurm sacct page.

Code Block
[]$ sacct -u <username> -X
JobID           JobName  Partition    Account  AllocCPUS      State ExitCode
------------ ---------- ---------- ---------- ---------- ---------- --------
13526337     interacti+      moran arccanetr+          1    TIMEOUT      0:0
13526338     interacti+      moran arccanetr+          1  COMPLETED      0:0
13526340         run.sh      moran arccanetr+          1  COMPLETED      0:0
13526341         run.sh      moran arccanetr+          1 CANCELLED+      0:0

# Lots more information
[]$ sacct --help
[]$ man sacct

# Display more columns:
[]$ sacct -u <username> --format="JobID,Partition,nnodes,NodeList,NCPUS,ReqMem,State,Start,Elapsed" -X
JobID         Partition   NNodes        NodeList      NCPUS     ReqMem      State               Start    Elapsed
------------ ---------- -------- --------------- ---------- ---------- ---------- ------------------- ----------
13526337          moran        1            m233          1      1000M    TIMEOUT 2024-03-22T09:35:25   00:01:28
13526338          moran        1            m233          1      1000M  COMPLETED 2024-03-22T09:37:41   00:00:06
13526340          moran        1            m233          1      1000M  COMPLETED 2024-03-22T09:38:35   00:01:01
13526341          moran        1            m233          1      1000M CANCELLED+ 2024-03-22T09:40:08   00:00:09
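
Info

sacct can also report on a single job if you give it the job id directly (see sacct --help); the id below is just the one from this example:

Code Block
# Summarize what happened to one specific job.
[]$ sacct -j 13526340 --format="JobID,JobName,State,Elapsed,ExitCode" -X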

...

Submit Jobs: sbatch: Options

Info

Here are some of the common options available:

Code Block
[]$ sbatch --help
#SBATCH --account=<project-name>        # Required: account/time
#SBATCH --time=72:00:00
#SBATCH --job-name=workshop             # Job name: Help to identify when using squeue.
#SBATCH --nodes=1                       # Options will typically have defaults.
#SBATCH --tasks-per-node=1              # Request resources in accordance to how you want
#SBATCH --cpus-per-task=1               # to parallelize your job, type of hardware, partition
#SBATCH --partition=mb                  # and if you require a GPU.
#SBATCH --gres=gpu:1
#SBATCH --mem=100G                      # Request specific memory needs.
#SBATCH --mem-per-cpu=10G
#SBATCH --mail-type=ALL                 # Get email notifications of the state of the job.
#SBATCH --mail-user=<email-address>
#SBATCH --output=<prefix>_%A.out        # Define a named output file postfixed with the job id.
Info
  • Both salloc and sbatch have 10s of options, in both short and long form (a short-form example follows this list).

  • Some options overlap in functionality, for example -G 1 requests a GPU much like --gres=gpu:1.

  • Please consult the commands' --help and man pages and/or web links to discover further options not listed.
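
Code Block
# As an illustration of the short forms: the core of the request above could also be written
# as follows (check sbatch --help for the full list of equivalents).
#SBATCH -A <project-name>      # --account
#SBATCH -t 72:00:00            # --time
#SBATCH -J workshop            # --job-name
#SBATCH -N 1                   # --nodes
#SBATCH -c 1                   # --cpus-per-task
#SBATCH -p mb                  # --partition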

...

Submit Jobs: sbatch: Options: Applied to Example

Info

Let’s take the previous example, and add some of the additional options:

Code Block
#!/bin/bash
#SBATCH --account=<project-name>
#SBATCH --time=10:00
#SBATCH --reservation=<reservation-name>
#SBATCH --job-name=pytest
#SBATCH --nodes=1
#SBATCH --cpus-per-task=1
#SBATCH --mail-type=ALL
#SBATCH --mail-user=<email-address>
#SBATCH --output=slurms/pyresults_%A.out

echo "SLURM_JOB_ID:" $SLURM_JOB_ID        # Can access Slurm related Environment variables.
start=$(date +'%D %T')                    # Can call bash commands.
echo "Start:" $start
module purge
module load gcc/13.2.0 python/3.10.6      # Load the modules you require for your environment.
python python01.py                        # Call your scripts/commands.
sleep 1m
end=$(date +'%D %T')
echo "End:" $end
Info

Notice:

  • I’ve given the job a specific name and have requested email notifications.

  • The output is written to a sub-folder slurms/ with a name of the form pyresults_<jobid>.out (see the note below about creating this folder).
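
Info

Note that Slurm will not create the slurms/ sub-folder for you; make sure it exists, in the folder you submit from, before submitting. For example:

Code Block
# Create the output sub-folder (if needed), then submit.
[]$ mkdir -p slurms
[]$ sbatch run.sh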

...

Extended Example: What Does the Run look Like?

Info

With the above settings, a submission will look something like the following:

Expand
Example Flow and Output:
Code Block
# Submit the job:
[intro_to_modules]$ sbatch run.sh
Submitted batch job 1817260

# Notice the NAME is now 'pytest'
[intro_to_modules]$ squeue -u <username>
             JOBID PARTITION     NAME     USER ST       TIME  NODES NODELIST(REASON)
           1817260        mb   pytest <username>  R       0:58      1 mbcpu-002

# I can view the output while the job is running.
# The output is now in a sub folder under slurms/
# It also uses the name 'pyresults_<job_id>.out'
[intro_to_modules]$ cat slurms/pyresults_1817260.out
SLURM_JOB_ID: 1817260
Start: 08/14/24 14:48:38
Python version: 3.10.6 (main, Apr 30 2024, 11:23:04) [GCC 13.2.0]
Version info: sys.version_info(major=3, minor=10, micro=6, releaselevel='final', serial=0)

# Once the job has finished, it no longer appears in the queue:
[intro_to_modules]$ squeue -u <username>
             JOBID PARTITION     NAME     USER ST       TIME  NODES NODELIST(REASON)

[intro_to_modules]$ cat slurms/pyresults_1817260.out
SLURM_JOB_ID: 1817260
Start: 08/14/24 14:48:38
Python version: 3.10.6 (main, Apr 30 2024, 11:23:04) [GCC 13.2.0]
Version info: sys.version_info(major=3, minor=10, micro=6, releaselevel='final', serial=0)
End: 08/14/24 14:49:38
Info

In my inbox, I also received two emails with the subjects:

  1. medicinebow Slurm Job_id=1817260 Name=pytest Began, Queued time 00:00:00
     • This will have no text within the email body.

  2. medicinebow Slurm Job_id=1817260 Name=pytest Ended, Run time 00:01:01, COMPLETED, ExitCode 0
     • The body of this email contained the seff results.
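
Info

The seff utility mentioned in the second email can also be run by hand after a job completes, to summarize how efficiently the requested resources were used; the job id below is just the one from this example:

Code Block
# Report the CPU/memory efficiency of a completed job.
[]$ seff 1817260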

...