Goal: An introduction to Slurm: how to start interactive sessions, submit jobs, and monitor them.

...

Code Block
[]$ salloc -A arccanetrain -t 1:00 --reservation=<reservation-name>
Warning
For the August 2024 workshop, the reservation is Aug_bootcamp
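
Once salloc returns, it can help to confirm that the shell really is inside an allocation. The following is a minimal sketch, assuming Slurm sets SLURM_JOB_ID inside the interactive session; the helper name in_allocation is hypothetical and not a Slurm command.

```shell
# in_allocation is a hypothetical helper, not part of Slurm.
# It succeeds when SLURM_JOB_ID is set and non-empty.
in_allocation() {
    [ -n "${SLURM_JOB_ID:-}" ]
}

if in_allocation; then
    echo "Inside Slurm job ${SLURM_JOB_ID}"
else
    echo "Not inside a Slurm allocation"
fi
```

Run on a login node before salloc, this should take the second branch; run inside the interactive session, it should report the job ID.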

...

Interactive Session: squeue: What’s happening?

...

Info
  • You submit a job to the queue and walk away.

  • Monitor its progress/state using command-line and/or email notifications.

  • Once complete, come back and analyze results.
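
The email notifications mentioned above can be requested from within the batch script. The following is a sketch only: --mail-type and --mail-user are standard sbatch options, but <your-email-address> is a placeholder, and whether mail is actually delivered depends on how the cluster is configured (an assumption here).

```shell
#!/bin/bash
#SBATCH --account=arccanetrain
#SBATCH --time=10:00
#SBATCH --mail-type=BEGIN,END,FAIL        # Send email when the job starts, ends, or fails.
#SBATCH --mail-user=<your-email-address>  # Placeholder: replace with your own address.

# Outside Slurm the variable is unset, so fall back to a label.
job_label="${SLURM_JOB_ID:-not-under-slurm}"
echo "Job: ${job_label}"
```

The #SBATCH lines are ordinary comments to bash, so the script also runs unchanged outside Slurm for a quick syntax check.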

...

Submit Jobs: sbatch: Template

Code Block
#!/bin/bash
# Shebang indicating this is a bash script.
# Do NOT put a comment on the shebang line itself; this will cause an error.
#SBATCH --account=arccanetrain            # Use #SBATCH to define Slurm related values.
#SBATCH --time=10:00                      # Must define an account and wall-time.
#SBATCH --reservation=<reservation-name>
echo "SLURM_JOB_ID:" $SLURM_JOB_ID        # Can access Slurm related Environment variables.
start=$(date +'%D %T')                    # Can call bash commands.
echo "Start:" $start
module load gcc/12.2.0 python/3.10.6      # Load the modules you require for your environment.
python python01.py                        # Call your scripts/commands.
sleep 1m
end=$(date +'%D %T')
echo "End:" $end
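
The template prints human-readable start and end times. If the elapsed wall time is also wanted, one possible variation (a sketch, not part of the original template) is to record seconds since the epoch with date +%s and subtract. The variable names are illustrative.

```shell
# Record the start time in seconds since the epoch.
start_s=$(date +%s)

# Stand-in for the real work; the template uses "sleep 1m".
sleep 1

# Record the end time and compute the difference.
end_s=$(date +%s)
elapsed=$((end_s - start_s))
echo "Elapsed: ${elapsed} seconds"
```

Comparing this value with the requested --time can help when choosing a wall-time for later runs.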

...

Code Block
[]$ sbatch run.sh
Submitted batch job 13526340
[]$ squeue -u arcc-t05
             JOBID PARTITION     NAME     USER ST       TIME  NODES NODELIST(REASON)
          13526340     moran   run.sh arcc-t05  R       0:05      1 m233
[]$ ls
python01.py  run.sh  slurm-13526340.out
# You can view this file while the job is still running.
[]$ cat slurm-13526340.out
SLURM_JOB_ID: 13526340
Start: 03/22/24 09:38:36
Python version: 3.10.6 (main, Oct 17 2022, 16:47:32) [GCC 12.2.0]
Version info: sys.version_info(major=3, minor=10, micro=6, releaselevel='final', serial=0)
[]$ squeue -u arcc-t05
             JOBID PARTITION     NAME     USER ST       TIME  NODES NODELIST(REASON)
          13526340     moran   run.sh arcc-t05  R       0:17      1 m233

Info
  • By default, an output of the form: slurm-<job-id>.out will be generated.

  • You can view this file while the job is still running. Only view, do not edit.
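
The link between the sbatch message and the default output file name can be sketched as follows. Both helper functions are hypothetical, written only for illustration; they assume the "Submitted batch job <job-id>" message format shown in the example session.

```shell
# Hypothetical helpers for illustration; neither is a Slurm command.

# Extract the job ID from sbatch's "Submitted batch job <id>" message.
job_id_from_sbatch() {
    local message="$1"
    echo "${message##* }"    # Keep only the text after the last space.
}

# Build the default output file name for a given job ID.
default_output_file() {
    echo "slurm-$1.out"
}

job_id=$(job_id_from_sbatch "Submitted batch job 13526340")
default_output_file "$job_id"
```

On the cluster, the message would come from the real command, for example job_id_from_sbatch "$(sbatch run.sh)". A read-only tool such as cat or tail -f is a safe way to look at the file while the job is still writing to it.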

...

Submit Jobs: squeue: What’s happening? (Continued)

Code Block
[]$ squeue -u arcc-t05
             JOBID PARTITION     NAME     USER ST       TIME  NODES NODELIST(REASON)
          13526340     moran   run.sh arcc-t05  R       0:29      1 m233
[]$ squeue -u arcc-t05
             JOBID PARTITION     NAME     USER ST       TIME  NODES NODELIST(REASON)
# squeue only shows pending and running jobs.
# If a job is no longer in the queue then it has finished. 
# Finished can mean success, failure, timeout... It’s just no longer running.
[]$ cat slurm-13526340.out
SLURM_JOB_ID: 13526340
Start: 03/22/24 09:38:36
Python version: 3.10.6 (main, Oct 17 2022, 16:47:32) [GCC 12.2.0]
Version info: sys.version_info(major=3, minor=10, micro=6, releaselevel='final', serial=0)
End: 03/22/24 09:39:36
Info
  • The squeue command only shows pending and running jobs.

  • If a job is no longer in the queue then it has finished.

  • Finished can mean success, failure, timeout... It’s just no longer running.
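
Because a job that has left the queue no longer appears in squeue, a script can decide whether anything is still pending or running by counting the rows below the header. The following is a minimal sketch; count_listed_jobs is a hypothetical helper, and the sample listings are copied from the session above.

```shell
# count_listed_jobs is a hypothetical helper, not a Slurm command.
# It reads squeue-style output on stdin, skips the header line,
# and prints how many non-empty job rows remain.
count_listed_jobs() {
    local count=0
    read -r _header || true
    while read -r line; do
        if [ -n "$line" ]; then
            count=$((count + 1))
        fi
    done
    echo "$count"
}

# Sample listings copied from the example session.
running='             JOBID PARTITION     NAME     USER ST       TIME  NODES NODELIST(REASON)
          13526340     moran   run.sh arcc-t05  R       0:29      1 m233'
finished='             JOBID PARTITION     NAME     USER ST       TIME  NODES NODELIST(REASON)'

echo "$running"  | count_listed_jobs    # One job row below the header.
echo "$finished" | count_listed_jobs    # Header only, so no jobs.
```

On the cluster, the input would come from the real command, for example squeue -u arcc-t05 | count_listed_jobs. A count of zero only says the jobs are no longer pending or running, not whether they succeeded.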

...

Submit Jobs: scancel: Cancel?

Code Block
[]$ sbatch run.sh
Submitted batch job 13526341
[]$ squeue -u arcc-t05
             JOBID PARTITION     NAME     USER ST       TIME  NODES NODELIST(REASON)
          13526341     moran   run.sh arcc-t05  R       0:03      1 m233
[]$ scancel 13526341
[]$ squeue -u arcc-t05
             JOBID PARTITION     NAME     USER ST       TIME  NODES NODELIST(REASON)
[]$ cat slurm-13526341.out
SLURM_JOB_ID: 13526341
Start: 03/22/24 09:40:09
Python version: 3.10.6 (main, Oct 17 2022, 16:47:32) [GCC 12.2.0]
Version info: sys.version_info(major=3, minor=10, micro=6, releaselevel='final', serial=0)
slurmstepd: error: *** JOB 13526341 ON m233 CANCELLED AT 2024-03-22T09:40:17 ***
Info

If you know your job no longer needs to run, please cancel it to free up resources. Be a good cluster citizen.
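
When reviewing output files later, a cancelled job can be recognised by the slurmstepd message shown in the example. The following is a sketch under that assumption; was_cancelled is a hypothetical helper, and the exact message text may differ between Slurm versions.

```shell
# was_cancelled is a hypothetical helper, not a Slurm command.
# It succeeds when the given output file contains the cancellation
# message that slurmstepd wrote in the example session.
was_cancelled() {
    grep -q 'CANCELLED AT' "$1"
}

# Demonstrate on a temporary file holding the line from the example.
sample=$(mktemp)
echo 'slurmstepd: error: *** JOB 13526341 ON m233 CANCELLED AT 2024-03-22T09:40:17 ***' > "$sample"
if was_cancelled "$sample"; then
    echo "Job was cancelled"
fi
rm -f "$sample"
```

On the cluster, the argument would be a real output file, for example was_cancelled slurm-13526341.out.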

...

Submit Jobs: sacct: What happened?

...