Goal: Introduction to Slurm: how to start interactive sessions, submit jobs, and monitor them.
...
Code Block |
---|
[]$ salloc -A arccanetrain -t 1:00 --reservation=<reservation-name> |
Warning |
For the August 2024 workshop, the reservation is Aug_bootcamp |
...
Interactive Session: squeue
: What’s happening?
...
...
Submit Jobs: sbatch
: Template
Code Block |
---|
#!/bin/bash
# Shebang indicating this is a bash script.
# Do NOT put a comment on the shebang line itself; this will cause an error.
#SBATCH --account=arccanetrain            # Use #SBATCH to define Slurm related values.
#SBATCH --time=10:00                      # Must define an account and wall-time.
#SBATCH --reservation=<reservation-name>
echo "SLURM_JOB_ID:" $SLURM_JOB_ID        # Can access Slurm related environment variables.
start=$(date +'%D %T')                    # Can call bash commands.
echo "Start:" $start
module load gcc/12.2.0 python/3.10.6      # Load the modules you require for your environment.
python python01.py                        # Call your scripts/commands.
sleep 1m
end=$(date +'%D %T')
echo "End:" $end
|
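The wall-time value in the template is easy to misread: with a single colon, Slurm reads `--time=10:00` as minutes:seconds, so it requests ten minutes, not ten hours. The sketch below converts the common Slurm time formats (`minutes`, `minutes:seconds`, `hours:minutes:seconds`, and the `days-hours[:minutes[:seconds]]` forms) to seconds. The function name `slurm_time_to_seconds` is a hypothetical helper for illustration, not part of Slurm.

```shell
#!/bin/bash
# Hypothetical helper (not a Slurm command): convert a Slurm time-limit
# string such as "10:00" or "2-12" into a number of seconds.
slurm_time_to_seconds() {
    local spec=$1 days=0 hours=0 minutes=0 seconds=0 rest
    if [[ $spec == *-* ]]; then
        # Forms with a day prefix: days-hours[:minutes[:seconds]]
        days=${spec%%-*}
        rest=${spec#*-}
        IFS=: read -r hours minutes seconds <<< "$rest"
    else
        local a b c
        IFS=: read -r a b c <<< "$spec"
        if [[ -n $c ]]; then
            hours=$a minutes=$b seconds=$c    # hours:minutes:seconds
        elif [[ -n $b ]]; then
            minutes=$a seconds=$b             # minutes:seconds
        else
            minutes=$a                        # minutes only
        fi
    fi
    # 10# forces base 10 so fields like "08" are not parsed as octal.
    echo $(( 10#${days:-0} * 86400 + 10#${hours:-0} * 3600 \
           + 10#${minutes:-0} * 60 + 10#${seconds:-0} ))
}
```

For example, `slurm_time_to_seconds 10:00` gives the number of seconds the template above requests, which makes it easy to compare against the `sleep 1m` in the script.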
...
Code Block |
---|
[]$ sbatch run.sh
Submitted batch job 13526340
[]$ squeue -u arcc-t05
   JOBID PARTITION     NAME     USER ST       TIME  NODES NODELIST(REASON)
13526340     moran   run.sh arcc-t05  R       0:05      1 m233
[]$ ls
python01.py  run.sh  slurm-13526340.out
# You can view this file while the job is still running.
[]$ cat slurm-13526340.out
SLURM_JOB_ID: 13526340
Start: 03/22/24 09:38:36
Python version: 3.10.6 (main, Oct 17 2022, 16:47:32) [GCC 12.2.0]
Version info: sys.version_info(major=3, minor=10, micro=6, releaselevel='final', serial=0)
[]$ squeue -u arcc-t05
   JOBID PARTITION     NAME     USER ST       TIME  NODES NODELIST(REASON)
13526340     moran   run.sh arcc-t05  R       0:17      1 m233
|
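The job ID printed by `sbatch` is what every later command (`squeue`, `scancel`, the `slurm-<jobid>.out` file name) refers to, so it can be handy to capture it in a variable. A minimal sketch, assuming the confirmation line has the "Submitted batch job <id>" form shown above; `job_id_from_sbatch` is a hypothetical helper name.

```shell
#!/bin/bash
# Hypothetical helper: pull the numeric job ID out of sbatch's confirmation line.
job_id_from_sbatch() {
    # sbatch prints "Submitted batch job <id>"; the ID is the last field.
    awk '/^Submitted batch job/ { print $NF }' <<< "$1"
}

# Example use on a cluster (left commented out, since it needs Slurm):
#   jobid=$(job_id_from_sbatch "$(sbatch run.sh)")
#   squeue -j "$jobid"
#   cat "slurm-${jobid}.out"
```

As an alternative, `sbatch --parsable` prints the job ID without the surrounding text (on multi-cluster setups it may be followed by a semicolon and the cluster name), which avoids the parsing step.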
Submit Jobs: squeue
: What’s happening Continued?
...
Code Block |
---|
[]$ squeue -u arcc-t05
   JOBID PARTITION     NAME     USER ST       TIME  NODES NODELIST(REASON)
13526340     moran   run.sh arcc-t05  R       0:29      1 m233
[]$ squeue -u arcc-t05
   JOBID PARTITION     NAME     USER ST       TIME  NODES NODELIST(REASON)
# squeue only shows pending and running jobs.
# If a job is no longer in the queue then it has finished.
# Finished can mean success, failure, timeout... It's just no longer running.
[]$ cat slurm-13526340.out
SLURM_JOB_ID: 13526340
Start: 03/22/24 09:38:36
Python version: 3.10.6 (main, Oct 17 2022, 16:47:32) [GCC 12.2.0]
Version info: sys.version_info(major=3, minor=10, micro=6, releaselevel='final', serial=0)
End: 03/22/24 09:39:36
|
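Rather than re-running `squeue` by hand until the job disappears, the same check can be put in a loop. This is a minimal sketch, not an ARCC-provided tool: `wait_for_job` is a hypothetical name, and the 30-second default polling interval is an assumption chosen to avoid hammering the scheduler.

```shell
#!/bin/bash
# Hypothetical helper: block until a job no longer appears in squeue.
# Usage: wait_for_job <jobid> [poll-interval-seconds]
wait_for_job() {
    local jobid=$1 interval=${2:-30}
    # -h drops the header line and -j restricts the listing to this job,
    # so an empty result means the job is no longer pending or running.
    while [ -n "$(squeue -h -j "$jobid" 2>/dev/null)" ]; do
        sleep "$interval"
    done
}

# Example use on a cluster (left commented out, since it needs Slurm):
#   wait_for_job 13526340
#   cat slurm-13526340.out
```

As noted in the session above, leaving the queue only means the job is no longer running; the loop says nothing about whether it succeeded.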
...
Submit Jobs: scancel
: Cancel?
Code Block |
---|
[]$ sbatch run.sh
Submitted batch job 13526341
[]$ squeue -u arcc-t05
   JOBID PARTITION     NAME     USER ST       TIME  NODES NODELIST(REASON)
13526341     moran   run.sh arcc-t05  R       0:03      1 m233
[]$ scancel 13526341
[]$ squeue -u arcc-t05
   JOBID PARTITION     NAME     USER ST       TIME  NODES NODELIST(REASON)
[]$ cat slurm-13526341.out
SLURM_JOB_ID: 13526341
Start: 03/22/24 09:40:09
Python version: 3.10.6 (main, Oct 17 2022, 16:47:32) [GCC 12.2.0]
Version info: sys.version_info(major=3, minor=10, micro=6, releaselevel='final', serial=0)
slurmstepd: error: *** JOB 13526341 ON m233 CANCELLED AT 2024-03-22T09:40:17 ***
|
Info |
---|
If you know your job no longer needs to run, please cancel it to free up resources. Be a good cluster citizen. |
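The `slurmstepd` line at the end of the cancelled job's output file is the trace a cancellation leaves behind, and a script can look for it. A minimal sketch, assuming the message contains the text "CANCELLED AT" as in the output above; `job_was_cancelled` is a hypothetical helper name.

```shell
#!/bin/bash
# Hypothetical helper: succeed if a job's output file records a cancellation.
# Usage: job_was_cancelled slurm-<jobid>.out
job_was_cancelled() {
    # Search for the slurmstepd "CANCELLED AT" message; -q suppresses output,
    # so only the exit status reports whether the message was found.
    grep -q 'CANCELLED AT' "$1"
}

# Example use:
#   if job_was_cancelled slurm-13526341.out; then
#       echo "Job was stopped before it finished."
#   fi
```

A job that hits its wall-time limit may log a similar message, so a match is best read as "stopped by Slurm" rather than specifically "cancelled by the user".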
...
Submit Jobs: sacct
: What happened?
...
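Because `squeue` stops listing a job once it leaves the queue, `sacct` is the command to ask how the job ended. The following is a minimal sketch under the assumption that job accounting is enabled on the cluster; `job_summary` is a hypothetical wrapper name, and the chosen fields are just one reasonable selection.

```shell
#!/bin/bash
# Hypothetical wrapper: print a one-line summary of how a job ended.
# Usage: job_summary <jobid>
job_summary() {
    # -X: report the job allocation only, not each individual step
    # -n: no header line
    # -P: pipe-delimited output that is easy to parse
    # -o: fields to display
    sacct -j "$1" -X -n -P -o JobID,State,ExitCode,Elapsed
}

# Example use on a cluster (left commented out, since it needs Slurm):
#   job_summary 13526340
#   job_summary 13526341
```

The State field distinguishes outcomes such as COMPLETED, FAILED, TIMEOUT and CANCELLED, which is the information `squeue` cannot provide after the fact.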