Loren - Job Scheduler (Slurm)
The Loren cluster uses a job scheduler for managing and running jobs across the cluster. Slurm commands allow the user to submit jobs, query information about jobs, and perform other functions.
For a web page that describes how to use Slurm, see A simple Slurm guide for beginners.
Cluster nodes are grouped into different Slurm partitions:
| Partition | Description | Nodes |
|---|---|---|
| Loren | The default partition for running jobs on Loren. It includes most of the cluster's nodes and should be used for most jobs. These nodes have Nvidia K20Xm GPUs. | loren[01-10,12-19,21-26,28-29,32-35,37-44,46-50] |
| Loren-k80 | This queue has nodes with Nvidia K80 GPUs. | loren[70-74] |
| quick | This queue allows short-duration jobs of up to one hour. | loren[30-31] |
| mango | Specialty queue used to run the mango software. | loren[51,53-54,57-60] |
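To check the current status of these partitions and their nodes, the standard Slurm “sinfo” command can be used; the partition named below is just an example:
sinfo              # summary of all partitions and their node states
sinfo -p quick     # show only the quick partition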
Getting help on Slurm Commands
For any of the following commands, you can use the “man” command to find out more about how to use it, e.g.:
man sbatch
Submitting a job
You can start a batch job using the “sbatch” command. This command requires you to provide the name of a script file that defines the job.
You can also use the “salloc” command to start an interactive job. See the man page for information.
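For example, a small interactive session might be requested along the following lines; the partition, core count, memory, and time limit shown are only illustrative, so adjust them to your own work (depending on configuration, salloc either opens a shell in the allocation or lets you run commands with srun inside it):
# Request an interactive session: 1 core, 1 GB of memory, up to 1 hour, on the quick partition
salloc -p quick -c 1 --mem=1G -t 0-1:00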
The following example script specifies a partition, time limit, memory allocation, and number of cores. All your scripts should specify values for these four parameters. You can also set additional parameters as shown, such as the output and error file names. This script performs a simple task: it generates a file of random numbers and then sorts it. A detailed explanation of the script is available here.
#!/bin/bash
#
#SBATCH -p Loren # partition (queue)
#SBATCH -c 1 # number of cores
#SBATCH --mem 100 # memory pool for all cores
#SBATCH -t 0-2:00 # time (D-HH:MM)
#SBATCH -o slurm.%N.%j.out # STDOUT
#SBATCH -e slurm.%N.%j.err # STDERR
for i in {1..100000}; do
    echo $RANDOM >> SomeRandomNumbers.txt
done
sort SomeRandomNumbers.txt
Now you can submit your job with the command:
sbatch myscript.sh
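On success, sbatch prints the ID assigned to your job (the number below is just an example); you will need this ID for the monitoring and control commands described later on this page:
Submitted batch job 123456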
If you want to test your job and find out when it is estimated to run, use the following (note that this does not actually submit the job):
sbatch --test-only myscript.sh
Checking queued jobs
To see what is currently running in the cluster, you can use the “arccjobs” command, which shows all jobs currently submitted to the cluster.
To see specific information about what is in the job queue, use the “squeue” command. Given no parameters, this command displays all queued jobs.
List all current jobs for a user:
squeue -u <username>
List all running jobs for a user:
squeue -u <username> -t RUNNING
List all pending jobs for a user:
squeue -u <username> -t PENDING
List priority order of jobs for the current user (you) in a given partition:
showq-slurm -o -u -q <partition>
List all current jobs in a given partition for a user:
squeue -u <username> -p <partition>
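squeue also accepts a custom output format if you want different columns; the format string below is just one possible choice (see man squeue for the full list of fields):
# Job ID, partition, name, state, time used, node count, and reason/node list
squeue -u <username> -o "%.10i %.9P %.20j %.8T %.10M %.6D %R"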
List jobs run by the current user since a certain date:
sacct --starttime 2023-09-27
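sacct also takes a --format option, so the same date-based query can include extra columns; the fields below are standard sacct fields chosen as an example:
# Jobs since 2023-09-27 with name, state, elapsed time, and exit code
sacct --starttime 2023-09-27 --format=JobID,JobName,State,Elapsed,ExitCode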
List detailed information for a job (useful for troubleshooting):
scontrol show jobid -dd <jobid>
List status info for a currently running job:
sstat --format=AveCPU,AvePages,AveRSS,AveVMSize,JobID -j <jobid> --allsteps
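For a quick check of how much memory a running job has used so far, a shorter field list works too; this is just one example selection:
# Peak memory (MaxRSS) and average CPU time used so far by a running job
sstat -j <jobid> --allsteps --format=JobID,MaxRSS,AveCPU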
Once your job has completed, you can get additional information that was not available during the run, such as run time and memory used.
To get statistics on completed jobs by jobID:
sacct -j <jobid> --format=JobID,JobName,MaxRSS,Elapsed
To view the same information for all jobs of a user:
sacct -u <username> --format=JobID,JobName,MaxRSS,Elapsed
To view the job script and environment:
showjob <jobid>
Controlling Jobs
To cancel/delete a job, use the “scancel” command, providing the jobid to be cancelled. Here are some additional commands you can use.
To cancel one job:
scancel <jobid>
To cancel all the jobs for a user:
scancel -u <username>
To cancel all the pending jobs for a user:
scancel -t PENDING -u <username>
To cancel one or more jobs by name:
scancel --name myJobName
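These filters can also be combined; for example, to cancel only your pending jobs in a particular partition:
# Cancel only this user's pending jobs in the given partition
scancel -u <username> -t PENDING -p <partition>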
To hold a particular job from being scheduled:
scontrol hold <jobid>
To release a particular job to be scheduled:
scontrol release <jobid>
To requeue (cancel and rerun) a particular job:
scontrol requeue <jobid>
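If you need to hold (and later release) everything you currently have queued, the commands above can be combined with squeue; the pipeline below is just one way to do it, assuming GNU xargs is available:
# Hold every pending job belonging to this user (release them later the same way with "scontrol release")
squeue -u <username> -t PENDING -h -o %i | xargs -r -n1 scontrol hold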