Loren - Job Scheduler (Slurm)

The Loren cluster uses a job scheduler for managing and running jobs across the cluster. Slurm commands allow the user to submit jobs, query information about jobs, and perform other functions.

For a general introduction to using Slurm, see https://blog.ronin.cloud/slurm-intro/

Cluster nodes are grouped into different Slurm partitions:

Partition   Description                                                            Nodes
---------   --------------------------------------------------------------------   ------------------------------------------------
Loren       The default partition for running jobs on Loren. It includes most      loren[01-10,12-19,21-26,28-29,32-35,37-44,46-50]
            of the nodes and should be used for most jobs. These nodes have
            Nvidia K20Xm GPUs.
Loren-k80   Nodes with Nvidia K80 GPUs.                                            loren[70-74]
quick       Short-duration jobs, up to one hour.                                   loren[30-31]
mango       Specialty partition used to run the mango software.                    loren[51,53-54,57-60]
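You can confirm the partition list and see current node states with Slurm's "sinfo" command. A sample of the summarized output is shown below; the availability counts and time limits are illustrative, not actual values for Loren:

sinfo -s

PARTITION  AVAIL  TIMELIMIT   NODES(A/I/O/T)  NODELIST
Loren*        up   infinite       30/11/2/43  loren[01-10,12-19,21-26,28-29,32-35,37-44,46-50]
Loren-k80     up   infinite          3/2/0/5  loren[70-74]
quick         up    1:00:00          0/2/0/2  loren[30-31]
mango         up   infinite          5/2/0/7  loren[51,53-54,57-60]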

Getting help on Slurm Commands

For any of the following commands, you can use the "man" command to find out more about its usage, e.g.:

man sbatch
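Most Slurm commands also accept a --help option that prints a brief usage summary:

sbatch --help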

Submitting a job

You can start a batch job using the "sbatch" command. This command takes the name of a script file that defines the job to be run.

You can also use the “salloc” command to start an interactive job. See the man page for information.
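As a minimal sketch of an interactive session (the resource values here are placeholders, not recommendations):

salloc -p Loren -c 1 --mem 100 -t 0-1:00   # request an allocation; takes the same resource options as sbatch
srun hostname                              # run a command on the allocated node
exit                                       # release the allocation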

The following example script specifies a partition, time limit, memory allocation, and number of cores. All your scripts should specify values for these four parameters. You can also set additional parameters as shown, such as the job name and output files. This script performs a simple task: it generates a file of random numbers and then sorts it. A detailed explanation of the script is available here.

#!/bin/bash
#
#SBATCH -p Loren            # partition (queue)
#SBATCH -c 1                # number of cores
#SBATCH --mem 100           # memory pool for all cores
#SBATCH -t 0-2:00           # time (D-HH:MM)
#SBATCH -o slurm.%N.%j.out  # STDOUT
#SBATCH -e slurm.%N.%j.err  # STDERR

for i in {1..100000}; do
    echo $RANDOM >> SomeRandomNumbers.txt
done
sort SomeRandomNumbers.txt

Now you can submit your job with the command:

sbatch myscript.sh
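sbatch prints the ID assigned to the job, which the monitoring and control commands in the following sections take as <jobid>. The number shown here is just an example:

Submitted batch job 123456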

If you want to test your job and find out when it is estimated to run, use the following (note that this does not actually submit the job):

sbatch --test-only myscript.sh
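This prints an estimated start time and node assignment; the exact wording and values below are illustrative:

sbatch: Job 123456 to start at 2023-09-27T14:00:00 using 1 processors on nodes loren01 in partition Loren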

Checking queued jobs

To see what is currently running on the cluster, you can use the "arccjobs" command, which shows all jobs currently submitted to the cluster.

To see specific information about what is in the job queue, use the "squeue" command. Given no parameters, this command displays all queued jobs.
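For example, the default output looks like the following (the jobs shown are hypothetical):

  JOBID PARTITION      NAME     USER ST       TIME  NODES NODELIST(REASON)
 123456     Loren     myjob     jdoe  R      12:34      1 loren05
 123457     quick      test     jdoe PD       0:00      1 (Resources)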

  • List all current jobs for a user:
    squeue -u <username>

  • List all running jobs for a user:
    squeue -u <username> -t RUNNING

  • List all pending jobs for a user:
    squeue -u <username> -t PENDING

  • List priority order of jobs for the current user (you) in a given partition:
    showq-slurm -o -u -q <partition>

  • List all current jobs for a user in a particular partition:
    squeue -u <username> -p <partition>

  • List jobs run by the current user since a certain date:
    sacct --starttime 2023-09-27

  • List detailed information for a job (useful for troubleshooting):
    scontrol show jobid -dd <jobid>

  • List status info for a currently running job:
    sstat --format=AveCPU,AvePages,AveRSS,AveVMSize,JobID -j <jobid> --allsteps

  • Once your job has completed, you can get additional information that was not available during the run. This includes run time, memory used, etc.
    To get statistics on completed jobs by jobID (sample output is shown after this list):
    sacct -j <jobid> --format=JobID,JobName,MaxRSS,Elapsed

  • To view the same information for all jobs of a user:
    sacct -u <username> --format=JobID,JobName,MaxRSS,Elapsed

  • To view the job script and environment:
    showjob <jobid>
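The sacct output referenced above, with the JobID,JobName,MaxRSS,Elapsed format string, looks roughly like this (the job and values are hypothetical):

       JobID    JobName     MaxRSS    Elapsed
------------ ---------- ---------- ----------
123456            myjob              00:05:23
123456.batch      batch      1024K   00:05:23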

Controlling Jobs

To cancel/delete a job, use the "scancel" command, providing the jobid to be cancelled. Here are some additional commands you can use.

  • To cancel one job:
    scancel <jobid>

  • To cancel all the jobs for a user:
    scancel -u <username>

  • To cancel all the pending jobs for a user:
    scancel -t PENDING -u <username>

  • To cancel one or more jobs by name:
    scancel --name myJobName

  • To hold a particular job from being scheduled:
    scontrol hold <jobid>

  • To release a particular job to be scheduled:
    scontrol release <jobid>

  • To requeue (cancel and rerun) a particular job:
    scontrol requeue <jobid>
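Note that scancel produces no output on success. To confirm a job is gone, check the queue afterward (the job ID here is hypothetical):

scancel 123456
squeue -u <username>   # the cancelled job should no longer be listed (it may appear briefly in the CG, "completing", state)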