Slurm is the basis on which all jobs are submitted, including both batch and interactive jobs. Slurm consists of several user-facing commands, each of which has a Unix man page that should be consulted.
Glossary
There are some default limits set for Slurm jobs. The following are required for every submission:
Walltime limit: --time=days-hours:mins:secs
Project account: --account=account
Default Values
Additionally, the default submission has the following characteristics:
Node count: one node (-N 1, --nodes=1)
Task count: one task (-n 1, --ntasks-per-node=1)
Memory: 1000 MB RAM per CPU (--mem-per-cpu=1000)
These defaults can be changed by setting the appropriate flags to request a different allocation scheme. Please reference our Slurm documentation.
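To see how the per-CPU memory default scales with a request, the sketch below multiplies it out. The helper name job_mem_mb is hypothetical, and the calculation assumes memory is requested only through --mem-per-cpu, so the job total is simply the CPU count times the MB per CPU.

```shell
#!/bin/bash
# Hypothetical helper: total memory (MB) implied by a --mem-per-cpu request.
# Arguments: tasks, CPUs per task, MB per CPU.
job_mem_mb() {
    local tasks=$1 cpus_per_task=$2 mem_per_cpu=$3
    echo $(( tasks * cpus_per_task * mem_per_cpu ))
}

# Default submission: 1 task x 1 CPU x 1000 MB
job_mem_mb 1 1 1000
# One task with --cpus-per-task=4 at the default 1000 MB per CPU
job_mem_mb 1 4 1000
```

The second call prints 4000, which is why a multi-core job can receive more memory than the 1000 MB default suggests without any memory flag being changed.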
Default Limits
On Mount Moran, the default limits were expressed as the number of cores each project account could use concurrently. Investors received an increase in concurrent core usage capability. To facilitate more flexible scheduling for all research groups, ARCC is looking at implementing limits based on concurrent usage of cores, memory, and walltime of jobs. This will be defined in the near future and will be subject to the FAC review.
Partitions
The Slurm configuration on Teton is fairly complex in order to accommodate the layout of hardware, investors, and runtime limits. The following tables describe the partitions on Teton. Some require a QoS, which will be auto-assigned during job submission. The tables list the Slurm allocatable units rather than hardware units.
General Partitions
Teton General Slurm Partitions
Partition | Max Walltime | Node Cnt | Core Cnt | Thds / Core | CPUs | Mem (MB) / Node | Req'd QoS |
---|---|---|---|---|---|---|---|
Moran | 7-00:00:00 | 284 | 4544 | 1 | 4544 | 64000 or 128000 | N/A |
teton | 7-00:00:00 | 180 | 5760 | 1 | 5760 | 128000 | N/A |
teton-gpu | 7-00:00:00 | 8 | 256 | 1 | 256 | 512000 | N/A |
teton-hugemem | 7-00:00:00 | 10 | 256 | 1 | 256 | 1024000 | N/A |
teton-knl | 7-00:00:00 | 12 | 864 | 4 | 3456 | 384000 | N/A |
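The values in the table can be compared against the live configuration. A minimal sketch, assuming the standard sinfo format specifiers (%P partition, %l time limit, %D node count, %c CPUs per node, %m memory per node in MB). The function name partition_summary is hypothetical.

```shell
partition_summary() {
    # With no argument, list every partition; otherwise only the named one.
    if [ -n "${1:-}" ]; then
        sinfo --partition="$1" --format="%P %l %D %c %m"
    else
        sinfo --format="%P %l %D %c %m"
    fi
}

# Usage on a login node:
#   partition_summary teton-gpu
```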
Investor Partitions
Investor partitions are likely to be quite heterogeneous and may contain a mix of hardware, which is indicated below where appropriate. They require a special QoS for access.
Teton Investor Slurm Partitions
Partition | Max Walltime | Node Cnt | Core Cnt | Thds / Core | Mem (MB) / Node | Req'd QoS | Preemption | Owner | Associated Projects |
---|---|---|---|---|---|---|---|---|---|
inv-arcc | Unlimited | 2 | 44 | 1 | 64000 or 192000 | TODO | Disabled | Jeffrey Lang | arcc |
inv-atmo2grid | 7-00:00:00 | 31 | 496 | 1 | 64000 | TODO | Disabled | Dr. Naughton, Dr. Mavriplis, Dr. Stoellinger | turbmodel, rotarywingcfg |
inv-chemistry | 7-00:00:00 | 6 | 96 | 1 | 128000 | TODO | Disabled | Dr. Hulley | hulleylab, pahc, chemcal |
inv-clune | 7-00:00:00 | 16 | 256 | 1 | Mixed | TODO | Disabled | Dr. Clune | evolvingai, iwctml |
inv-compmicrosc | 7-00:00:00 | 6 | 96 | 1 | 128000 | TODO | Disabled | Dr. Aidey (Composite Micro Sciences) | rd-hea |
inv-compsci | 7-00:00:00 | 12 | 288 | 1 | 384999 | TODO | Disabled | Dr. Lars Kotthoff | mallet |
inv-fertig | 7-00:00:00 | 1 | 16 | 1 | 128000 | TODO | Disabled | Dr. Fertig | gbfracture |
inv-geology | 7-00:00:00 | 16 | 256 | 1 | 64000 | TODO | Disabled | Dr. Chen, Dr. Mallick | inversion, f3dt, geologiccarbonseq, stochasticaquiferinv |
inv-inbre | 7-00:00:00 | 24 | 160 | 1 | 128000 | TODO | Disabled | Dr. Blouin | inbre-train, inbreb, inbrev, human_microbiome |
inv-jang-condel | 7-00:00:00 | 2 | 32 | 1 | 128000 | TODO | Disabled | Dr. Jang-Condel | exoplanets, planets |
inv-liu | 7-00:00:00 | 4 | 64 | 1 | 128000 | TODO | Disabled | Dr. Liu | gwt |
inv-microbiome | 7-00:00:00 | 85 | 2816 | 1 | 128000 | TODO | Disabled | Dr. Ewers | bbtrees, plantanalytics |
inv-multicfd | 7-00:00:00 | 11 | 352 | 1 | 128000 | TODO | Disabled | Dr. Mousaviraad, Mechanical Engineering | multicfd |
inv-physics | 7-00:00:00 | 4 | 128 | 1 | 128000 | TODO | Disabled | Dr. Dahnovsky | euo, 2dferromagnetism, d0ferromagnetism, microporousmat |
inv-wagner | 7-00:00:00 | 2 | 32 | 1 | 128000 | TODO | Disabled | Dr. Wagner | wagnerlab, latesgenomics, ltcichlidgenomics, phylogenref, ysctrout |
Special Partitions
Special partitions require access to be given directly to user accounts or project accounts and likely require additional approval for access.
Partition | Max Walltime | Node Cnt | Core Cnt | Thds / Core | Mem (MB) / Node | Owner | Associated Projects | Notes |
---|---|---|---|---|---|---|---|---|
dgx | 7-00:00:00 | 2 | 40 | 2 | 512000 | Dr. Clune | See partition inv-clune above | NVIDIA V100 with NVLink, Ubuntu 16.04 |
inv-compsci | 7-00:00:00 | 12 | 72 | 4 | 512000 | Dr. Kotthoff | See partition inv-compsci above | This includes the KNL nodes only |
More Details
Generally, to run a job on a cluster you will need the following:
A handy migration reference to compare MOAB/Torque commands to SLURM commands can be found on the SLURM home site: Batch System Rosetta Stone.
Commands
sacct
salloc
sbatch
Submit a batch job consisting of a single job or job array. Batch jobs can be submitted in several ways. Most commonly, a script file is provided as an argument on the command line. Alternatively, though more rarely, the script can be supplied on standard input so that the batch job is created interactively. We recommend writing the batch job in a script so that it may be referenced at a later time.
scancel
Cancel jobs after submission. Works on pending and running jobs. By default, provide a jobid or set of jobids to cancel. Alternatively, one can use sets of flags to cancel specific jobs relating to account, name, partition, qos, reservation, nodelist. To cancel all array tasks, specify the parent jobid.
sinfo
squeue
sreport
srun
A front-end launcher for job steps, both serial and parallel. srun can be considered equivalent to mpirun or mpiexec when launching MPI jobs. Each use of srun inside a job defines a job step, and Slurm records accounting information for it (memory, CPU time, and other parameters) that is valuable when a job terminates unexpectedly or when historical information is needed.
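As a sketch of how job-step accounting can be read back afterwards, the function below wraps sacct. The field list is one reasonable selection rather than a site recommendation, and job_steps is a hypothetical name.

```shell
job_steps() {
    # $1 is a job ID. Each srun call made inside that job is reported
    # as its own row (for example 12345.0, 12345.1).
    sacct -j "$1" --format=JobID,JobName,Elapsed,MaxRSS,TotalCPU,State
}

# Usage, with a placeholder job ID:
#   job_steps 12345
```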
Batch Jobs
Batch jobs are submitted either as a job script or as commands typed interactively into sbatch. A submitted job enters the queueing system, waits until its resources can be allocated, and then executes. Execution could start immediately if the queue is not completely full, after a short time if preemption is opted for, or after an extended wait if the queue is full or running limits have already been reached.
A simple sbatch script to submit a simple "Hello World!" type problem follows:
#!/bin/bash
### Assume this file is named hello.sh
#SBATCH --account=arcc
#SBATCH --time=24:00:00
echo "Hello World!"
The two '#SBATCH' directives above are required for all job submissions, whether interactive or batch. Change the account value to the appropriate project account and the time value to an appropriate walltime limit. Note that this is a walltime limit, not a CPU time limit. These values can also be supplied directly on the command line when submitting. Slurm defaults jobs to one node, one task per node, and one CPU per task.
Submitting Jobs
A job script is submitted by passing it to sbatch. The account and time can also be given on the command line directly rather than as directives in the shell script:
$ sbatch --account=arcc --time=24:00:00 test.sh
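On success, sbatch prints a confirmation line of the form "Submitted batch job <id>". When a later command needs that ID, it can be extracted as sketched below. The helper name parse_jobid is hypothetical and the sample line is placeholder text, not captured output.

```shell
# Hypothetical helper: take the job ID (the last field) from sbatch's
# confirmation line, e.g. "Submitted batch job 12345".
parse_jobid() {
    echo "$1" | awk '{ print $NF }'
}

# On a cluster this would be: output=$(sbatch hello.sh)
output="Submitted batch job 12345"   # placeholder text for illustration
jobid=$(parse_jobid "$output")
echo "$jobid"
```

sbatch also accepts a --parsable option that prints the job ID without the surrounding text, which avoids the parsing step in scripts.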
Single Node, Multi-Core Jobs
Slurm creates allocations of resources, and the resources requested can vary depending on the work to be done on the cluster. A batch job that requires multiple cores can have a few different layouts depending on what is intended to be run. If the job is a multi-threaded application, such as one using OpenMP or pthreads, it's best to set the number of tasks to 1. The script below requests a single node with 4 cores. Assuming OpenMP, it sets the number of threads from the job-provided environment variable SLURM_CPUS_PER_TASK.
#!/bin/bash
#SBATCH --account=arcc
#SBATCH --time=24:00:00
#SBATCH --nodes=1
#SBATCH --ntasks-per-node=1
#SBATCH --cpus-per-task=4
export OMP_NUM_THREADS=$SLURM_CPUS_PER_TASK
srun ./application
Single Node, Multi-Tasks
This layout suits a multi-task job where the application has its own parallel processing engine or uses MPI but scales poorly across multiple nodes.
#!/bin/bash
#SBATCH --account=arcc
#SBATCH --time=24:00:00
#SBATCH --nodes=1
#SBATCH --ntasks-per-node=4
#SBATCH --cpus-per-task=1
### Assuming MPI application
srun ./application
Multi-Node, Non-Multithreaded
An application that strictly uses MPI can often use multiple nodes. However, many MPI programs do not implement multithreading, in which case the number of CPUs per task should be set to 1.
#!/bin/bash
#SBATCH --account=arcc
#SBATCH --time=24:00:00
#SBATCH --nodes=4
#SBATCH --ntasks-per-node=4
#SBATCH --cpus-per-task=1
### Assuming 'application' is on your $PATH environment variable
srun application
Multi-Node, Multithreaded
Some applications have been developed to take advantage of both distributed memory parallelism and shared memory parallelism such that they're capable of using MPI and threading together. This often requires the user to find the right balance based on additional resources required such as memory per task, network bandwidth, and node core count. The example below requests that 4 nodes be allocated, each supporting 4 MPI ranks, with each MPI rank supporting 4 threads. The total CPU request aggregates to 64 (i.e., 4 x 4 x 4).
#!/bin/bash
#SBATCH --account=arcc
#SBATCH --time=24:00:00
#SBATCH --nodes=4
#SBATCH --ntasks-per-node=4
#SBATCH --cpus-per-task=4
export OMP_NUM_THREADS=$SLURM_CPUS_PER_TASK
srun application -arg1 -arg2
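The 4 x 4 x 4 arithmetic for this layout can be checked with a few lines of shell. The variable names below are illustrative and simply mirror the three directives in the script.

```shell
nodes=4
ntasks_per_node=4
cpus_per_task=4

# MPI ranks across the whole job, then CPUs across the whole job.
total_tasks=$(( nodes * ntasks_per_node ))
total_cpus=$(( total_tasks * cpus_per_task ))

echo "MPI ranks: $total_tasks, CPUs: $total_cpus"
```

This prints "MPI ranks: 16, CPUs: 64". Changing any one of the three values shows how quickly a hybrid request grows against the partition core counts listed earlier.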
Checking Status and Canceling
You can use the squeue command to display the status of all your jobs:
and scancel to delete a particular job from the queue:
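The command lines themselves are not shown above. A minimal sketch using standard squeue and scancel options follows. The function names are hypothetical and the job ID in the usage comments is a placeholder.

```shell
my_jobs() {
    # -u restricts the listing to one user's jobs.
    squeue -u "${USER:-$(id -un)}"
}

cancel_job() {
    # $1 is the job ID reported by sbatch or shown by squeue.
    scancel "$1"
}

# Usage:
#   my_jobs
#   cancel_job 12345
```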
Viewing the Results
Once your job has completed, you should see two files in the directory from which you submitted the job. By default, these will be named <jobname>.oXXXXX and <jobname>.eXXXXX (where the <jobname> is replaced by the name of the SLURM script and the X's are replaced by the numerical portion of the job identifier returned by sbatch). In the Hello World example, any output from the job sent to "standard output" will be written to the hello.oXXXXX file and any output sent to "standard error" will be written to the hello.eXXXXX file.
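Stock Slurm, absent any site-level override, writes both streams to a single slurm-<jobid>.out file, so the names described above may depend on site configuration. It can be safer to request the file names explicitly. A sketch, assuming the standard filename patterns %x (job name) and %j (job ID), with a job name of hello and a five-minute walltime chosen only for illustration:

```shell
#!/bin/bash
#SBATCH --account=arcc
#SBATCH --time=00:05:00
#SBATCH --job-name=hello
# Standard output goes to a file such as hello.o12345
#SBATCH --output=%x.o%j
# Standard error goes to a file such as hello.e12345
#SBATCH --error=%x.e%j

greeting="Hello World!"
echo "$greeting"                  # written to the --output file
echo "a diagnostic message" >&2   # written to the --error file
```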
Interactive Jobs
Interactive jobs allow shell access to compute nodes, where you can run applications interactively, do heavy processing of files, or compile large applications. They can be requested with arguments similar to those for batch jobs. ARCC has configured the clusters such that Slurm interactive allocations will give shell access on the compute nodes themselves rather than keeping the shell on the login node. The salloc command is appropriate to launch interactive jobs.
$ salloc --account=arcc --time=40:00 --nodes=1 --ntasks-per-node=1 --cpus-per-task=8
The value of interactive jobs is that they allow users to work interactively at the CLI or with debuggers (ddt, gdb), profilers (map, gprof), or language interpreters such as Python, R, or Julia.
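Because the interactive shell lands on a compute node, it can be useful to confirm where a shell is running before starting heavy work. A small sketch, assuming only that Slurm sets SLURM_JOB_ID inside an allocation and that it is normally unset on login nodes. The function name in_allocation is hypothetical.

```shell
in_allocation() {
    # Succeeds when SLURM_JOB_ID is set and non-empty.
    [ -n "${SLURM_JOB_ID:-}" ]
}

if in_allocation; then
    echo "Job $SLURM_JOB_ID on $(hostname)"
else
    echo "Not inside a Slurm allocation"
fi
```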
Special Hardware / Configuration Requests
Slurm is a flexible and powerful workload manager. It has been configured to be expressive enough to allocate particular node features and specialized hardware. Some are requested as a Generic Resource (GRES), while others are requested through the constraints option.
GPU Requests
Request 16 CPUs and 2 GPUs for an interactive session:
$ salloc -A arcc --time=40:00 -N 1 --ntasks-per-node=1 --cpus-per-task=16 --gres=gpu:2
Request 16 cpus, 1 GPU of type P100 in a batch script:
#!/bin/bash
#SBATCH --account=arcc
#SBATCH --time=1-00:00:00
#SBATCH --nodes=1
#SBATCH --ntasks-per-node=1
#SBATCH --cpus-per-task=16
#SBATCH --gres=gpu:P100:1
srun gpu_application
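Inside a GPU job it is worth confirming how many devices were actually granted. The sketch below counts the entries in CUDA_VISIBLE_DEVICES. It assumes the cluster's GRES setup exports that variable, which is common for Slurm GPU allocations but is not confirmed by this page, and visible_gpus is a hypothetical helper name.

```shell
visible_gpus() {
    # CUDA_VISIBLE_DEVICES is a comma-separated list such as "0,1".
    # Unset or empty is treated as no GPUs.
    if [ -z "${CUDA_VISIBLE_DEVICES:-}" ]; then
        echo 0
    else
        echo "$CUDA_VISIBLE_DEVICES" | awk -F',' '{ print NF }'
    fi
}

echo "GPUs visible to this job: $(visible_gpus)"
```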
Long Job QOS Configuration
To allow projects to temporarily run jobs for 14 days ARCC has established a special QOS (long-jobs-14) with the following limits:
14-day wall clock time limit
10 max running jobs
ARCC can create other QOSs with different limits as needed.
QOS Creation
To create the QOS for this feature, issue the following command as root on tmgt1.
sacctmgr add qos <QOS name> set Flags=PartitionTimeLimit MaxWall=14-0 MaxJobsPA=10
As an example, to create a QOS with a 14-day wall time and a maximum of 10 running jobs:
sacctmgr add qos long-jobs-14 set Flags=PartitionTimeLimit MaxWall=14-0 MaxJobsPA=10
Allow Access to the QOS
Once the QOS with the proper limits has been created you need to apply it to the project.
sacctmgr modify account <project name> where cluster=teton set qos+=long-jobs-14
Now that you have enabled the long-jobs-14 QOS on a project, inform the users to request that QOS (Slurm's --qos option) in their salloc, sbatch, or srun command.
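A sketch of what the request could look like in a batch script, assuming the QOS is named long-jobs-14 as created above and is selected with Slurm's standard --qos option. The 10-day walltime is an arbitrary illustration.

```shell
#!/bin/bash
#SBATCH --account=arcc
#SBATCH --qos=long-jobs-14
#SBATCH --time=10-00:00:00

# SLURM_JOB_QOS is set by Slurm inside the job.
# The fallback value only applies when this runs outside a job.
qos_in_use="${SLURM_JOB_QOS:-none}"
echo "Running under QOS: $qos_in_use"
```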
Remove Access to the QOS
Once the project no longer needs to run longer jobs, remove the project's access to the QOS.
sacctmgr modify account <project name> where cluster=teton set qos-=long-jobs-14
Examples
Example 1
In the following example, we use arcc as our example project. We want to give arcc access to run longer jobs. We assume that the "long-jobs-14" QOS has previously been created.
Account User Def QOS QOS
-------------------- ---------- --------- --------------------
inv-arcc arcc arcc,normal
arcc arcc arcc,normal
arcc awillou2 arcc arcc,normal
arcc dperkin6 arcc arcc,normal
arcc jbaker2 arcc arcc,normal
arcc jrlang arcc arcc,normal
arcc mkillean arcc arcc,normal
arcc powerman arcc arcc,normal
arcc salexan5 arcc arcc,normal
This shows the default configuration for the QOS setup, with "arcc" being the default QOS under which all arcc jobs run. Users of the arcc project have access to either the "normal" or the "arcc" QOS.
sacctmgr modify account arcc where cluster=teton set qos+=long-jobs-14
Account User Def QOS QOS
-------------------- ---------- --------- --------------------
inv-arcc arcc arcc,normal
arcc arcc arcc,long-job-14,norm+
arcc awillou2 arcc arcc,long-job-14,norm+
arcc dperkin6 arcc arcc,long-job-14,norm+
arcc jbaker2 arcc arcc,long-job-14,norm+
arcc jrlang arcc arcc,long-job-14,norm+
arcc mkillean arcc arcc,long-job-14,norm+
arcc powerman arcc arcc,long-job-14,norm+
arcc salexan5 arcc arcc,long-job-14,norm+
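The command that produced these listings is not shown. One way to obtain a similar view, offered as an assumption rather than as the original command, is sacctmgr's association listing with an explicit format. The function name show_project_qos is hypothetical.

```shell
show_project_qos() {
    # $1 is the project (account) name, e.g. arcc.
    sacctmgr show associations where account="$1" cluster=teton \
        format=Account,User,DefaultQOS,QOS
}

# Usage:
#   show_project_qos arcc
```

Running it before and after the modify command is a quick way to confirm that the QOS was attached to every user of the project.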
Notes
Keep it under wraps for now since this will be allowed on a per request basis.
There are a couple of things in place to prevent abuse of this:
We allow a maximum of only 10 jobs running under this QOS.
ARCC must enable access to the long-jobs-14 QOS.
By default, we don't attach this QOS to projects. Once the requirement for the project to run long jobs is over we will remove the QOS from the project.
Trouble Shooting
If a node won't come online, check the node information for a Slurm reason. Run scontrol show node=<nodename>.
The command output should include a reason for why slurm won't bring the node online. As an example:
root@tmgt1:/apps/s/lenovo/dsa# scontrol show node=mtest2
NodeName=mtest2 Arch=x86_64 CoresPerSocket=10
CPUAlloc=0 CPUTot=20 CPULoad=0.02
AvailableFeatures=ib,dau,haswell,arcc
ActiveFeatures=ib,dau,haswell,arcc
Gres=(null)
NodeAddr=mtest2 NodeHostName=mtest2 Version=18.08
OS=Linux 3.10.0-693.21.1.el7.x86_64 #1 SMP Fri Feb 23 18:54:16 UTC 2018
RealMemory=64000 AllocMem=0 FreeMem=55805 Sockets=2 Boards=1
State=IDLE+DRAIN ThreadsPerCore=1 TmpDisk=0 Weight=1 Owner=N/A MCS_label=N/A
Partitions=arcc
BootTime=06.08-11:44:57 SlurmdStartTime=06.08-11:47:35
CfgTRES=cpu=20,mem=62.50G,billing=20
AllocTRES=
CapWatts=n/a
CurrentWatts=0 LowestJoules=0 ConsumedJoules=0
ExtSensorsJoules=n/s ExtSensorsWatts=0 ExtSensorsTemp=n/s
Reason=Low RealMemory [slurm@06.10-10:00:27]
This indicates that the memory definition for the node and what Slurm actually found are different. You can use the free command on the node to see what the system thinks it has in terms of memory.
The node definition should have a memory definition less than or equal to the total shown by the "free" command. You should verify that the settings are correct for the memory the node should have. If not, investigate and determine the cause of the discrepancy.
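The comparison described above can be scripted. This is a sketch under stated assumptions: free -m reports total memory in MB in the second field of its "Mem:" row, the configured value is the RealMemory figure from the node definition, and both function names are hypothetical.

```shell
# Total memory in MB as the operating system reports it.
node_total_mb() {
    free -m | awk '/^Mem:/ { print $2 }'
}

# Succeeds when the configured RealMemory (MB) does not exceed
# what the node reports.
memory_ok() {
    local configured_mb=$1 actual_mb=$2
    [ "$configured_mb" -le "$actual_mb" ]
}

# Usage on the affected node, with 64000 taken from the RealMemory value above:
#   memory_ok 64000 "$(node_total_mb)" || echo "RealMemory is larger than the node reports"
```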
Configuring Slurm for Investments
The Teton cluster is the University of Wyoming's condo cluster, which provides computing resources to the general UW research community. Because it is a condo cluster, researchers can invest funds into the cluster in order to expand its functionality. As an investor, a researcher is afforded special privileges, specifically first access to the nodes their funds purchased.
To establish an investment within Slurm, take the following steps:
First, define an investor partition that refers to the purchased nodes. To create the partition definition, edit /apps/s/slurm/latest/etc/partitions-invest.conf and add:
# Comment describing the investment
PartitionName=inv-<investment-name> AllowQos=<investment-name> \
Default=No \
Priority=10 \
State=UP \
Nodes=<nodelist> \
PreemptMode=off \
TRESBillingWeights="CPU=1.0,Mem=.00025"
Where:
investment-name is the name you wish to call the new investment
nodelist is the list of nodes to be included in the investment definition, i.e. t[305-315],t317
Adjust the TRESBillingWeights accordingly based on the node specifications
Note: The nodes should also be added to the general partition list, i.e. teton
Once you have checked and re-checked your work for correctness, configure Slurm with the new partition definition:
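The command for this step is missing from the text above. In a stock Slurm installation the controller re-reads its configuration files with scontrol reconfigure, so a plausible sketch, offered as an assumption and not as ARCC's documented procedure, is the following. The function name apply_partition_change is hypothetical.

```shell
apply_partition_change() {
    # $1 is the new partition name, e.g. inv-<investment-name>.
    # Re-read the configuration, then confirm the partition is visible.
    scontrol reconfigure && sinfo --partition="$1"
}
```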
For the following you will need access to two ARCC created commands:
add_slurm_inv
add_project_to_inv
Now that you have the investor partition set up, you need to create the associated Slurm DB entries. First, run
/root/bin/idm_scripts/add_slurm_inv inv-<investment-name>
This will create the investor umbrella account that ties the investment to projects.
Now add the investor project to the investor umbrella account.
/root/bin/idm_scripts/add_proj_to_inv inv-<investment-name> <project>