...

Table of Contents

Glossary

Frequently Asked Questions

...

Required Inputs and Default Values and Limits

...

Interactive jobs let users work interactively at the command line, or with debuggers (ddt, gdb), profilers (map, gprof), or language interpreters such as Python, R, or Julia.

Special Hardware / Configuration Requests

Slurm is a flexible and powerful workload manager. It has been configured so that jobs can request specific node features and specialized hardware. Some of these are requested as a Generic Resource (GRES) with --gres, while others are requested through the --constraint option.

GPU Requests

Request 16 CPUs and 2 GPUs for an interactive session:

Code Block
 $ salloc -A arcc --time=40:00 -N 1 --ntasks-per-node=1 --cpus-per-task=16 --gres=gpu:2 

Request 16 CPUs and 1 GPU of type P100 in a batch script:

Code Block
#!/bin/bash

#SBATCH --account=arcc
#SBATCH --time=1-00:00:00
#SBATCH --nodes=1
#SBATCH --ntasks-per-node=1
#SBATCH --cpus-per-task=16
#SBATCH --gres=gpu:P100:1

srun gpu_application
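
Inside a running job, one quick way to confirm that the GPU request took effect is to inspect CUDA_VISIBLE_DEVICES, which Slurm normally sets for GPU GRES allocations. The sketch below is a hypothetical helper (count_visible_gpus is not a Slurm or ARCC command) and assumes the variable holds a comma-separated device list.

```shell
#!/bin/bash
# Hypothetical helper: count the GPUs exposed to the current job.
# Assumes Slurm exported CUDA_VISIBLE_DEVICES as a comma-separated
# list (for example "0,1"). Prints 0 if the variable is unset or empty.
count_visible_gpus() {
    local devices="${CUDA_VISIBLE_DEVICES:-}"
    if [ -z "$devices" ]; then
        echo 0
        return 0
    fi
    # Split on commas and print the number of fields.
    echo "$devices" | awk -F',' '{ print NF }'
}
```

In a batch script, a line such as `echo "GPUs visible: $(count_visible_gpus)"` placed before the srun command records the result in the job output.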

Long Job QOS Configuration

To allow projects to temporarily run jobs for up to 14 days, ARCC has established a special QOS (long-jobs-14) with the following limits:

  • 14-day wall clock time limit
  • 10 max running jobs

ARCC can create other QOSs with different limits as needed.

QOS Creation

To create the QOS for this feature, issue the following command as root on tmgt1.

Code Block
sacctmgr add qos <QOS name> set Flags=PartitionTimeLimit MaxWall=14-0 MaxJobsPA=10

For example, to create a QOS with a 14-day wall time limit and a maximum of 10 running jobs:

Code Block
sacctmgr add qos long-jobs-14 set Flags=PartitionTimeLimit MaxWall=14-0 MaxJobsPA=10

Allow Access to the QOS

Once the QOS with the proper limits has been created, you need to apply it to the project.

Code Block
sacctmgr modify account <project name> where cluster=teton set qos+=long-jobs-14

Now that you have enabled the long-jobs-14 QOS on a project, inform the users to add:

Code Block
--qos=long-jobs-14

to their salloc, sbatch, or srun command.

Remove Access to the QOS

Once the project no longer needs to run longer jobs, remove its access to the QOS.

Code Block
sacctmgr modify account <project name> where cluster=teton set qos-=long-jobs-14
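
Because the grant and revoke commands differ only in the operator, a small dry-run wrapper can reduce typing mistakes. The sketch below is a hypothetical helper (qos_access_cmd is not an ARCC command). It only prints the sacctmgr command for review and does not execute anything. The defaults "teton" and "long-jobs-14" are taken from the commands above.

```shell
#!/bin/bash
# Hypothetical dry-run helper: print (do not execute) the sacctmgr
# command that grants or revokes a QOS for a project.
# Arguments: grant|revoke <project> [cluster] [qos]
# Defaults for cluster and qos come from the examples in this section.
qos_access_cmd() {
    local action="${1:-}" project="${2:-}"
    local cluster="${3:-teton}" qos="${4:-long-jobs-14}"
    local op
    case "$action" in
        grant)  op="+=" ;;
        revoke) op="-=" ;;
        *)
            echo "usage: qos_access_cmd grant|revoke <project> [cluster] [qos]" >&2
            return 1
            ;;
    esac
    if [ -z "$project" ]; then
        echo "error: project name is required" >&2
        return 1
    fi
    echo "sacctmgr modify account $project where cluster=$cluster set qos${op}${qos}"
}
```

Running `qos_access_cmd grant arcc` prints the command to review; once checked, it can be pasted into a root shell on the management node.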

Examples

Example 1

In the following example, we use ARCC as our example project. We want to give ARCC access to run longer jobs. We assume that the "long-jobs-14" QOS has already been created.

  • We run the command "assoc", which returns the following definition for ARCC from the Slurm database:

Code Block
            Account       User   Def QOS                  QOS 
-------------------- ---------- --------- --------------------

inv-arcc                            arcc          arcc,normal 
 arcc                               arcc          arcc,normal
  arcc                awillou2      arcc          arcc,normal 
  arcc                dperkin6      arcc          arcc,normal 
  arcc                 jbaker2      arcc          arcc,normal 
  arcc                  jrlang      arcc          arcc,normal 
  arcc                mkillean      arcc          arcc,normal 
  arcc                powerman      arcc          arcc,normal 
  arcc                salexan5      arcc          arcc,normal

This shows the default QOS configuration: "arcc" is the default QOS that all arcc jobs run under, and "arcc" project users have access to either the "normal" or "arcc" QOS.

  • We want to give the "arcc" project access to the 14-day job runtime feature. We do this by adding the proper QOS to the ARCC project:

Code Block
sacctmgr modify account arcc where cluster=teton set qos+=long-jobs-14
  • To verify that the QOS has been added to the "arcc" project, we run the "assoc" command as root:

Code Block
            Account       User   Def QOS                  QOS 
-------------------- ---------- --------- --------------------

inv-arcc                            arcc            arcc,normal 
 arcc                               arcc arcc,long-job-14,norm+ 
  arcc                awillou2      arcc arcc,long-job-14,norm+
  arcc                dperkin6      arcc arcc,long-job-14,norm+
  arcc                 jbaker2      arcc arcc,long-job-14,norm+
  arcc                  jrlang      arcc arcc,long-job-14,norm+
  arcc                mkillean      arcc arcc,long-job-14,norm+
  arcc                powerman      arcc arcc,long-job-14,norm+
  arcc                salexan5      arcc arcc,long-job-14,norm+
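
The fixed-width listing truncates the QOS column ("norm+"), so a scripted check is easier against pipe-delimited output. The sketch below is a hypothetical helper (assoc_has_qos is not an ARCC command). It assumes input lines of the form account|user|qos, such as those produced by something like `sacctmgr -nP show assoc where account=arcc format=account,user,qos`. Whether the local "assoc" command can emit this format is not documented here.

```shell
#!/bin/bash
# Hypothetical check: read pipe-delimited "account|user|qos" lines on
# stdin and succeed only if every line for the given account lists the QOS.
# Prints each user missing the QOS ("(account)" for the account-level row).
# Also fails if no line for the account was seen at all.
assoc_has_qos() {
    local account="${1:-}" qos="${2:-}"
    awk -F'|' -v acct="$account" -v qos="$qos" '
        $1 == acct {
            seen = 1
            found = 0
            n = split($3, list, ",")
            for (i = 1; i <= n; i++) {
                if (list[i] == qos) found = 1
            }
            if (!found) {
                print ($2 == "" ? "(account)" : $2)
                missing = 1
            }
        }
        END { exit (seen && !missing) ? 0 : 1 }
    '
}
```

A typical use would be piping the parsable association listing into `assoc_has_qos arcc long-jobs-14` and checking the exit status.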

Notes

  • Do we advertise this?

Code Block
 Keep it under wraps for now, since this is allowed only on a per-request basis.
  • How do we stop people from abusing it?

Code Block
 Several safeguards are in place to prevent abuse:

We allow a maximum of 10 running jobs under this QOS, and ARCC must explicitly enable access to the long-jobs-14 QOS.

By default, we don't attach this QOS to projects. Once the project no longer needs to run long jobs, we remove the QOS from the project.

Troubleshooting

  • Node won't come online

If a node won't come online, check the node information for the reason Slurm recorded. Run:

Code Block
scontrol show node=XXX

The command output should include the reason why Slurm won't bring the node online. As an example:

Code Block
root@tmgt1:/apps/s/lenovo/dsa# scontrol show node=mtest2
NodeName=mtest2 Arch=x86_64 CoresPerSocket=10 
   CPUAlloc=0 CPUTot=20 CPULoad=0.02
   AvailableFeatures=ib,dau,haswell,arcc
   ActiveFeatures=ib,dau,haswell,arcc
   Gres=(null)
   NodeAddr=mtest2 NodeHostName=mtest2 Version=18.08
   OS=Linux 3.10.0-693.21.1.el7.x86_64 #1 SMP Fri Feb 23 18:54:16 UTC 2018 
   RealMemory=64000 AllocMem=0 FreeMem=55805 Sockets=2 Boards=1
   State=IDLE+DRAIN ThreadsPerCore=1 TmpDisk=0 Weight=1 Owner=N/A MCS_label=N/A
   Partitions=arcc 
   BootTime=06.08-11:44:57 SlurmdStartTime=06.08-11:47:35
   CfgTRES=cpu=20,mem=62.50G,billing=20
   AllocTRES=
   CapWatts=n/a
   CurrentWatts=0 LowestJoules=0 ConsumedJoules=0
   ExtSensorsJoules=n/s ExtSensorsWatts=0 ExtSensorsTemp=n/s
   Reason=Low RealMemory [slurm@06.10-10:00:27]

This indicates that the memory definition for the node and what Slurm actually found are different. You can use

Code Block
free -m

to see what the system thinks it has in terms of memory.

The memory in the node definition should be less than or equal to the total shown by the "free" command. Verify that the configured value matches the memory the node should have. If it does not, investigate the cause of the discrepancy.
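
The comparison described above can be scripted. The sketch below uses hypothetical helper names (configured_mem_mb, detected_mem_mb, mem_definition_ok) and assumes the `scontrol show node` and `free -m` output formats shown in this section. It is a rough aid for spotting the mismatch, not a reproduction of Slurm's own check.

```shell
#!/bin/bash
# Hypothetical helpers for investigating a "Low RealMemory" drain reason.

# Extract the configured RealMemory (MB) from `scontrol show node` output on stdin.
configured_mem_mb() {
    grep -o 'RealMemory=[0-9]*' | head -n 1 | cut -d= -f2
}

# Extract the total memory (MB) from `free -m` output on stdin.
detected_mem_mb() {
    awk '/^Mem:/ { print $2 }'
}

# Succeed if the configured value does not exceed what the node reports.
# Arguments: <configured MB> <detected MB>
mem_definition_ok() {
    local configured="$1" detected="$2"
    [ "$configured" -le "$detected" ]
}
```

For example, `scontrol show node=mtest2 | configured_mem_mb` gives the configured value, and running `free -m | detected_mem_mb` on the node itself gives the detected total; passing both to mem_definition_ok reports whether the definition fits.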

Configuring Slurm for Investments

The Teton cluster is the University of Wyoming's condo cluster, which provides computing resources to the general UW research community. Because it is a condo cluster, researchers can invest funds in the cluster to expand its functionality. As an investor, a researcher is afforded special privileges, specifically first access to the nodes their funds purchased.

To establish an investment within Slurm, follow these steps:

  1. First, define an investor partition that refers to the purchased nodes. To create the partition definition, edit /apps/s/slurm/latest/etc/partitions-invest.conf and add:

Code Block
# Comment describing the investment
PartitionName=inv-<investment-name> AllowQos=<investment-name> \
  Default=No \
  Priority=10 \
  State=UP \
  Nodes=<nodelist> \
  PreemptMode=off \
  TRESBillingWeights="CPU=1.0,Mem=.00025"

Where:
  • investment-name is the name you wish to call the new investment

  • nodelist is the list of nodes to be included in the investment definition, e.g. t[305-315],t317

  • Adjust the TRESBillingWeights accordingly based on the node specifications

Code Block
Note: The nodes should also be added to the general partition list, i.e. teton
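
When adjusting TRESBillingWeights, it can help to compute what a job using a whole node would be billed. The sketch below is a hypothetical helper (billing_units is not a Slurm command). It assumes Slurm's default behavior, where billing is the sum of each TRES amount multiplied by its weight and an unsuffixed Mem weight applies per MB. If the site sets PriorityFlags=MAX_TRES, the calculation differs.

```shell
#!/bin/bash
# Hypothetical helper: estimate billing units for an allocation.
# Assumes billing = cpus * cpu_weight + mem_mb * mem_weight,
# i.e. summed TRES with the Mem weight applied per MB.
# Arguments: <cpus> <mem MB> <CPU weight> <Mem weight>
billing_units() {
    local cpus="$1" mem_mb="$2" cpu_weight="$3" mem_weight="$4"
    awk -v c="$cpus" -v m="$mem_mb" -v cw="$cpu_weight" -v mw="$mem_weight" \
        'BEGIN { printf "%.2f\n", c * cw + m * mw }'
}
```

For a 20-CPU node with 64000 MB of memory and the weights shown above, `billing_units 20 64000 1.0 .00025` prints 36.00 under these assumptions.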
  1. Once you have checked and rechecked your work for correctness, reconfigure Slurm with the new partition definition:

Code Block
scontrol reconfigure

The following steps require access to two ARCC-created commands:

  • add_slurm_inv

  • add_project_to_inv

  1. Now that you have the investor partition set up, you need to create the associated Slurm DB entries. First, run:

Code Block
/root/bin/idm_scripts/add_slurm_inv inv-<investment-name>

This will create the investor umbrella account that ties the investment to projects.

  1. Now add the investor project to the investor umbrella account.

...

...