2.1.3 Job Scheduling Policy



Job Scheduling on ARCC HPC Systems

This section reflects the general ARCC policy for scheduling jobs on all HPC systems administered by ARCC. Because the purpose of HPC systems varies from system to system, please refer to the system-specific sections below for policies particular to each system.

Beartooth

Overview

This policy reflects the ARCC policy for scheduling jobs specifically on Beartooth. Beartooth does not offer the traditional relationship between users and queues. Rather, it provides a single, all-encompassing pool of nodes and regulates usage through node reservations and job prioritization.

Definitions/descriptions

QoS  

  • Quality of Service

Slurm  

  • A cluster workload management package that integrates the scheduling, managing, monitoring, and reporting of cluster workloads

Fairshare  

  • Slurm's fairshare mechanism incorporates historical resource utilization information into job feasibility and priority decisions. It allows organizations to set system utilization targets for users, groups, accounts, classes, and QoS levels.

Reservations

  • Under Slurm, a reservation is a method of setting aside resources or time for use by members of an access control list. Reservations function much like traditional queues in that resources are targeted toward particular functions, but with greater granularity and flexibility.

Check-pointing 

  • Saving the program state, usually to stable storage, so that it can be reconstructed at a later point in time

Details

Queuing

ARCC will use Slurm to manage Beartooth. Beartooth's compute resources will be defined as one large queue. From there, ARCC will use Slurm's fairshare, reservations, and prioritization functionality to control Beartooth's resource utilization.
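
As a point of reference, the state of the shared pool and a user's own queued jobs can be inspected with Slurm's standard utilities; a minimal example is shown below, and the exact output depends on how the cluster is configured.

    # list partitions and node states in the shared pool
    sinfo

    # list your own pending and running jobs
    squeue -u $USER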

Reservations will be defined for communal and individual invested users. Communal users will have access control settings that will provide preferential access to the communal reservation. Likewise, invested users will have preferential access to purchased resource levels. By default, all reservations will be shared.
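
For illustration, a job can be directed at a specific reservation with Slurm's --reservation option. This is a minimal sketch: the job name, reservation name, and executable below are hypothetical, and actual reservation names are assigned by ARCC.

    #!/bin/bash
    #SBATCH --job-name=my_analysis        # hypothetical job name
    #SBATCH --reservation=profx_nodes     # hypothetical reservation name assigned by ARCC
    #SBATCH --nodes=2
    #SBATCH --time=2-00:00:00             # 2 days, well within the 7-day wall-clock limit

    srun ./my_program                     # hypothetical executable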

Prioritization

Slurm will track resource utilization based on a job's actual consumption of resources and update fairshare resource utilization statistics. These statistics will influence the priority of subsequent jobs submitted by a user. Greater utilization of resources reduces the priority of follow-on jobs submitted by a particular user.

Priority decreases with resource (time and compute) usage. Priority increases or "recovers" over time.
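
As an illustration, Slurm's sshare and sprio utilities can be used to check a user's current fairshare standing and the priority factors applied to pending jobs; the fields shown depend on the cluster's configuration.

    # show fairshare usage and the resulting fairshare factor for your user
    sshare -u $USER

    # show the priority factors (age, fairshare, etc.) for your pending jobs
    sprio -u $USER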

Job Preemption

Guest jobs running on invested nodes will be preempted when necessary to provide resources to a job submitted by the owner of that reservation. Slurm will wait 5 minutes after the invested user's job has been submitted before terminating a guest job. Preempted jobs are automatically re-queued if the re-queue job flag was used.
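
For example, a guest job can opt into automatic re-queueing by adding Slurm's --requeue flag to its batch script, so the job is resubmitted rather than discarded if it is preempted:

    #SBATCH --requeue    # re-queue this job automatically if it is preempted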

Check-Pointing

Because of the massive resource overhead involved in OS- or cluster-level check-pointing, ARCC will not offer check-pointing. However, users are strongly encouraged to build check-pointing into their own code. This may affect code performance but provides a safety net.
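
As a minimal sketch of application-level check-pointing, a batch script can resume from a previously written state file, assuming the application itself periodically saves its state; my_program, its options, and state.chk are hypothetical.

    #!/bin/bash
    #SBATCH --requeue                     # re-queue if preempted
    #SBATCH --time=7-00:00:00             # maximum allowed wall clock

    # Resume from the last checkpoint file if one exists; otherwise start fresh.
    # The application is assumed to write state.chk periodically while it runs.
    if [ -f state.chk ]; then
        srun ./my_program --resume-from state.chk
    else
        srun ./my_program
    fi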

Job Submittal Options/Limitations

  • Users who wish to restrict their job to run within a reservation to which they have preferential access may do so. This may result in an extended scheduling wait time.

  • Users who need more resources than are available in any one reservation are welcome to use available resources in other reservations but need to be aware that the job runs the risk of being terminated without warning. Users who choose this option are encouraged to use check-pointing.

  • Users with high demand, short-duration jobs are encouraged to coordinate with other research groups to acquire unrestricted access to their reservations.

  • Arrangements can be made for users who want the entirety of the cluster for very short duration jobs.

  • When submitting jobs, please note that jobs are restricted to a 7-day wall clock.
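
Putting these options together, a submission that stays within the 7-day wall clock and opts into re-queueing might look like the following minimal sketch; the job name, node count, and executable are hypothetical.

    #!/bin/bash
    #SBATCH --job-name=guest_run          # hypothetical job name
    #SBATCH --nodes=4
    #SBATCH --time=7-00:00:00             # maximum allowed wall clock (7 days)
    #SBATCH --requeue                     # re-queue if preempted while on invested nodes

    srun ./my_program                     # hypothetical executable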

Example Scenarios

Please be aware that the scenarios below are over-simplified, often glossing over some variables in order to illustrate the spotlighted situation. Some aspects are exaggerated for effect.

Job Scheduling and Termination

Professor X has purchased six nodes; as a result, she has a six-node reservation that may include any six nodes with the same performance characteristics as the nodes she purchased. One day Professor Y has a pair of guest jobs running on two of X's nodes. Professor X, who already has a job running on three of her nodes, then launches another job that requires four more nodes. Only one node of Professor X's reservation is available. Terminating Professor Y's two jobs won't free up enough space, so Slurm looks at the communal reservation and finds three available nodes. These three plus the one remaining in X's reservation meet X's need. Slurm allocates the resources and the job starts running.

Later Professor X launches another job that requires one node. Slurm again checks her reservation and finds that terminating one of Professor Y's jobs will free up sufficient resources for the job. Slurm kills one of Professor Y's jobs and allows Professor X's new job to start. Professor Y's job is re-queued, but since it was not check-pointed, it must start over from the beginning when it is once again allowed to run.

Job Priority

Dr. Zed uses fifteen of the twenty nodes in the communal pool for a big, three-day job. When the job completes, he looks at the data and immediately submits another job of similar size, but has to wait for the job to be scheduled because the first job reduced his priority below that of most of the other users of the communal pool. Several days pass during which multiple small jobs are submitted and run ahead of Dr. Zed's next big job.

Student Fred has been running a job on one node for four weeks, and when it finally completes, Fred's priority has dropped below that of Dr. Zed. One node isn't enough for Dr. Zed's job, so now both Fred and Zed are waiting. One day later, Zed's priority has increased while the priorities of other users have decreased to the point where Dr. Zed has top priority. Unfortunately, there aren't enough nodes available for his job, so he still has to wait. However, because Dr. Zed has top priority, no other jobs are scheduled in front of him. Two days later, enough nodes have become available for Dr. Zed's job. He's happy to see his job start and, when it completes three days later, Dr. Zed is back at the bottom of the priority food chain.