...

Frequently Asked Questions

...

Job Scheduling on ARCC HPC Systems

This section reflects the general ARCC policy for scheduling jobs on all HPC systems administered by ARCC. Because the purpose of each HPC system differs, please refer to the specific system below for the policies particular to that system.

Beartooth

Overview

This policy reflects the ARCC policy for scheduling jobs on Beartooth specifically. Beartooth does not offer the traditional relationship between users and queues. Instead, it provides a single, all-encompassing pool of nodes and regulates usage through node reservations and job prioritization.

...

  • saving the program state, usually to stable storage, so that it can be reconstructed at a later time

Details

Queuing

ARCC will use Slurm to manage Beartooth. Beartooth's compute resources will be defined as one large queue. From there, ARCC will use Slurm's fairshare, reservation, and prioritization functionality to control Beartooth's resource utilization.

Reservations will be defined for communal users and for individual invested users. Communal users will have access-control settings that provide preferential access to the communal reservation; likewise, invested users will have preferential access to the resource levels they have purchased. By default, all reservations will be shared.
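
As an illustration, a user with preferential access to a reservation can direct a job into it with Slurm's standard --reservation option. This is a sketch only: the reservation name and executable below are hypothetical, and actual names will depend on how ARCC defines the reservations.

    #!/bin/bash
    #SBATCH --job-name=res-demo
    #SBATCH --reservation=my_lab_res   # hypothetical reservation name
    #SBATCH --nodes=1
    #SBATCH --time=01:00:00

    srun ./my_program                  # hypothetical executable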

Prioritization

Slurm will track each job's actual consumption of resources and update the fairshare utilization statistics accordingly. These statistics influence the priority of subsequent jobs submitted by the same user: the more resources a user consumes, the lower the priority of that user's follow-on jobs.

Priority decreases with resource (time and compute) usage. Priority increases or "recovers" over time.
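
For example, a user can check their fairshare standing and the priority of their pending jobs with Slurm's stock reporting utilities; the exact columns shown depend on the site's Slurm configuration.

    sshare -u $USER   # fairshare usage and effective share for your account
    sprio -u $USER    # priority factors computed for your pending jobs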

Job Preemption

Guest jobs running on invested nodes will be preempted when necessary to provide resources to a job submitted by the owner of that reservation. Slurm will wait 5 minutes after the invested user's job is submitted before terminating the guest job. Preempted jobs are automatically re-queued if the job was submitted with the re-queue flag.
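
A guest job that can tolerate a restart can opt in to automatic re-queueing with sbatch's standard --requeue flag, sketched below; the executable is a placeholder and should itself be restart-safe (see Check-Pointing).

    #!/bin/bash
    #SBATCH --requeue         # re-queue this job automatically if it is preempted
    #SBATCH --time=04:00:00

    srun ./my_program         # hypothetical executable; should tolerate restarts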

Check-Pointing

Because of the massive resource overhead involved in OS- or cluster-level check-pointing, ARCC will not offer check-pointing. However, users are strongly encouraged to build check-pointing into their own code. This may affect code performance but provides a safety net.
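
A minimal sketch of application-level check-pointing, assuming an iterative workload: progress is written to a small state file on stable storage after each step, so a restarted (for example, re-queued) job resumes from the last completed step instead of from the beginning. The file name and step count are placeholders.

    #!/bin/bash
    CKPT=state.ckpt                          # hypothetical checkpoint file
    step=0
    [ -f "$CKPT" ] && step=$(cat "$CKPT")    # resume from the last saved step

    while [ "$step" -lt 100 ]; do
        # ... one unit of real work goes here ...
        step=$((step + 1))
        echo "$step" > "$CKPT"               # record progress after each step
    done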

Job Submittal Options/Limitations

  • Users who wish to restrict their job to run within a reservation to which they have preferential access may do so. This may result in an extended scheduling wait time.

  • Users who need more resources than are available in any one reservation are welcome to use available resources in other reservations, but should be aware that such jobs run the risk of being terminated without warning. Users who choose this option are encouraged to use check-pointing.

  • Users with high demand, short-duration jobs are encouraged to coordinate with other research groups to acquire unrestricted access to their reservations.

  • Arrangements can be made for users who want the entirety of the cluster for very short-duration jobs.

  • When submitting jobs, please note that jobs are restricted to a 7-day wall-clock limit (see the sketch after this list).
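
As a hedged illustration of the wall-clock limit, the job script below requests the maximum 7 days in Slurm's days-hours:minutes:seconds time format; the resource request and executable are placeholders.

    #!/bin/bash
    #SBATCH --time=7-00:00:00   # 7-day wall-clock limit (days-hours:minutes:seconds)
    #SBATCH --nodes=2           # placeholder resource request

    srun ./my_program           # hypothetical executable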

Example Scenarios

Please be aware that the scenarios below are over-simplified, often glossing over some variables in order to illustrate the spotlighted situation. Some aspects are exaggerated for effect.

...