...

Account Policy

HPC/HPS Accounts

Overview

HPC/HPS accounts are available for all University faculty, staff and students for the purpose of research.

Account Sponsorship by a PI

  • All accounts must be sponsored by a University of Wyoming Principal Investigator (PI). Sponsorship gives permission to a user to use the HPC/HPS resource allocations of the sponsor.

  • PIs can sponsor multiple users. Users can be sponsored by multiple PIs.

  • HPC/HPS resource allocations are granted to PIs. Users utilize resources from their sponsor's allocations.

General Terms of Use

The following conditions apply to all account types. Additional details on how the different account types work can be found elsewhere on this page.

  • All HPC accounts are for academic purposes only.

  • Commercial activities are prohibited.

  • Password sharing and all other forms of account sharing are prohibited.

  • Account-holders found to be circumventing the system policies and procedures will have their accounts locked or removed.

Account requests

All HPC accounts can be requested through the ARCC Access Request Form. Note that all requests for creating projects and adding users to a project must be made by the project PI.


For questions about HPC account procedures not addressed below, please contact ARCC.

Definitions

  • Principal Investigator (PI) Account: account of a faculty member who has an extended-term position with UW (e.g. not Adjunct Faculty)

  • Sponsored Researcher: a member (e.g. Student/Graduate Assistant, Faculty, or Researcher from another institution) of a research project for which a UW faculty member is the PI

  • System Account: account of a staff member who has a permanent relationship with UW

Account Types

PI Accounts

PI accounts are for individual PIs only. These accounts are for research only and are not to be shared with anyone else. These accounts are subject to periodic review and can be deleted if the account holders change their University affiliation or fail to comply with UW and ARCC account policies.

Sponsored Accounts

PIs may sponsor any number of accounts, but these accounts must be used for research only. UW faculty are responsible for all of their sponsored account users. These accounts are subject to periodic review and will be deleted if the sponsoring faculty or the account holders change their University affiliation or fail to comply with UW and ARCC account policies.

Instructional Accounts

PIs may sponsor HPC accounts and projects for instructional purposes on the ARCC systems by submitting a request through the ARCC Access Request Form. Instructional requests are subject to denial only when the proposed use is inappropriate for the systems and/or when the instructional course would require resources that exceed available capacity on the systems or substantially interfere with research computations. HPC accounts for instructional purposes will be added by the Sponsor into a separate group created with the 'class group' designation. Class group membership is to be sponsored for one semester and the Sponsor will remove the group at the end of the semester. Class/Instructional group jobs should only be submitted to the 'class' queue, which will be equivalent in priority to the 'windfall' queue, and only available on the appropriate nodes of the ARCC systems.

System Accounts

System accounts are for staff members who have a permanent relationship with UW and are responsible for system administration.

Account Lifecycle

Account Creation

ARCC HPC/HPS accounts will be created to match existing UWYO accounts whenever possible. PIs may request accounts for existing projects/allocations or courses.

Account Renewal

Account Transfer

A PI who is leaving the project or the University can request that their project be transferred to a new PI. Non-PI accounts can also be transferred from one PI's sponsorship to another as students move from working with one researcher to another. Account transfer requests can be made by contacting the Help Desk (766-4357).

Account Termination

The VP of Research, the UW CIO, and the University Provost comprise the University of Wyoming's Research Computing Executive Steering Committee (UW-ESC). The UW-ESC governs the termination of Research Computing accounts, following other University policies as needed. Non-PI accounts may be terminated at the request of the UW-ESC. Any users found in violation of this Research Computing Allocation Policy or any other policies may have their account access suspended for review by the Director of Research Support, IT and the UW-ESC.

Job Scheduling Policy


Job Scheduling on ARCC HPC Systems

This section reflects the general ARCC policy for scheduling jobs on all HPC systems administered by ARCC. Since the purpose of HPC systems varies from system to system, please refer to the specific system below for policies particular to that system.

Teton

Overview

This policy reflects the ARCC policy for scheduling jobs on Teton, specifically. Teton does not offer the traditional relationship between users and queues. Rather, Teton offers one all-encompassing pool of nodes and regulates usage using node reservations and job prioritization.

Definitions/descriptions

QoS  

  • Quality of Service

Slurm  

  • a cluster workload management package that integrates the scheduling, managing, monitoring, and reporting of cluster workloads

Fairshare  

  • Slurm's implementation of fairshare is a mechanism that allows historical resource utilization information to be incorporated into job feasibility and priority decisions. Slurm's fairshare implementation allows organizations to set system utilization targets for users, groups, accounts, classes, and QoS levels.

Reservations

  • Under Slurm, a reservation is a method of setting aside resources or time for use by members of an access control list. Reservations function much like traditional queues in that resources are targeted toward particular functions, but with greater granularity and flexibility.

Check-pointing 

  • saving the program state, usually to stable storage, so that it may be reconstructed later in time

Details

Queuing

ARCC will use Slurm to manage Teton. Teton's compute resources will be defined as one large queue. From there ARCC will use Slurm's fairshare, reservations, and prioritization functionality to control Teton's resource utilization.

Reservations will be defined for communal and individual invested users. Communal users will have access control settings that will provide preferential access to the communal reservation. Likewise, invested users will have preferential access to purchased resource levels. By default, all reservations will be shared.

Prioritization

Slurm will track resource utilization based on a job's actual consumption of resources and update fairshare resource utilization statistics. These statistics will influence the priority of subsequent jobs submitted by a user. Greater utilization of resources reduces the priority of follow-on jobs submitted by a particular user.

Priority decreases with resource (time and compute) usage. Priority increases or "recovers" over time.
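Assuming standard Slurm client tools are available on the login nodes, users could inspect their own fairshare standing and the priority of their pending jobs with commands like the following (illustrative only; output formats and available columns vary by Slurm version):

```shell
# Illustrative only: requires a Slurm cluster, so run these on a login node.

# Show this user's historical usage and resulting fairshare factor.
sshare -u "$USER"

# Show the priority components (age, fairshare, etc.) of this user's pending jobs.
sprio -u "$USER"
```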

Job Preemption

Guest jobs running on a reservation will be preempted when necessary to provide resources to a job submitted by the owner of that reservation. Slurm will wait 5 minutes after an invested user submits a job before terminating a guest job. Preempted jobs are automatically re-queued.

Check-Pointing

Because of the massive resource overhead involved in OS or cluster level checkpointing, ARCC won't offer check-pointing. However, users are strongly encouraged to build check-pointing into their own code. This may affect code performance but will provide a safety-net.
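Application-level check-pointing can be as simple as recording progress to a file so that a re-queued job resumes where it left off rather than starting over. A minimal shell sketch (the step count, file name, and loop body are illustrative placeholders, not ARCC requirements):

```shell
# Hypothetical check-pointing sketch: resume an iterative computation
# from a small state file instead of restarting from step 0.
STATE_FILE=checkpoint.dat

# Load the last completed step, or start at 0 if no checkpoint exists.
if [ -f "$STATE_FILE" ]; then
    step=$(cat "$STATE_FILE")
else
    step=0
fi

while [ "$step" -lt 10 ]; do
    # ... one unit of real work would happen here ...
    step=$((step + 1))
    # Persist progress so a preempted and re-queued job can resume here.
    echo "$step" > "$STATE_FILE"
done

echo "done at step $step"
```

If this job is preempted partway through, the re-queued run reads the state file and skips the already-completed steps.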

Job Submittal Options/Limitations

  • Users may restrict their jobs to run within a reservation to which they have preferential access. This may result in an extended scheduling wait time.

  • Users who need more resources than are available in any one reservation are welcome to use available resources in other reservations but need to be aware that the job runs the risk of being terminated without warning. Users who choose this option are encouraged to use check-pointing.

  • Users with high demand, short-duration jobs are encouraged to coordinate with other research groups to acquire unrestricted access to their reservations.

  • Arrangements can be made for users who want the entirety of the cluster for very short duration jobs.

  • When submitting jobs, please note that jobs are restricted to a 7-day wall clock.
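The options above might look like the following in a Slurm batch script. This is a hedged sketch: the reservation name, node count, and program are placeholders, not actual ARCC settings, and the exact directives accepted depend on the cluster configuration.

```shell
#!/bin/bash
# Hypothetical batch script; "myreservation" and ./my_program are placeholders.

#SBATCH --job-name=example
#SBATCH --nodes=2
#SBATCH --time=7-00:00:00            # jobs are restricted to a 7-day wall clock
#SBATCH --reservation=myreservation  # restrict the job to a reservation you can access
#SBATCH --requeue                    # allow the job to be re-queued if preempted

srun ./my_program
```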

Example Scenarios

Please be aware that the scenarios below are over-simplified, often glossing over some variables in order to illustrate the spotlighted situation. Some aspects are exaggerated for effect.

Job Scheduling and Termination

Professor X has purchased six nodes; as a result, she has a six-node reservation that may include any six nodes with the same performance stats as the nodes she purchased. One day Professor Y has a pair of guest jobs running on two of X's nodes. Professor X, who already has a job running on three of her nodes, then launches another job that requires four nodes. Only one node of Professor X's reservation is available. Terminating Professor Y's two jobs won't free up enough space, so Slurm looks at the communal reservation and finds three available nodes. These three plus the one remaining in X's reservation meet X's need. Slurm allocates the resources and the job starts running.

Later Professor X launches another job that requires one node. Slurm again checks her reservation and finds that terminating one of Professor Y's jobs will free up sufficient resources for the job. Slurm kills one of Professor Y's jobs and allows Professor X's new job to start. Professor Y's job is re-queued, but since it was not check-pointed, it must start over from the beginning when it is once again allowed to run.

Job Priority

Dr. Zed uses fifteen of the twenty nodes in the communal pool for a big, three-day job. When the job completes, he looks at the data and immediately submits another job of similar size but has to wait for the job to be scheduled because the first job reduced his priority below that of most of the other users of the communal pool. Several days pass during which multiple small jobs are submitted and run ahead of Dr. Zed's next big job.

Student Fred has been running a job on one node for four weeks and when it finally completes, Fred's priority has dropped below that of Dr. Zed. One node isn't enough for Dr. Zed's job, so now both Fred and Zed are waiting. One day later Zed's priority has increased while the priorities of other users have decreased to the point where Dr. Zed has top priority. Unfortunately, there aren't enough nodes available for his job so he still has to wait. However, because Dr. Zed has top priority no other jobs are scheduled in front of him. Two days later enough nodes have become available for Dr. Zed's job. He's happy to see his job start and, when it completes three days later, Dr. Zed is back at the bottom of the priority food chain.

...