
This page explains ARCC policies for high-performance computing.




These policies and procedures are intended to ensure that ARCC HPC facilities are fairly shared, effectively used, and support the University of Wyoming's research programs that rely on computational facilities not available elsewhere at the University.

Definitions

Cluster

  • an assembly of computational hardware designed and configured to function together as a single system, much the way neurons work together to form a brain

Condo  

  • a computational resource that is shared among many users — condo compute resources are used simultaneously by multiple users

HPC

  • high-performance computing generally refers to systems that perform parallel processing at a level above a teraflop, or 10¹² floating-point operations per second

HPS

  • high-performance storage system, usually a tiered system with media covering a range of speeds to optimize performance while reducing cost

Customer 

  • a person or group to whom ARCC provides a service

General Policies

For policies that apply to all ARCC resources, see ARCC Policies.

Usage of Login Nodes


The login nodes are provided for authorized users to access the Teton cluster.

  • They are intended for setting up and submitting jobs, accessing results from jobs, transferring data to/from the cluster, and similar light work. As a courtesy to your colleagues, refrain from running anything compute-intensive (any task that uses 100% of a CPU), any long-running task (over 10 minutes), or large numbers of tasks with a similar footprint on these nodes, as doing so interferes with others' ability to use the node resources. Compute-intensive tasks should be submitted as jobs to the compute nodes; that is what the compute nodes are for.

  • Short compilations of code are permissible. If you are doing a highly parallel or lengthy compilation, consider requesting an interactive job and compiling there as a courtesy to your colleagues (see the sketch below).

  • Compute-intensive calculations are NOT allowed on the login nodes. If system staff find such jobs running, they will be killed without prior notification.

Tasks violating these rules will be terminated immediately and the owner warned; continued violations may result in suspension of access to the cluster(s). Access will not be restored until the ARCC director receives a written request from the user's PI.

Do NOT run compute-intensive tasks, long-running tasks, or large numbers of tasks on the login nodes.
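As an illustration of how to keep such work off the login nodes, compilation and compute tasks can be moved onto the compute nodes with standard Slurm commands. The sketch below is illustrative only; the account name, resource values, and application name are placeholders, not ARCC-prescribed settings.

```
# Request an interactive session on a compute node for a long or parallel build
# (account name and resource values below are placeholders).
srun --account=<your-project> --time=01:00:00 --ntasks=4 --pty /bin/bash
make -j 4      # build on the compute node, not the login node
exit

# Wrap compute-intensive work in a batch script and submit it to the compute nodes.
cat > run_analysis.sbatch <<'EOF'
#!/bin/bash
#SBATCH --account=<your-project>
#SBATCH --time=02:00:00
#SBATCH --nodes=1
#SBATCH --ntasks=16

srun ./my_analysis    # replace with your actual application
EOF
sbatch run_analysis.sbatch
```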

Account Policy


HPC/HPS Accounts

Overview

HPC/HPS accounts are available for all University faculty, staff, and students for the purpose of research.

Account Sponsorship by a PI

  • All accounts must be sponsored by a University of Wyoming Principal Investigator (PI). Sponsorship gives permission to a user to use the HPC/HPS resource allocations of the sponsor.

  • PIs can sponsor multiple users. Users can be sponsored by multiple PIs.

  • HPC/HPS resource allocations are granted to PIs. Users utilize resources from their sponsor's allocations.

General Terms of Use

The following conditions apply to all account types. Additional details on how the different account types work can be found elsewhere on this page.

  • All HPC accounts are for academic purposes only.

  • Commercial activities are prohibited.

  • Password sharing and all other forms of account sharing are prohibited.

  • Account-holders found to be circumventing the system policies and procedures will have their accounts locked or removed.

Account requests

All HPC accounts can be requested through the ARCC Access Request Form. Note that all requests, whether for creating projects or for adding users to a project, must be made by the project PI.


For questions about HPC account procedures not addressed below, please contact ARCC.

Definitions

Principal Investigator (PI) Account

  • account of a faculty member who has an extended-term position with UW (e.g. not Adjunct Faculty)

Sponsored Researcher

  • a member (e.g. Student/Graduate Assistant, Faculty, or Researcher from another Institution) of a research project for which a UW faculty member is the PI

System Account

  • account of a staff member who has a permanent relationship with UW

Account Types

PI Accounts

PI accounts are for individual PIs only. These accounts are for research only and are not to be shared with anyone else. These accounts are subject to periodic review and can be deleted if the account holders change their University affiliation or fail to comply with UW and ARCC account policies.

Sponsored Accounts

PIs may sponsor any number of accounts, but these accounts must be used for research only. UW faculty are responsible for all of their sponsored account users. These accounts are subject to periodic review and will be deleted if the sponsoring faculty or the account holders change their University affiliation or fail to comply with UW and ARCC account policies.

Instructional Accounts

PIs may sponsor HPC accounts and projects for instructional purposes on the ARCC systems by submitting a request through the ARCC Access Request Form. Instructional requests are subject to denial only when the proposed use is inappropriate for the systems and/or when the instructional course would require resources that exceed available capacity on the systems or substantially interfere with research computations. HPC accounts for instructional purposes will be added by the Sponsor into a separate group created with the 'class group' designation. Class group membership is to be sponsored for one semester and the Sponsor will remove the group at the end of the semester. Class/Instructional group jobs should only be submitted to the 'class' queue, which will be equivalent in priority to the 'windfall' queue, and only available on the appropriate nodes of the ARCC systems.

System Accounts

System accounts are for staff members who have a permanent relationship with UW and are responsible for system administration.

Account Lifecycle

Account Creation

ARCC HPC/HPS accounts will be created to match existing UWYO accounts whenever possible. PIs may request accounts for existing projects/allocations or courses.

Account Renewal

Account Transfer

A PI who is leaving the project or the University can request that their project be transferred to a new PI. Non-PI accounts can be transferred from one PI's zone of control to another as students move from working with one researcher to another. Account transfer requests can also be made by contacting the Help Desk (766-4357).

Account Termination

The VP of Research, the UW CIO, and the University Provost comprise the University of Wyoming's Research Computing Executive Steering Committee (UW-ESC). The UW-ESC will govern the termination of Research Computing accounts, following other University policies as needed. Non-PI accounts may be terminated at the request of the UW-ESC. Any users found in violation of this Research Computing Allocation Policy or any other UW policies may have their account access suspended for review by the Director of Research Support, IT, and the UW-ESC.

Job Scheduling Policy


Job Scheduling on ARCC HPC Systems

This section reflects the general ARCC policy for scheduling jobs on all HPC systems administered by ARCC. Since the purpose of HPC systems varies from system to system, please refer to the specific system below for policies particular to that system.

Teton

Overview

This policy reflects the ARCC policy for scheduling jobs on Teton specifically. Teton does not offer the traditional relationship between users and queues. Rather, Teton offers a single, all-encompassing pool of nodes and regulates usage through node reservations and job prioritization.

Definitions/descriptions

QoS  

  • Quality of Service

Slurm  

  • a cluster workload management package that integrates the scheduling, managing, monitoring, and reporting of cluster workloads

Fairshare  

  • Slurm's implementation of fairshare is a mechanism that allows historical resource utilization information to be incorporated into job feasibility and priority decisions. Slurm's fairshare implementation allows organizations to set system utilization targets for users, groups, accounts, classes, and QoS levels.

Reservations

  • under Slurm, a reservation is a method of setting aside resources or time for utilization by members of an access control list. Reservations function much like traditional queues in that resources are targeted toward particular functions, but with greater granularity and flexibility.

Check-pointing 

  • saving the program state, usually to stable storage, so that it may be reconstructed later in time

Details

Queuing

ARCC will use Slurm to manage Teton. Teton's compute resources will be defined as one large queue. From there ARCC will use Slurm's fairshare, reservations, and prioritization functionality to control Teton's resource utilization.

Reservations will be defined for communal and individual invested users. Communal users will have access control settings that will provide preferential access to the communal reservation. Likewise, invested users will have preferential access to purchased resource levels. By default, all reservations will be shared.

Prioritization

Slurm will track resource utilization based on a job's actual consumption of resources and update fairshare resource utilization statistics. These statistics will influence the priority of subsequent jobs submitted by a user. Greater utilization of resources reduces the priority of follow-on jobs submitted by a particular user.

Priority decreases with resource (time and compute) usage. Priority increases or "recovers" over time.
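As a hedged illustration, Slurm's standard reporting tools let users see how past usage is affecting their standing; the exact priority factors and account names displayed depend on how Teton is configured:

```
# Show your fairshare usage and normalized shares (output depends on your project accounts).
sshare -u $USER

# Show the priority factors (fairshare, age, etc.) contributing to your pending jobs.
sprio -u $USER

# List your jobs sorted by descending priority.
squeue -u $USER --sort=-p
```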

Job Preemption

Guest jobs running on a reservation will be preempted when necessary to provide resources to a job submitted by the owner of that reservation. Slurm will wait 5 minutes after an invested user submits a job before terminating a guest job. Preempted jobs are automatically re-queued.
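A minimal sketch of a guest batch script that tolerates preemption gracefully, using standard Slurm directives (the account name and application are placeholders):

```
#!/bin/bash
#SBATCH --account=<guest-project>   # placeholder account name
#SBATCH --time=08:00:00
#SBATCH --requeue                   # allow Slurm to re-queue this job if it is preempted
#SBATCH --open-mode=append          # append to, rather than truncate, the output file on restart
#SBATCH --output=guest_job_%j.out

srun ./my_simulation                # replace with your actual application
```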

Check-Pointing

Because of the massive resource overhead involved in OS- or cluster-level checkpointing, ARCC won't offer check-pointing. However, users are strongly encouraged to build check-pointing into their own code. This may affect code performance but provides a safety net.
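Application-level check-pointing can be as simple as periodically writing state to a file and resuming from it when the job restarts. The sketch below illustrates the idea under the assumption that your application supports checkpoint/restart flags; the file name and flags are hypothetical, not an ARCC-provided interface.

```
#!/bin/bash
#SBATCH --time=24:00:00
#SBATCH --requeue                   # let the job restart automatically if preempted

CKPT=checkpoint.dat                 # hypothetical checkpoint file written by your application

if [ -f "$CKPT" ]; then
    # A checkpoint exists: resume from the last saved state.
    srun ./my_app --restart "$CKPT"
else
    # No checkpoint yet: start fresh and write checkpoints periodically.
    srun ./my_app --checkpoint "$CKPT"
fi
```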

Job Submittal Options/Limitations

  • Users who wish to restrict their jobs to run within a reservation to which they have preferential access may do so (see the sketch after this list). This may result in an extended scheduling wait time.

  • Users who need more resources than are available in any one reservation are welcome to use available resources in other reservations, but should be aware that such jobs run the risk of being terminated without warning. Users who choose this option are encouraged to use check-pointing.

  • Users with high-demand, short-duration jobs are encouraged to coordinate with other research groups to acquire unrestricted access to their reservations.

  • Arrangements can be made for users who want the entirety of the cluster for very short duration jobs.

  • When submitting jobs, please note that jobs are restricted to a 7-day wall clock.
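For example, a job can be pinned to a specific reservation and kept within the 7-day wall-clock limit using standard Slurm directives (the reservation name and resource values are placeholders):

```
#!/bin/bash
#SBATCH --reservation=<your-reservation>   # placeholder: a reservation you have preferential access to
#SBATCH --time=7-00:00:00                  # the 7-day wall-clock maximum
#SBATCH --nodes=4
#SBATCH --ntasks-per-node=16

srun ./my_parallel_job                     # replace with your actual application
```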

Example Scenarios

Please be aware that the scenarios below are over-simplified, often glossing over some variables in order to illustrate the spotlighted situation. Some aspects are exaggerated for effect.

Job Scheduling and Termination

Professor X has purchased six nodes; as a result, she has a six-node reservation that may include any six nodes with the same performance specifications as the nodes she purchased. One day Professor Y has a pair of guest jobs running on two of X's nodes. Professor X already has a job running on three of her nodes and then launches another job that requires four more nodes. Only one node of Professor X's reservation is available. Terminating Professor Y's two jobs won't free up enough space, so Slurm looks at the communal reservation and finds three available nodes. These three plus the one remaining in X's reservation meet X's need. Slurm allocates the resources and the job starts running.

Later Professor X launches another job that requires one node. Slurm again checks her reservation and finds that terminating one of Professor Y's jobs will free up sufficient resources for the job. Slurm kills one of Professor Y's jobs and allows Professor X's new job to start. Professor Y's job is re-queued, but since it was not check-pointed, it must start over from the beginning when it is once again allowed to run.

Job Priority

Dr. Zed uses fifteen of the twenty nodes in the communal pool for a big, three-day job. When the job completes, he looks at the data and immediately submits another job of similar size but has to wait for it to be scheduled because the first job reduced his priority below that of most of the other users of the communal pool. Several days pass during which multiple small jobs are submitted and run ahead of Dr. Zed's next big job.

Student Fred has been running a job on one node for four weeks, and when it finally completes, Fred's priority has dropped below that of Dr. Zed. One node isn't enough for Dr. Zed's job, so now both Fred and Zed are waiting. One day later, Zed's priority has increased while the priorities of other users have decreased to the point where Dr. Zed has top priority. Unfortunately, there aren't enough nodes available for his job, so he still has to wait. However, because Dr. Zed has top priority, no other jobs are scheduled in front of him. Two days later, enough nodes have become available for Dr. Zed's job. He's happy to see his job start and, when it completes three days later, Dr. Zed is back at the bottom of the priority food chain.

Software Policy


Software Acquisition, Installation, and Support (AIS) Policy

Overview

This document defines the ARCC's software policy regarding software acquisition, installation, and support. ARCC will help UW users through:

  • Consultations on software capabilities and acquisition expertise

  • Guidance with respect to software installation and deployment, and secured/controlled access

  • Support with respect to the efficient and effective functioning of software (where not provided by software vendor)

  • Software updates to maintain currency with the vendors and ARCC systems

General software usage, functionality, and application questions should be addressed by the research computing community. To support the community support model, ARCC will make available a community-driven Known Error Database in the form of a Wiki with discussion board features.

The ARCC will review the software policy annually during the Change Advisory Board meeting with input from the Faculty Advisory Committee (FAC) members.

Software Requests

The ARCC will maintain a list of currently supported software for each HPC system that ARCC supports. Supported software is evaluated biannually. The ARCC reserves the right to discontinue support for underutilized software. Software proposed for discontinued support will be placed in a "deprecated" section of the software modules interface. Movement of applications to a deprecated status will happen twice per year during system upgrades. Faculty can submit a request for continued software support of deprecated applications using the ARCC Resource Request Form.
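From a user's perspective, this typically looks like the following environment-modules workflow. The specific module names and the exact label of the deprecated section are assumptions about how ARCC organizes its module tree, not documented names:

```
# List the software modules currently available on the system.
module avail

# Load a supported application (the name/version here is illustrative).
module load gcc/7.3.0

# Deprecated applications would appear under a separate "deprecated" area of the
# module listing and remain loadable until support is removed (hypothetical path).
module load deprecated/legacy-app
```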

New software requests must be initiated by a faculty member via the ARCC Resource Request Form. The information on this form (e.g. estimates on the number of users, software and licensing details, software cost, and the timeline for installation) will be utilized to determine the degree of support provided by the ARCC.

Software Acquisition

Financial Support

Whenever possible, researchers are encouraged to use open-source software. The costs for discipline-specific, proprietary software will typically be borne by the PIs requesting the software. A small number of one-time 'seed funding' opportunities may be made available with input from the FAC. Financial support for the software is dependent on the user base, as follows:

  • Single PI or department: The license for the software is purchased by the user and his/her department. ARCC will handle acquisition, installation, and support on ARCC resources.

  • Multiple PIs or departments in a single College: The license for the software is purchased by the college. ARCC will handle acquisition, installation, and support on ARCC resources.

  • Broad Use (Multiple Colleges): If the software (e.g. MatLab or Mathematica) serves both significant academic and research uses, then funding will be reviewed by UW-IT. If the software serves significant research uses only, then the ARCC will review funding. ARCC will handle acquisition, installation, and support on ARCC resources.

Technical Support

ARCC staff will provide support in identifying efficient and cost-effective software packages to meet the research objectives of faculty members. Where suitable, the ARCC will coordinate software acquisitions among faculty with similar objectives to better leverage research investment.

Software Installation

Licensing Support

The ARCC, with the help of UW-IT and UW General Counsel, will help provide guidance on research software licensing and suitable installation/controlled access.

Technical Support

ARCC staff will provide software installation services for software to be run on ARCC resources. Department IT consultants will help with the installation of software on researchers' workstations. The ARCC will also provide centralized license server services for software on ARCC resources when needed.

Software Support

  • Staff will provide software support with respect to efficient and effective software functions (where not provided by the software vendor). The ARCC will also oversee and implement software updates to maintain currency with both the software vendor and ARCC resources.

  • Debuggers will be provided as part of the Moran software suite.

  • ARCC will endeavor to facilitate code development.

Storage Policy


Overview

Code-named Teton, the ARCC high-performance storage system (HPS) is a high-speed, tiered storage system designed to maximize performance while minimizing cost. Teton is intended for storing data that is actively being used.

The following policies discuss the use of this space. In general, the disk space is intended for support of research using the cluster, and as a courtesy to other users of the cluster, you should try to delete any files that are no longer needed or being used.

All data on the HPS are considered to be related to your research and not of a personal nature. As such, all data are considered to be owned by the principal investigator for the allocation through which you have access to the cluster.

Teton is for the support of active research using the clusters. You should promptly remove data files from the cluster when you are no longer actively working on the computations requiring them. This ensures that all users can avail themselves of these resources.

Note: None of the Teton file systems are backed up. We do data replication within the file system in order to minimize the loss of data in case of a system fault or failure.

Storage Allocations

Each individual researcher is assigned a standard storage allocation, or quota, on /home, /project, and /gscratch. Researchers who use more than their allocated space will be blocked from creating new files until they reduce their use or, in the case of /project and /gscratch, request a one-time expansion or purchase additional storage. The table below shows the storage allocations for individual accounts and the cost of additional space.

Directory Descriptions

/home 

  • Private user space for storing small, long-term files such as environment settings, scripts, and source code.

/project 

  • Project-specific space shared among all members of a project for storing short-term data, input, and output files.

/gscratch 

  • User-specific space for storing data that is actively being processed. This storage can be purged of old files as needed and is not for long-term storage.

/lscratch 

  • Node-specific space for storing short-term computational data relevant to jobs running on that node. Files are deleted nightly.
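A common pattern that fits these roles is to stage I/O-heavy intermediate files on /lscratch during a job and copy final results back to shared storage before the job ends. The directory layout below is an illustrative sketch, not a guaranteed ARCC path convention:

```
#!/bin/bash
#SBATCH --time=04:00:00

# Work in node-local scratch for fast, short-term I/O (path layout is assumed).
WORKDIR=/lscratch/$SLURM_JOB_ID
mkdir -p "$WORKDIR"
cd "$WORKDIR"

srun ./my_app --output results.dat   # replace with your actual application

# Copy results back to shared storage; /lscratch is cleaned nightly.
cp results.dat /gscratch/$USER/
```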

Directory Summary Table

Directory | Backed Up? | Default Allocation | Total Size | Media Type | Additional Storage Cost | Supported Protocols
/home | No | 5 GB | 1.2 PB | Tier 1 | $50 one-time setup fee and $100 / TB / year | NFS, CIFS, GPFS
/project | No | 1 TB | 1.2 PB | Tier 1 & 2 | One-time proposal increase renewed every six months, or $50 one-time setup fee and $100 / TB / year thereafter | NFS, CIFS, GPFS
/gscratch | No | 5 TB | 1.2 PB | Tier 1 & 2 | One-time proposal increase renewed every six months, or $50 one-time setup fee and $100 / TB / year thereafter | NFS, CIFS, GPFS
/lscratch | No | N/A | 200 GB or 1 TB | N/A | N/A | N/A

Augmenting Capacity of Disk Allocation

Researchers working with or generating massive data sets that exceed the default 5 TB allocation, or who have significant I/O needs, should consider the following options:

  • Rent space on shared hardware: There is a set price per TB per 3 years. Please contact ARCC for the exact price.

  • Purchase additional storage disks to be incorporated into Teton: This option is appropriate for groups that need more space than the free offering, but don’t have the extreme space or performance demands that would require investing in dedicated hardware.

  • Buy your own dedicated storage hardware for Research Computing to host: If you need more than about 15 TB of storage or very high performance, dedicated hardware is more economical and appropriate. The exact choices are always evolving. Please contact ARCC for details.

File Deletion Policy

This describes ARCC's file deletion policy:

  • /home: Home directories will only be deleted after the owner has been removed from the university system.

  • /project: Project directories will be preserved for up to 6 months after project termination.

  • /gscratch: Files may be deleted as needed without warning if required for system productivity.

  • /lscratch: Files will be removed after thirty (30) days of not being used or accessed.

