High Performance Computing

In today's compute-intensive research environment, access to resources capable of handling demanding tasks is essential. ARCC provides users with the Teton Compute Environment, a high-performance computing (HPC) cluster with over 500 compute nodes that lets researchers perform computation-intensive analysis on large datasets. Built-in tools, and the ability to request custom tools, allow users to fine-tune their research procedures and retain control over their data, projects, and collaborators.

Loren is a GPU-based specialty HPC cluster used by Dr. Piri’s research group, the High Bay Research Group.

This page uses words and phrases common in research computing. If you are unsure of any of the terms, please visit the Glossary page to learn more.

Overview

As research becomes more compute-intensive, ARCC has made high-performance computing a core service, currently delivered through the Teton Compute Environment, which allows researchers to perform computation-intensive analysis on large datasets. Using Teton, researchers have control over their data, projects, and collaborators. Built-in tools help users get up and running quickly, and the ability to request custom tools allows users to fine-tune their research procedures.

Condo Model

The model for sustaining the Condo program is premised on faculty and principal investigators using equipment purchase funds from their grants, or other available funds, to purchase compute nodes (individual servers) that are then added to the Teton compute cluster. Condo computing resources are used simultaneously by multiple users. Because Teton is a condo-model resource, investors have priority on the resources they have invested in. This priority is implemented through preemption: jobs not associated with an investment may run on invested nodes when they are idle, but can be preempted when the investor submits jobs. If an investor prefers not to preempt other users' jobs, ARCC can disable preemption on their resources and offer next-in-line access instead.
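As a concrete illustration, the sketch below assumes Teton's scheduler is Slurm (suggested by the preemption and fairshare terminology on this page); the partition and account names are hypothetical placeholders, not actual Teton names.

    # Inspect a partition's preemption settings (Slurm).
    # "inv-lab" is a hypothetical investor partition name.
    scontrol show partition inv-lab | grep -i preempt

    # An investor submits to their own partition; non-investor jobs
    # borrowing those nodes may be preempted to make room:
    sbatch --partition=inv-lab --account=lab-project job.sh

    # A non-investor job that may borrow idle investor nodes,
    # marked as safe to requeue if it gets preempted:
    sbatch --partition=general --requeue job.sh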

  • There are default concurrent-usage limits in place to prevent individual project accounts and users from saturating the cluster at the expense of others. The default limits are listed below. To incentivize investment in the condo system, investors have their limits increased.

  • The system leverages a fairshare mechanism that gives projects that run jobs only occasionally priority over those that run jobs continuously. To incentivize investment in the condo system, investors also have their fairshare value increased.

  • Finally, individual jobs incur runtime limits based on a study performed around 2014; the maximum walltime for a compute job is 7 days. ARCC is currently evaluating whether the orthogonal limits of CPU count and walltime are the optimal operational mode, and is considering concurrent-usage limits based on a relational combination of CPU count, memory, and walltime that would allow more flexibility for different areas of science. There will likely still be an upper limit on individual job walltime, since ARCC will not allow unlimited walltime, in part because of possible hardware faults. (A hedged job-submission sketch follows this list.)
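To make these limits concrete, here is a minimal batch-script sketch, again assuming a Slurm scheduler; the account and partition names are hypothetical, and the resource values simply illustrate staying within the 7-day walltime cap.

    #!/bin/bash
    #SBATCH --account=myproject      # hypothetical project account
    #SBATCH --partition=general      # hypothetical partition name
    #SBATCH --time=7-00:00:00        # the 7-day maximum walltime
    #SBATCH --nodes=1
    #SBATCH --cpus-per-task=16       # CPU count, one dimension of the concurrent limits
    #SBATCH --mem=64G                # memory, another candidate limit dimension

    srun ./my_analysis               # hypothetical analysis executable

If the cluster does run Slurm, fairshare standing can be checked with the sshare utility (for example, sshare -u $USER) to see how recent usage affects queue priority.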

HPC Clusters

Teton