Goal: Provide new users with an understanding of what HPC is, how it works, and why it’s useful.

HPC: High Performance Computing

“High Performance Computing most generally refers to the practice of aggregating computing power in a way that delivers much higher performance than one could get out of a typical desktop computer or workstation in order to solve large problems in science, engineering, or business.”

HPC ≠ Desktop

HPC >> Desktop

What is a Cluster

A cluster is a collection of computers (nodes) that are connected through a fast internal network.
The nodes are also connected to shared storage that is available to users from any node within the cluster.

How do users interact with the cluster? Step 1

Users begin by logging into the HPC from their clients (desktops, laptops, workstations).
- Their login request goes from their client through the internet to a login node on the HPC.
- The login node serves as an initial access point to the cluster.

Login Nodes

Login nodes are the initial access point when you connect to the cluster.
Login nodes are sometimes also called “head nodes”, “landing pads”, or “submit nodes”.
As gateways to the rest of the cluster, they can be used for uploading or downloading files, or for running quick tests.
You should never use a login node to perform your actual computational work.

HPC Etiquette: Never run your jobs on the login node!

This is standard policy on most shared clusters. Why?
- Login nodes are used by everyone (all the HPC’s users), as the gateway to get onto the HPC
- If you use significant resources on the login nodes by running your computations there, you can affect other users’ ability to log in, making the HPC unavailable to them.
Using computational resources on the login node has consequences:
- At ARCC, we will send you a warning (usually through e-mail).
- If you are affecting the ability of other users to log in or perform their computations, we will kill your job.
- If you run jobs on the login nodes repeatedly, we may need to take away your access to the HPC.

How to know what types of work will affect the login node?

Viewing or editing small files, monitoring jobs, and submitting work should not use significant resources. You won’t always know in advance how much a task will use, though, so err on the side of caution.

SSH

Web (OnDemand)

If you're not sure, you can always request an interactive job with salloc to be on the safe side:

$ salloc --account=myproject --time=40:00 --nodes=1 --ntasks-per-node=1 --cpus-per-task=8

OnDemand usually takes care of this for you by allocating necessary hardware according to your request form when the application is launched.

One exception: Launching shell access from OnDemand

Shell access, even through the web interface, should be thought of as a typical SSH session and will initially place you on a login node.
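
If you are unsure whether a shell is running inside a Slurm job, one rough check is to look for the SLURM_JOB_ID environment variable, which Slurm sets inside a job or allocation. This is only a sketch: the variable being unset does not prove you are on a login node, and the wording of the messages is our own.

```shell
# SLURM_JOB_ID is set by Slurm inside a job or allocation.
# If it is empty, this shell was not started by Slurm.
if [ -n "${SLURM_JOB_ID:-}" ]; then
    location="inside Slurm job ${SLURM_JOB_ID}"
else
    location="outside any Slurm job (for example, on a login node)"
fi
echo "$(hostname): ${location}"
```

If the message says you are outside any Slurm job, treat the session as a login node session and keep the work light.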

How do users interact with the cluster? Step 2

To perform your actual work (computations, modeling, simulations, etc.) we can:

Submit a job to the cluster from the login node via the job scheduler (Slurm), or
Run salloc from the login node to request an interactive session.
- This will allocate our requested resources and place us on a compute node with the resources we need to perform our computational work.
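
The first option, submitting a job through Slurm, could look like the minimal batch script below. It is a sketch only: the resource values mirror the salloc example used in this workshop, while the account `myproject`, the job name, and the output file name are placeholders to replace with your own.

```shell
#!/bin/bash
# Project to charge and wall-time limit (40 minutes).
#SBATCH --account=myproject
#SBATCH --time=40:00
# One node, one task, eight cores for that task.
#SBATCH --nodes=1
#SBATCH --ntasks-per-node=1
#SBATCH --cpus-per-task=8
# Placeholder job name; %j in the output file name expands to the job ID.
#SBATCH --job-name=example_job
#SBATCH --output=example_job_%j.out

# Slurm sets SLURM_* variables inside a job. The fallbacks after ":-"
# only apply if the script is run outside of Slurm.
cpus="${SLURM_CPUS_PER_TASK:-1}"
echo "Job ${SLURM_JOB_ID:-none} on $(hostname) with ${cpus} CPU(s)"
```

Saved as, say, example_job.sh (a hypothetical file name), the script would be submitted from a login node with `sbatch example_job.sh`, and `squeue -u $USER` would list it while it is queued or running.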

In an HPC cluster, each compute node can be thought of as its own desktop, but the hardware resources of the cluster are available collectively as a single system.
Users may request specific allocations of the resources available on the cluster, including more than a single node can provide.
Requested resources may include nodes, CPUs (cores), memory (RAM), GPUs, etc., and most allocation requests should specify them.

Compute Nodes

We typically have multiple users independently running jobs concurrently across compute nodes.
The cluster’s resources are shared, but the resources allocated to your job do not interfere with anyone else’s.
- i.e. you have your own cores, your own block of memory.
If someone else’s job fails it does NOT affect yours.
Example: The GPU compute nodes that are part of this reservation each have 8 GPU devices. Different, individual jobs can run on each of these compute nodes without affecting each other.
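
As a rough illustration of how one job can claim a single GPU on such a node, the batch script below requests one GPU device through Slurm's generic resource option. Treat it as a sketch under assumptions: `myproject` is a placeholder account, and a site may also require a GPU partition or a specific GPU type, which is not shown here.

```shell
#!/bin/bash
#SBATCH --account=myproject
#SBATCH --time=40:00
#SBATCH --nodes=1
#SBATCH --ntasks-per-node=1
#SBATCH --cpus-per-task=8
# Request one GPU device on the node; the other GPUs stay free for other jobs.
#SBATCH --gres=gpu:1

# When GPUs are allocated, Slurm typically sets CUDA_VISIBLE_DEVICES to the
# device(s) this job may use. The fallback applies outside of a GPU job.
gpus="${CUDA_VISIBLE_DEVICES:-none}"
echo "GPU devices visible to this job: ${gpus}"
```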

Core Service 1: HPC: What does this look like?

We maintain a number of clusters so that researchers can run a variety of workloads, such as:

Computation-intensive analysis on large datasets.
- Megabytes / Gigabytes / Terabytes.
- On the filesystem in one / many files.
- In memory.
- CPU only vs GPU enabled.
Long large-scale simulations.
- Hours, days, weeks…
- Single job across multiple nodes each using multiple cores.
10s/100s/1000s of small short tasks - nothing is too small.
- Seconds, minutes, hours…
- Single node - one to many cores.
and lots of other use cases…
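
For the "many small short tasks" case, one common Slurm pattern is a job array, where a single submission runs the same script many times with a different task index each time. The sketch below assumes 100 tasks and input files named input_1.txt to input_100.txt; the file naming scheme and the `myproject` account are hypothetical.

```shell
#!/bin/bash
#SBATCH --account=myproject
#SBATCH --time=10:00
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=1
# Run 100 independent copies of this script, numbered 1 to 100.
#SBATCH --array=1-100

# Slurm sets SLURM_ARRAY_TASK_ID to this copy's index.
# The fallback of 1 only applies if the script is run outside of Slurm.
task_id="${SLURM_ARRAY_TASK_ID:-1}"
input_file="input_${task_id}.txt"
echo "Array task ${task_id} would process ${input_file}"
```

Each array task is scheduled separately, so tasks can start as soon as a single core becomes free.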

UW IT Data Center

Types of HPC systems

There are generally two types of HPC systems:

Homogeneous: All compute nodes in the system share the same architecture. CPU, memory, and storage are the same across the system.
1. Derecho: (Mostly Homogeneous)
2. Cheyenne: (Decommissioned, Mostly Homogeneous)
Heterogeneous: The compute nodes in the system can vary architecturally with respect to CPU, memory, even storage, and whether they have GPUs or not.
1. Typically, similar compute nodes are grouped via partitions.
2. You can view information about partitions in our hardware summary tables:
  1. MedicineBow Hardware Summary Table
  2. Beartooth Hardware Summary Table
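
Partitions can also be inspected from the command line with Slurm's sinfo command. The sketch below prints one line per partition and node configuration; the choice of columns is our own, and the guard only exists so the snippet fails gracefully on a machine without Slurm.

```shell
# Columns: %P partition, %D node count, %c CPUs per node,
# %m memory per node (MB), %G generic resources such as GPUs.
sinfo_format="%P %D %c %m %G"

if command -v sinfo >/dev/null 2>&1; then
    sinfo --format="${sinfo_format}"
else
    echo "sinfo not found: run this on a cluster login node"
fi
```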

Cluster and Partitions

Reservations

A reservation can be considered a temporary partition.

It is a set of compute nodes reserved for a period of time for a set of users/projects, who get priority use.

For example, a reservation would look like the following:

ReservationName = biocompworkshop
StartTime = 06.09-09:00:00
EndTime   = 06.17-17:00:00 
Duration  = 8-08:00:00
Nodes     = mdgx01,t[402-421],tdgx01 NodeCnt=22 CoreCnt=720
Users     = Groups=biocompworkshop
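
To run inside a reservation, a job request names it with the --reservation option. The sketch below only builds and prints such a request so it can be inspected first; the resource values are copied from the salloc example in this workshop, and `myproject` is a placeholder for an account that is allowed to use the reservation.

```shell
# Reservation name taken from the example listing above.
reservation="biocompworkshop"

# Same interactive request as before, now pointed at the reservation.
request="salloc --account=myproject --reservation=${reservation} --time=40:00 --nodes=1 --ntasks-per-node=1 --cpus-per-task=8"

# Print the command; on the cluster you would type it at the prompt.
echo "${request}"
```

On the cluster, `scontrol show reservation` lists the reservations that currently exist, in a format similar to the listing above.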

Condominium Model

The “condo model”.

Allows researchers to invest in the cluster by purchasing additional compute nodes that they get priority to use.
Investor jobs ‘preempt’ jobs from outside the investor’s project, which allows the investor to start their jobs immediately.
- “Immediately” applies only if no other jobs from that investment project are already using the invested nodes.
- A preempted job is stopped and automatically re-queued. When it restarts is determined by the current cluster utilization.
- Consider checkpointing, which allows a job to resume its analysis from the point where it was stopped.
This is managed by defining ‘investor partitions’.
ARCC Investment Program
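
The checkpointing idea mentioned above can be sketched as a batch script that records its progress after every step and, when restarted after a preemption, resumes from the last recorded step. This is a toy illustration under assumptions: the checkpoint file name, the five-step loop, and the `myproject` account are hypothetical, and real applications usually save their own state rather than just a step counter.

```shell
#!/bin/bash
#SBATCH --account=myproject
#SBATCH --time=40:00
# Allow Slurm to put this job back in the queue if it is preempted.
#SBATCH --requeue

checkpoint_file="checkpoint.txt"
total_steps=5

# Resume after the last completed step if a checkpoint exists.
if [ -f "${checkpoint_file}" ]; then
    last_done=$(cat "${checkpoint_file}")
else
    last_done=0
fi

step=$((last_done + 1))
while [ "${step}" -le "${total_steps}" ]; do
    echo "Running step ${step} of ${total_steps}"
    # The real work for this step would go here.
    # Record the completed step so a restart can skip it.
    echo "${step}" > "${checkpoint_file}"
    step=$((step + 1))
done
```

If the job is stopped partway through, the next run reads the checkpoint file and continues with the following step instead of starting over.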
