Goal: Provide new users with an understanding of what HPC is, how it works, and why it’s useful.



HPC: High Performance Computing

“High Performance Computing most generally refers to the practice of aggregating computing power in a way that delivers much higher performance than one could get out of a typical desktop computer or workstation in order to solve large problems in science, engineering, or business.”

HPC ≠ Desktop

HPC >> Desktop


What is a Cluster 

  • Users log in from their clients (desktops, laptops, workstations) into a login node.

  • In an HPC cluster, each compute node can be thought of as its own desktop, but the hardware resources of the cluster are available collectively as a single system.

  • Users may request specific allocations of the cluster’s resources, including more than a single node can provide.

  • Allocated resources may include CPUs (Cores), Nodes, RAM/Memory, GPUs, etc.
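As a rough illustration of what a resource request can look like, the sketch below builds the header of a batch script. It assumes the cluster is scheduled with Slurm (the reservation listing on this page looks like Slurm output). The helper `sbatch_header` and the account name `myproject` are hypothetical and only serve the example.

```python
def sbatch_header(account, time, nodes=1, cpus_per_task=1,
                  mem=None, gpus=0, partition=None):
    """Return the '#SBATCH' lines describing a resource request."""
    lines = [
        "#!/bin/bash",
        f"#SBATCH --account={account}",        # project to charge (hypothetical name)
        f"#SBATCH --time={time}",              # wall-time limit, HH:MM:SS
        f"#SBATCH --nodes={nodes}",            # number of compute nodes
        f"#SBATCH --cpus-per-task={cpus_per_task}",  # cores per task
    ]
    if mem:
        lines.append(f"#SBATCH --mem={mem}")         # memory per node, e.g. 16G
    if gpus:
        lines.append(f"#SBATCH --gres=gpu:{gpus}")   # GPUs per node
    if partition:
        lines.append(f"#SBATCH --partition={partition}")
    return "\n".join(lines)


# Example request: 8 cores, 16 GB of memory and 1 GPU for one hour.
print(sbatch_header("myproject", "01:00:00", cpus_per_task=8, mem="16G", gpus=1))
```

The printed lines would sit at the top of a job script, followed by the commands to run. Which options are required, and which partition names exist, depends on the cluster.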


Compute Nodes 

  • We typically have multiple users independently running jobs concurrently across compute nodes.

  • The cluster’s resources are shared, but your allocation does not interfere with anyone else’s.

    • i.e. you have your own cores, your own block of memory.

  • If someone else’s job fails it does NOT affect yours.

  • Example: The GPU compute nodes that are part of this reservation each have 8 GPU devices. Different, individual jobs can run on each of these compute nodes without affecting each other.


Core Service 1: HPC: What does this look like?

We maintain a number of clusters that allow researchers to run a variety of workloads, such as:

  • Computation-intensive analysis on large datasets.

    • Megabytes / Gigabytes / Terabytes.

    • On the filesystem in one / many files.

    • In memory. 

    • CPU only vs GPU enabled.

  • Long large-scale simulations. 

    • Hours, days, weeks…

    • Single job across multiple nodes each using multiple cores.

  • 10s/100s/1000s of small short tasks - nothing is too small.

    • Seconds, minutes, hours…

    • Single node - one to many cores.

  • and many other use cases…
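The "many small short tasks" case above is often handled by running the same script many times, with each copy picking its own share of the inputs. The sketch below assumes a Slurm job array whose indices start at 0; `SLURM_ARRAY_TASK_ID` and `SLURM_ARRAY_TASK_COUNT` are environment variables Slurm sets inside array jobs, while the sample file names and the helper `inputs_for_task` are made up for illustration.

```python
import os


def inputs_for_task(all_inputs, task_id, n_tasks):
    """Return the inputs one array task should handle (round-robin split)."""
    return all_inputs[task_id::n_tasks]


# Inside a job array Slurm sets these variables; the defaults let the
# script also run as a single task on a laptop.
task_id = int(os.environ.get("SLURM_ARRAY_TASK_ID", "0"))
n_tasks = int(os.environ.get("SLURM_ARRAY_TASK_COUNT", "1"))

# Hypothetical list of input files to process.
samples = [f"sample_{i:03d}.txt" for i in range(10)]

for name in inputs_for_task(samples, task_id, n_tasks):
    print(f"task {task_id} would process {name}")
```

Every input is handled by exactly one task, so the tasks can run independently on whichever cores become free.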


UW IT Data Center


Types of HPC systems 

There are generally two types of HPC systems: 

  1. Homogeneous: All compute nodes in the system share the same architecture. CPU, memory, and storage are the same across the system.

    1. Derecho: (Mostly Homogeneous)

    2. Cheyenne: (Decommissioned, Mostly Homogeneous)

  2. Heterogeneous: The compute nodes in the system can vary architecturally with respect to CPU, memory, even storage, and whether they have GPUs or not.

    1. Typically, similar compute nodes are grouped via partitions.

    2. You can view information about partitions in our hardware summary tables:

      1. MedicineBow Hardware Summary Table

      2. Beartooth Hardware Summary Table


Cluster and Partitions


Reservations

A reservation can be considered a temporary partition.

It is a set of compute nodes reserved for a period of time for a set of users/projects, who get priority use.

For example, a reservation would look like the following:

ReservationName = biocompworkshop
StartTime = 06.09-09:00:00
EndTime   = 06.17-17:00:00 
Duration  = 8-08:00:00
Nodes     = mdgx01,t[402-421],tdgx01 NodeCnt=22 CoreCnt=720
Users     = Groups=biocompworkshop
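To make the listing above easier to read, the following sketch expands the compact `Nodes` expression and recomputes the `Duration` field from the start and end times. It is an illustrative parser written for this page, not the scheduler's own; on a Slurm cluster, `scontrol show hostnames` performs the node expansion. The listing omits the year, so the `year` parameter is an assumption.

```python
import re
from datetime import datetime


def expand_nodelist(nodelist):
    """Expand a compact list such as 'a01,t[402-404]' into individual names."""
    names = []
    # Each match is a name, optionally followed by one [...] range group.
    for part in re.findall(r"[^,\[]+(?:\[[^\]]*\])?", nodelist):
        m = re.fullmatch(r"([^\[]+)\[([^\]]+)\]", part)
        if not m:
            names.append(part)
            continue
        prefix, ranges = m.groups()
        for rng in ranges.split(","):
            lo, _, hi = rng.partition("-")
            hi = hi or lo              # a single index such as [402]
            width = len(lo)            # keep any zero padding
            names.extend(f"{prefix}{i:0{width}d}"
                         for i in range(int(lo), int(hi) + 1))
    return names


def duration(start, end, year=2024):
    """Return end - start as 'days-HH:MM:SS' (the year is assumed)."""
    fmt = "%Y.%m.%d-%H:%M:%S"
    delta = (datetime.strptime(f"{year}.{end}", fmt)
             - datetime.strptime(f"{year}.{start}", fmt))
    hours, rem = divmod(delta.seconds, 3600)
    minutes, seconds = divmod(rem, 60)
    return f"{delta.days}-{hours:02d}:{minutes:02d}:{seconds:02d}"


nodes = expand_nodelist("mdgx01,t[402-421],tdgx01")
print(len(nodes))                                      # matches NodeCnt=22
print(duration("06.09-09:00:00", "06.17-17:00:00"))    # matches Duration
```

The expression covers `mdgx01`, the twenty nodes `t402` through `t421`, and `tdgx01`, which gives the 22 nodes reported in the listing.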

Condominium Model 

The “condo model”. 

  • Allow researchers to invest in the cluster by purchasing additional compute nodes that they get priority to use.

  • ‘Preempt’ jobs outside of the investor’s project, allowing the investor to start their jobs immediately.

    • “Immediately” if no other jobs from that investment project are already using the investment.

    • A preempted job is stopped and automatically re-queued. When it starts will be determined by the current cluster utilization.

    • Consider the idea of check-pointing which allows a job to continue analysis at the point where it was stopped.

  • This is managed by defining ‘investor partitions’.

  • ARCC Investment Program
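The checkpointing idea mentioned above can be sketched as follows: the job saves its progress after every step, and when it is re-queued it reloads that progress instead of starting over. This is a minimal illustration under stated assumptions. The JSON file layout, the function `run`, and the `stop_after` argument (used here only to imitate a preemption) are all hypothetical, and real applications often have their own checkpoint mechanisms.

```python
import json
import os


def run(total_steps, checkpoint_path, stop_after=None):
    """Sum 0..total_steps-1, saving progress so a restart can resume."""
    state = {"step": 0, "total": 0}
    if os.path.exists(checkpoint_path):
        # A previous run was interrupted: continue from where it stopped.
        with open(checkpoint_path) as fh:
            state = json.load(fh)

    done_this_run = 0
    while state["step"] < total_steps:
        if stop_after is not None and done_this_run == stop_after:
            return None                      # imitate the job being preempted
        state["total"] += state["step"]      # stand-in for the real work
        state["step"] += 1
        done_this_run += 1
        # Write to a temporary file, then rename it, so the checkpoint
        # is never left half-written if the job is stopped mid-write.
        tmp = checkpoint_path + ".tmp"
        with open(tmp, "w") as fh:
            json.dump(state, fh)
        os.replace(tmp, checkpoint_path)
    return state["total"]
```

Calling `run(10, "ckpt.json", stop_after=4)` and then `run(10, "ckpt.json")` gives the same final result as one uninterrupted run, because the second call resumes at step 4.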


Next Steps

 

 
