Goal: Provide new users with an understanding of what HPC is, how it works, and why it’s useful.

...

“High Performance Computing most generally refers to the practice of aggregating computing power in a way that delivers much higher performance than one could get out of a typical desktop computer or workstation in order to solve large problems in science, engineering, or business.”

...

  • Users log in from their clients (desktops, laptops, workstations) into a login node.

  • In an HPC cluster, each compute node can be thought of as its own desktop, but the hardware resources of the cluster are available collectively as a single system.

  • Users may request specific allocations of the resources available on the cluster, including more than a single node can provide.

  • Allocated resources may include CPUs (Cores), Nodes, RAM/Memory, GPUs, etc.
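As a rough illustration of what such a request can look like, the sketch below builds the header of a batch script. It assumes the cluster is scheduled with Slurm (the reservation listing on this page is in Slurm's format). The function name `build_sbatch_header` and the example values are hypothetical, and the options a particular cluster accepts may differ.

```python
def build_sbatch_header(nodes, tasks_per_node, cpus_per_task, mem_gb,
                        gpus_per_node=0, time_limit="01:00:00"):
    """Return the #SBATCH lines for a resource request (illustrative only)."""
    lines = [
        "#!/bin/bash",
        f"#SBATCH --nodes={nodes}",                    # number of compute nodes
        f"#SBATCH --ntasks-per-node={tasks_per_node}",  # processes per node
        f"#SBATCH --cpus-per-task={cpus_per_task}",    # cores per process
        f"#SBATCH --mem={mem_gb}G",                    # memory per node
        f"#SBATCH --time={time_limit}",                # wall-time limit
    ]
    if gpus_per_node > 0:
        # GPUs are requested per node as a generic resource.
        lines.append(f"#SBATCH --gres=gpu:{gpus_per_node}")
    return "\n".join(lines)


# Hypothetical request: 2 nodes, 4 tasks per node, 8 cores per task,
# 64 GB of memory and 2 GPUs per node.
print(build_sbatch_header(2, 4, 8, 64, gpus_per_node=2))
```

The point is only that cores, nodes, memory, and GPUs are each requested explicitly, and the scheduler grants them as one allocation.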

...

  • We typically have multiple users independently running jobs concurrently across compute nodes.

  • Resources are shared, but one user’s allocation does not interfere with anyone else’s.

    • i.e. you have your own cores, your own block of memory.

  • If someone else’s job fails it does NOT affect yours.

  • Example: The GPU compute nodes that are part of this reservation each have 8 GPU devices. Different, individual jobs can run on each of these compute nodes without affecting each other.
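The idea of shared but non-interfering resources can be pictured with a toy model. The sketch below is not how the scheduler is implemented. It is a hypothetical `Node` class that hands out disjoint sets of GPU device IDs, so that one job ending or failing leaves the other job's devices untouched.

```python
class Node:
    """Toy model of one compute node handing out GPU devices to jobs."""

    def __init__(self, name, gpu_count=8):
        self.name = name
        self.free_gpus = list(range(gpu_count))
        self.jobs = {}  # job id -> list of GPU ids owned by that job

    def allocate(self, job_id, gpus):
        """Give `gpus` free devices exclusively to `job_id`."""
        if gpus > len(self.free_gpus):
            raise RuntimeError(f"{self.name}: not enough free GPUs")
        granted = [self.free_gpus.pop(0) for _ in range(gpus)]
        self.jobs[job_id] = granted
        return granted

    def release(self, job_id):
        """Return a finished (or failed) job's devices to the free pool."""
        self.free_gpus.extend(self.jobs.pop(job_id))
        self.free_gpus.sort()


node = Node("gpu-node")              # hypothetical node name
a = node.allocate("job-a", 2)        # [0, 1]
b = node.allocate("job-b", 4)        # [2, 3, 4, 5]
node.release("job-a")                # job-a ends; job-b keeps its devices
print(a, b, node.jobs)
```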

...

Core Service 1: HPC: What does this look like?

We maintain a number of clusters so that researchers can run a variety of workloads, such as:

  • Computation-intensive analysis on large datasets.

    • Megabytes / Gigabytes / Terabytes.

    • On the filesystem in one / many files.

    • In memory. 

    • CPU only vs GPU enabled.

  • Long large-scale simulations. 

    • Hours, days, weeks…

    • Single job across multiple nodes each using multiple cores.

  • 10s/100s/1000s of small short tasks - nothing is too small.

    • Seconds, minutes, hours…

    • Single node - one to many cores.

  • and lots of other use cases…
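For the case of many small, short tasks, one common pattern is to divide the task list evenly across the cores allocated on a node. The helper below is a minimal sketch of that division. The name `split_into_chunks` is hypothetical, and how each chunk is then executed (a loop, a process pool, a job array) is left open.

```python
def split_into_chunks(tasks, workers):
    """Split `tasks` into at most `workers` contiguous, near-equal chunks."""
    if workers < 1:
        raise ValueError("workers must be at least 1")
    base, extra = divmod(len(tasks), workers)
    chunks = []
    start = 0
    for i in range(workers):
        # The first `extra` chunks take one additional task each.
        size = base + (1 if i < extra else 0)
        if size == 0:
            break  # fewer tasks than workers: stop once tasks run out
        chunks.append(tasks[start:start + size])
        start += size
    return chunks


# Ten tasks spread over four cores.
print(split_into_chunks(list(range(10)), 4))
# [[0, 1, 2], [3, 4, 5], [6, 7], [8, 9]]
```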

...

There are generally two types of HPC systems:

  1. Homogeneous: All compute nodes in the system share the same architecture. CPU, memory, and storage are the same across the system.

    1. Derecho (Mostly Homogeneous)

    2. Cheyenne (Decommissioned, Mostly Homogeneous)

  2. Heterogeneous: The compute nodes in the system can vary architecturally with respect to CPU, memory, even storage, and whether they have GPUs or not.

    1. Typically, similar compute nodes are grouped via partitions.

    2. Information about partitions is available in our hardware summary tables:

      1. MedicineBow Hardware Summary Table

      2. Beartooth Hardware Summary Table

...

A reservation can be considered a temporary partition.

It is a set of compute nodes reserved for a period of time for a set of users/projects, who get priority use.

...

ReservationName = biocompworkshop
StartTime = 06.09-09:00:00
EndTime   = 06.17-17:00:00 
Duration  = 8-08:00:00
Nodes     = mdgx01,t[402-421],tdgx01 NodeCnt=22 CoreCnt=720
Users     = Groups=biocompworkshop
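The figures in this listing can be cross-checked by hand or with a few lines of code. The sketch below expands the compressed node list and computes the reservation length. It handles only the simple `prefix[lo-hi,...]` form shown here. The function names are hypothetical, and because the listing omits the year, an arbitrary placeholder year is used for the date arithmetic.

```python
import re
from datetime import datetime


def expand_nodes(spec):
    """Expand a node list such as 'a01,t[402-404]' into individual names."""
    names = []
    # Each part is a name, optionally followed by one bracketed range list.
    for part in re.findall(r"[^,\[]+(?:\[[^\]]*\])?", spec):
        match = re.fullmatch(r"([^\[]+)\[([^\]]+)\]", part)
        if not match:
            names.append(part)
            continue
        prefix, ranges = match.groups()
        for item in ranges.split(","):
            lo, _, hi = item.partition("-")
            hi = hi or lo          # a single number is a range of one
            width = len(lo)        # preserve any zero padding
            for n in range(int(lo), int(hi) + 1):
                names.append(f"{prefix}{n:0{width}d}")
    return names


def reservation_length(start, end, year=2000):
    """Time between two 'MM.DD-HH:MM:SS' stamps (placeholder year assumed)."""
    fmt = "%Y.%m.%d-%H:%M:%S"
    begin = datetime.strptime(f"{year}.{start}", fmt)
    finish = datetime.strptime(f"{year}.{end}", fmt)
    return finish - begin


nodes = expand_nodes("mdgx01,t[402-421],tdgx01")
print(len(nodes))                                              # 22
print(reservation_length("06.09-09:00:00", "06.17-17:00:00"))  # 8 days, 8:00:00
```

Both results agree with the listing: `NodeCnt=22` (1 + 20 + 1 nodes) and a `Duration` of `8-08:00:00`, that is, 8 days and 8 hours.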

...

Condominium Model 

The “condo model”. 

  • Allows researchers to invest in the cluster by purchasing additional compute nodes that they get priority to use.

  • Investor jobs ‘preempt’ jobs from outside the investor’s project, allowing the investor to start their jobs immediately.

    • “Immediately” applies if no other jobs from that investment project are already using the investment.

    • A preempted job is stopped and automatically re-queued. When it restarts is determined by the current cluster utilization.

    • Consider checkpointing, which allows a job to continue its analysis from the point where it was stopped.

  • This is managed by defining ‘investor partitions’.

  • ARCC Investment Program
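Because a preempted job is stopped and later restarted, checkpointing is worth a short illustration. The sketch below is a minimal, application-level example and not a feature of the cluster itself. The function `run_with_checkpoint`, the JSON file format, and the `stop_after` argument (used here only to simulate a preemption) are all hypothetical. The work is a trivial running sum so that the resume logic stays visible.

```python
import json
import os


def run_with_checkpoint(total_steps, checkpoint_path, stop_after=None):
    """Sum 0..total_steps-1, saving progress so a restarted run can resume."""
    state = {"next_step": 0, "total": 0}
    if os.path.exists(checkpoint_path):
        # A previous run was interrupted: pick up where it left off.
        with open(checkpoint_path) as fh:
            state = json.load(fh)

    steps_done = 0
    while state["next_step"] < total_steps:
        if stop_after is not None and steps_done >= stop_after:
            return None  # simulated preemption: stop without finishing
        state["total"] += state["next_step"]
        state["next_step"] += 1
        steps_done += 1
        # Write to a temporary file, then rename, so a stop in the middle
        # of writing never leaves a half-written checkpoint behind.
        tmp_path = checkpoint_path + ".tmp"
        with open(tmp_path, "w") as fh:
            json.dump(state, fh)
        os.replace(tmp_path, checkpoint_path)
    return state["total"]
```

A first call that is interrupted after a few steps returns `None`. Calling the function again with the same checkpoint path continues from the saved step instead of starting over.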

...