Goal: Provide new users with an understanding of what HPC is, how it works, and why it’s useful.
...
“High Performance Computing most generally refers to the practice of aggregating computing power in a way that delivers much higher performance than one could get out of a typical desktop computer or workstation in order to solve large problems in science, engineering, or business.”
...
A cluster is a collection of computers (nodes) that are connected through a fast internal network.
The nodes are connected to shared storage that is available to users from any node within the cluster.
...
How do users interact with the cluster? Step 1
...
Users begin by logging in to the HPC cluster from their clients (desktops, laptops, workstations).
Their login request goes from their client through the internet to a login node on the HPC.
The login node serves as an initial access point to the cluster.
...
Login
...
Login Nodes
...
The login node is your starting point when you first connect to the cluster.
Sometimes login nodes are also called “head nodes”, “landing pads”, or “submit nodes”.
As a gateway to the rest of the cluster, it can be used for uploading or downloading files, or running quick tests.
You should never use the login node to perform your actual computational work.
...
HPC Etiquette: Never run your jobs on the login node!
...
How do you know which types of work will affect the login node?
Viewing or editing small files, monitoring jobs, and submitting work should not use significant resources. You won’t always know in every situation, though, so err on the side of caution.
SSH | Web (OnDemand) |
---|---|
If you're not sure, you can always start an interactive job with salloc to be on the safe side. | OnDemand usually takes care of this for you by allocating the necessary hardware according to your request form when the application is launched. One exception: launching shell access from OnDemand. Shell access, even through the web interface, should be thought of as a typical SSH session and will initially place you on a login node. |
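For the SSH case, an interactive request might look like the sketch below. The helper name `interactive_request` and the resource values are illustrative assumptions, not site defaults, and `<project-name>` is a placeholder. The options shown (`--account`, `--time`, `--nodes`, `--ntasks`, `--cpus-per-task`, `--mem`) are standard `salloc` options. The function only prints the command so it can be reviewed before it is run.

```shell
#!/bin/sh
# Hypothetical helper: builds an salloc command line from four arguments
# (account, wall-time, CPUs per task, memory) and prints it without running it.
interactive_request() {
    printf 'salloc --account=%s --time=%s --nodes=1 --ntasks=1 --cpus-per-task=%s --mem=%s\n' \
        "$1" "$2" "$3" "$4"
}

# Example values are assumptions: 30 minutes, 1 core, 4 GB of memory.
interactive_request "<project-name>" 30:00 1 4G
```

Running the printed command from a login node submits the request; once it is granted, the work happens on a compute node rather than on the login node.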
...
How do users interact with the cluster? Step 2
...
To perform your actual work (computations, modeling, simulations, etc.) we can:
Submit a job to the cluster from the login node via the job scheduler (Slurm), or
Run an salloc from the login node to request an interactive job/session. This will allocate our requested resources and place us on a compute node with the resources we need to perform our computational work.
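The first option, batch submission, can be sketched as follows. `submit_job` is a hypothetical helper name, and it assumes a submission script already exists at the path passed in. `sbatch --parsable` prints the job ID (optionally followed by `;` and a cluster name), and `squeue -j` lists the state of that job.

```shell
#!/bin/sh
# Hypothetical helper: submit a script with sbatch, then show the job in the queue.
submit_job() {
    script="$1"
    # Slurm commands exist only on the cluster, so fail clearly elsewhere.
    if ! command -v sbatch >/dev/null 2>&1; then
        echo "sbatch not found: run this from a cluster login node" >&2
        return 1
    fi
    # --parsable prints "jobid" or "jobid;cluster"; keep only the job ID.
    job_id=$(sbatch --parsable "$script") || return 1
    job_id=${job_id%%;*}
    echo "Submitted job $job_id"
    squeue -j "$job_id"
}

# Usage on a login node (the script name is an assumption):
# submit_job my_job.sh
```

Unlike an interactive session, a submitted job runs unattended once the scheduler finds resources for it, so the login session can be closed in the meantime.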
...
How do users interact with the cluster? Job with a Submission Script
...
Ex: Job Submission Script
This is a script that runs a job, which consists of tasks.
Code Block |
---|
#!/bin/bash
#SBATCH --account=<project-name> # Use #SBATCH to define Slurm related values.
#SBATCH --time=10:00 # You MUST define an account and wall-time.
#SBATCH --reservation=<reservation-name>
#SBATCH --mail-type=ALL # Slurm will e-mail on all job related events
#SBATCH --mail-user=<email address> # Send e-mails to <email address>
echo "SLURM_JOB_ID: $SLURM_JOB_ID" # Slurm environment variables are available inside the job.
start=$(date +'%D %T') # Capture the start time using a shell command.
echo "Start: $start" # Print the start time.
module purge # Unload any previously loaded modules.
module load gcc/13.2.0 python/3.10.6 # Load the modules you require for your environment.
cd ~/ # Change directory to your home.
python python01.py # Run your scripts/commands.
sleep 1m # Pause for one minute.
end=$(date +'%D %T') # Capture the end time.
echo "End: $end" # Print the end time. |
...
Jobs are allocated to hardware within the cluster by a scheduler
...
In an HPC cluster, each compute node can be thought of as its own desktop, but the hardware resources of the cluster are available collectively as a single system.
Users may request specific allocations of the resources available on the cluster, including allocations beyond what a single node provides.
Requested resources may include CPUs (cores), nodes, RAM/memory, GPUs, etc., and most allocation requests should specify them.
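A resource request can be expressed as `#SBATCH` directives at the top of a submission script. The values below are illustrative assumptions rather than recommendations, and `--gres=gpu:1` only makes sense where GPU nodes are available. `report_allocation` is a hypothetical helper that echoes what Slurm actually granted, using environment variables that Slurm sets inside a job.

```shell
#!/bin/bash
#SBATCH --account=<project-name>   # placeholder; account and wall-time are required
#SBATCH --time=10:00               # wall-time of 10 minutes
#SBATCH --nodes=1                  # number of nodes
#SBATCH --ntasks=1                 # number of tasks (processes)
#SBATCH --cpus-per-task=4          # cores per task
#SBATCH --mem=8G                   # memory per node
#SBATCH --gres=gpu:1               # one GPU; drop this line for CPU-only work

# Hypothetical helper: print what the scheduler granted.
# Outside a Slurm job these variables are unset, so a fallback is shown.
report_allocation() {
    echo "Nodes: ${SLURM_JOB_NUM_NODES:-not in a job}"
    echo "CPUs per task: ${SLURM_CPUS_PER_TASK:-not in a job}"
    echo "Node list: ${SLURM_JOB_NODELIST:-not in a job}"
}

report_allocation
```

Because the `#SBATCH` lines are ordinary comments to the shell, the same file can be run directly for a quick syntax check before it is submitted.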
...
Compute Nodes
...
...
Core Service 1: HPC: What does this look like?
We maintain a number of clusters that allow researchers to run a variety of workloads, such as:
Computation-intensive analysis on large datasets.
Megabytes / Gigabytes / Terabytes.
On the filesystem in one / many files.
In memory.
CPU only vs GPU enabled.
Long large-scale simulations.
Hours, days, weeks…
Single job across multiple nodes each using multiple cores.
10s/100s/1000s of small short tasks - nothing is too small.
Seconds, minutes, hours…
Single node - one to many cores.
and lots of other use cases…
You can also simply use our HPC clusters instead of bogging down your local machine.
...
UW IT Data Center
...
Types of HPC systems
There are generally two types of HPC systems:
Homogeneous: All compute nodes in the system share the same architecture. CPU, memory, and storage are the same across the system.
Heterogeneous: The compute nodes in the system can vary architecturally with respect to CPU, memory, even storage, and whether they have GPUs or not.
Typically, similar compute nodes are grouped via partitions.
You can view information about partitions in our hardware summary tables.
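Partition details can also be queried directly from the scheduler. The sketch below wraps two standard `sinfo` calls. `list_partitions` is a hypothetical helper name, and the format string (`%P` partition, `%D` node count, `%c` CPUs per node, `%m` memory per node in MB, `%G` generic resources such as GPUs) is just one possible choice of columns.

```shell
#!/bin/sh
# Hypothetical helper: summarize partitions and their per-node hardware.
list_partitions() {
    if ! command -v sinfo >/dev/null 2>&1; then
        echo "sinfo not found: run this on the cluster" >&2
        return 1
    fi
    sinfo --summarize                 # one line per partition
    sinfo -o "%P %D %c %m %G"         # partition, nodes, CPUs/node, memory (MB), GRES
}

# Usage on a login node:
# list_partitions
```

On a heterogeneous cluster, the second listing is a quick way to see which partitions group the larger-memory or GPU nodes.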
...
Cluster and Partitions
...
...
Reservations
A reservation can be considered a temporary partition.
It is a set of compute nodes reserved for a period of time for a set of users/projects, who get priority use.
...
Code Block |
---|
ReservationName = biocompworkshop
StartTime = 06.09-09:00:00
EndTime = 06.17-17:00:00
Duration = 8-08:00:00
Nodes = mdgx01,t[402-421],tdgx01 NodeCnt=22 CoreCnt=720
Users = Groups=biocompworkshop |
...
Condominium Model
The “condo model” allows researchers to invest in the cluster by purchasing additional compute nodes that they get priority to use.
Jobs from outside the investor’s project that are running on those nodes are ‘preempted’, which allows the investor to start their jobs immediately.
“Immediately” applies only if no other jobs from that investment project are already using the investment.
A preempted job is stopped and automatically re-queued. When it restarts is determined by the current cluster utilization.
Consider check-pointing, which allows a job to continue its analysis from the point where it was stopped.
This is managed by defining ‘investor partitions’.
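The check-pointing idea can be sketched in a few lines of shell. This is a minimal illustration, not a general solution: the file name `checkpoint.txt`, the step count, and the `run_steps` helper are all assumptions, and the `echo` stands in for real work. Progress is recorded after each completed step, so a re-queued job skips what is already done.

```shell
#!/bin/sh
# Assumed names: a real job would keep the checkpoint file on shared storage.
CHECKPOINT="${CHECKPOINT:-checkpoint.txt}"
TOTAL_STEPS="${TOTAL_STEPS:-5}"

run_steps() {
    # Resume after the last completed step, or start from step 1.
    last_done=0
    if [ -s "$CHECKPOINT" ]; then
        last_done=$(cat "$CHECKPOINT")
    fi
    step=$((last_done + 1))
    while [ "$step" -le "$TOTAL_STEPS" ]; do
        echo "Running step $step"          # stand-in for the real computation
        # Write to a temporary file, then rename, so a job stopped mid-write
        # does not leave a half-written checkpoint behind.
        echo "$step" > "$CHECKPOINT.tmp" && mv "$CHECKPOINT.tmp" "$CHECKPOINT"
        step=$((step + 1))
    done
}
```

Calling `run_steps` from a submission script means that a job preempted after step 3 resumes at step 4 when it is re-queued, provided the checkpoint file is on storage the new job can read.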
...