Note

Introduction: This workshop session will provide a quick tour covering high-level concepts, commands, and processes for using Linux and HPC on our MedicineBow cluster. It will cover enough to allow an attendee to access the cluster and to perform the analysis associated with this workshop.

...

What is HPC

Info

HPC stands for High Performance Computing and is one of UW ARCC’s core services. HPC is the practice of aggregating computing power in a way that delivers much higher performance than one could get out of a typical desktop or workstation. HPC is used to solve large problems; common use cases include:

  1. Performing computation-intensive analyses on large datasets: MB/GB/TB in a single or many files, computations requiring RAM in excess of what is available on a single workstation, or analysis performed across multiple CPUs (cores) or GPUs.

  2. Performing long, large-scale simulations: Hours, days, weeks, spread across multiple nodes each using multiple cores.

  3. Running repetitive tasks in parallel: 10s/100s/1000s of small short tasks.

...

Info
  • We typically have multiple users independently running jobs concurrently across compute nodes - multi-tenancy.

  • Resources are shared, but your job does not interfere with anyone else’s resources.

    • i.e. you have your own cores and your own block of memory (see the sketch after this list).

  • If someone else’s job fails it does NOT affect yours.
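For example, when you request resources through Slurm (covered later), the scheduler reserves cores and memory for your job alone, even on a shared node. A minimal sketch using standard salloc flags; the specific numbers are illustrative:

Code Block
# Request 2 cores and 8 GB of memory for 1 hour; these resources are
# reserved for your job alone, even if other users' jobs share the node.
[]$ salloc --cpus-per-task=2 --mem=8G --time=01:00:00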

...

Info

There are 2 types of HPC systems:

  1. Homogeneous: All compute nodes in the system share the same architecture. CPU, memory, and storage are the same across the system. (Ex: NWSC’s Derecho)

  2. Heterogeneous: The compute nodes in the system can vary architecturally with respect to CPU, memory, and even storage, as well as whether or not they have GPUs. Usually, the nodes are grouped into partitions. MedicineBow is a heterogeneous cluster; its partitions can be listed as shown below.
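On a heterogeneous cluster you can see the partitions and how their node types differ using Slurm’s standard sinfo command. A quick sketch (output not shown):

Code Block
# Summarize partitions and their node counts/states.
[]$ sinfo --summarize

# Show CPUs, memory (MB), and GPUs (GRES) per partition to see the
# architectural differences between node groups.
[]$ sinfo -o "%P %c %m %G"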

...

Cluster: Heterogeneous: Partitions

...

Info

Remember:

  • The MedicineBow Shell Access opens up a new browser tab that is running on a login node. Do not run any computation on these.
    [<username>@mblog1/2 ~]$

  • The OnDemand Interactive Desktop (terminal) is already running on a compute node (see the quick check below).
    [<username>@mbcpu-001 ~]$
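If you are ever unsure which type of node your terminal is on, the hostname in the prompt tells you, and you can also print it directly:

Code Block
# Print the name of the node this shell is running on.
# mblog1/mblog2 = login node (no computation!); mbcpu-001 etc. = compute node.
[]$ hostname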

...

Info

As a courtesy to your colleagues, please do not run the following on any login nodes:

  1. Anything compute-intensive (tasks using significant computational/hardware resources - Ex: sustained 100% CPU usage).

  2. Any collection of a large number of tasks that together have a similar hardware footprint to the actions mentioned above.

  3. Instead, either start an Interactive Desktop, an interactive session (salloc), or submit a job (sbatch) - see the sketch after this list. These will be covered later.

  4. See more on our ARCC HPC Policies.
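As a rough sketch of those alternatives (both covered in detail later); my_job.sh is a placeholder for your own batch script:

Code Block
# Start an interactive session on a compute node (1 core, 4 GB, 1 hour).
[]$ salloc --cpus-per-task=1 --mem=4G --time=01:00:00

# Or submit a batch script to run non-interactively; you get your prompt back
# immediately and Slurm runs the job when resources become available.
[]$ sbatch my_job.sh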

...

Info

Across the class, you’ll be using a number of different environments.

  • Running specific software applications.

  • Programming with R and using various R libraries.

  • Programming with Python and using various Python packages.

  • Environments built with Miniconda - a package/environment manager.

Since the cluster has to cater to everyone, we cannot provide a single desktop environment that provides everything.

Instead, we provide modules that a user loads to configure their environment for their particular needs within a session.

Loading a module configures various environment variables within that session.
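You can inspect exactly which environment variables a module would set before loading it. A minimal sketch using standard Lmod commands (gcc/13.2.0 is used here only as an example module name):

Code Block
# Show what the module would change (PATH, LD_LIBRARY_PATH, etc.)
# without actually loading it.
[]$ module show gcc/13.2.0

# Load it, then confirm the environment now finds that compiler.
[]$ module load gcc/13.2.0
[]$ gcc --version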

...

Info

We have environments available based on compilers, Singularity containers, Conda, and Linux binaries.
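To browse what is available, Lmod provides the standard avail and spider commands (output elided):

Code Block
# List modules that can be loaded in the current environment.
[]$ module avail

# Search the full module tree for anything matching a name, e.g. R.
[]$ module spider R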

...

Info

We have created two modules specifically for this class:

Info

R/4.4.0 + Library of > 480 R Packages (this is the original gcc/13.2.0 built library)

Code Block
[]$ ls /project/genomicdatasci/software/r/libraries/
abind              DBI                 ggnewscale         libcoin         RcppAnnoy             sourcetools
alabaster.base     dbplyr              ggplot2            lifecycle       RcppArmadillo         sp
alabaster.matrix   DelayedArray        ggplotify          limma           RcppEigen             spam
...

The new gcc/14.2.0 version can be found under: /project/genomicdatasci/software/r/libraries_gcc14/
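The class modules should configure this for you; purely to illustrate the mechanism, R can be pointed at a shared library directory via the standard R_LIBS_USER variable:

Code Block
# Point R at the shared class library for this session (bash syntax).
[]$ export R_LIBS_USER=/project/genomicdatasci/software/r/libraries_gcc14/
# Confirm R now searches that directory for packages.
[]$ Rscript -e '.libPaths()'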

Info

R/4.3.3 and R Package Pigengene

Note

Due to dependency hell issues, we could not install Pigengene within the R library collection.

There are therefore two separate environments, with different versions of R.

...

Code Block
[salexan5@mblog2 testdirectory]$ module purge
[salexan5@mblog2 testdirectory]$ module use /project/genomicdatasci/software/modules/
[salexan5@mblog2 testdirectory]$ module load pigengene/3.18
[salexan5@mblog2 testdirectory]$ R --version
R version 4.3.3 (2024-02-29) -- "Angel Food Cake"
...
# Start R
[salexan5@mblog2 testdirectory]$ R
R version 4.3.3 (2024-02-29) -- "Angel Food Cake"
...
> library(Pigengene)
Loading required package: graph
Loading required package: BiocGenerics
...

...

Using RStudio with R/Library of Packages for this Class

Note

Since we are using RStudio, which is a GUI-based IDE for R, you need to perform this from an Interactive Desktop, via OnDemand.

From the Interactive Desktop, open a terminal:
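As a rough sketch of the steps that follow (the module name is a placeholder; load the class module introduced above, which is assumed to put rstudio on your PATH):

Code Block
# In the Interactive Desktop terminal: make the class modules visible,
# load the class R environment, then launch RStudio.
[]$ module use /project/genomicdatasci/software/modules/
[]$ module load <class-module>   # placeholder; use the module for this class
[]$ rstudio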

...

Info

Remember: Since we are using RStudio, which is a GUI-based IDE for R, you need to perform this from an Interactive Desktop, via OnDemand.

From the Interactive Desktop, open a terminal:

...

Note

Typically this is because the resources you are requesting are not currently available.

Slurm will add your job to the queue, but it will be PENDING (PD) while it waits for the necessary resources to become available.

As soon as they are, your job will start, and its status will update to RUNNING (R).

Slurm manages this for you.
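You can watch this state change yourself using standard Slurm; the ST column shows PD while pending and R once running:

Code Block
# Show only your own jobs; check the ST (state) column.
[]$ squeue -u $USER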

...

Monitor your Job: Continued…

...