Introduction: The workshop session will provide a quick tour covering high-level concepts, commands and processes for using Linux and HPC on our Beartooth cluster. It will cover enough to allow an attendee to access the cluster and to perform analysis associated with this workshop.

Goals:

Introduce ARCC and what types of services we provide including “what is HPC?”
Define “what is a cluster”, and how is it made of partitions and compute nodes.
How to access and start using ARCC’s Beartooth cluster - using our SouthPass service.
How to start an interactive desktop and open a terminal to use Linux commands within.
Introduce the basics of Linux, the command-line, and how its File System looks on Beartooth.
Introduce Linux commands to allow navigation and file/folder manipulation.
Introduce Linux commands to allow text files to be searched and manipulated.
Introduce using a command-line text-editor and an alternative GUI based application.
How to setup a Linux environment to use R(/Python) and start RStudio, by loading modules.
How to start interactive sessions to run on a compute node, to allow computation, requesting appropriate resources.
How to put elements together to construct a workflow that can be submitted as a job to the cluster, which can then be monitored.

1 0 Getting Started
2 00 Introduction and Setting the Scope:
3 01 About UW ARCC and HPC
4 02 Using Southpass to access the Beartooth HPC Cluster
5 03 Using Linux and the Command Line
6 *** Break ***
7 04 Using Linux to Search/Parse Text Files
8 05 Lets start using R(/Python) and RStudio
9 06 Create a basic workflow and submitting jobs.
10 07 Summary and Next Steps

0 Getting Started

Users may log in with their BYODs (do you have a computer with you to follow along with the workshop?)
- Log into UWYO wifi if you can. (Non-UW users will be unable to).
Logging in:
- If you have a UWYO username and password: UW Users may test their HPC access by opening a browser and then going to the following URL: https://southpass.arcc.uwyo.edu.
- Standard wyologin page will be presented. Log in with your
- UWYO username and password.
- If you do not have a UWYO username and password: Come see me for a Yubikey and directions allow you to access the Beartooth HPC cluster if you do not have a UW account.

00 Introduction and Setting the Scope:

The roadmap to becoming a proficient HPC user can be long, complicated, and varies depending on the user. There are a large number of concepts to cover. Some of these concepts are included in today’s training but given time constraints, it’s impossible to get to all of them. This workshop session introduces key high-level concepts, and follows a very hands-on demonstration approach, for you to follow.

Our training will help provide the foundation necessary for you to use Beartooth cluster, specifically to perform some of the exercises later in this workshop over the week.

Because of our limited time this morning, please submit any questions to the slack channel for this workshop and workshop instructors can address them as they are available.

More extensive and in-depth information and walkthroughs are available on our wiki and under workshops/tutorials. You are welcome to dive into those in your own time. Content within them should provide you with a lot of the foundational concepts you would need to be familiar with to become a proficient HPC user.

01 About UW ARCC and HPC

Goals:

Describe ARCC’s role at UW.
Provide resources for ARCC Researchers to seek help.
Introduce staff members, including those available throughout the workshop.
Introduce the concept of an HPC cluster, it’s architecture and when to use one.
Introduce the Beartooth HPC architecture, hardware, and partitions.

About ARCC and how to reach us

Based on: Wiki Front Page: About ARCC

ARCC Wiki

In short, we maintain internally housed scientific resources including more than one HPC Cluster, data storage, and several research computing servers and resources.
We are here to assist UW researchers like yourself with your research computing needs.

3 ARCC Staff Members will be available through the course of the workshop if you need help using Beartooth:

ARCC End User Support

ARCC End User Support
Simon Alexander	HPC & Research Software Manager
Dylan Perkins	Research Computing Facilitator
Lisa Stafford	Research Computing Facilitator

What is HPC

HPC stands for High Performance Computing and is one of UW ARCC’s core services. HPC is the practice of aggregating computing power in a way that delivers a much higher performance than one could get out of a typical desktop or workstation. HPC is commonly used to solve large problems, and has some common use cases:

Performing computation-intensive analyses on large datasets: MB/GB/TB in a single or many files, computations requiring RAM in excess of what is available on a single workstation, or analysis performed across multiple CPUs (cores) or GPUs.
Performing long, large-scale simulations: Hours, days, weeks, spread across multiple nodes each using multiple cores.
Running repetitive tasks in parallel: 10s/100s/1000s of small short tasks.

Users log in from their clients (desktops, laptops, workstations) into a login node.
In an HPC Cluster, each compute node can be thought of as it’s own desktop, but the hardware resources of the cluster are available collectively as a single system.
Users may request specific allocations of resources available on the cluster - beyond that of a single node.
Allocated resources may include CPUs (Cores), Nodes, RAM/Memory, GPUs, etc.

Users log in from their clients (desktops, laptops, workstations) into a login node.
In an HPC Cluster, each compute node can be thought of as it’s own desktop, but the hardware resources of the cluster are available collectively as a single system.
Users may request specific allocations of resources available on the cluster - beyond that of a single node.
Allocated resources may include CPUs (Cores), Nodes, RAM/Memory, GPUs, etc.

What is a Compute Node?

We typically have multiple users independently running jobs concurrently across compute nodes.
Resources are shared, but do not interfere with any one else’s resources.
- i.e. you have your own cores, your own block of memory.
If someone else’s job fails it does NOT affect yours.
Example: The two GPU compute nodes part of this reservation each have 8 GPU devices. We can have different, individual jobs run on each of these compute nodes, without effecting each other.

Homogeneous vs Heterogeneous HPCs

There are 2 types of HPC systems:

Homogeneous: All compute nodes in the system share the same architecture. CPU, memory, and storage are the same across the system. (Ex: NWSC’s Derecho)
Heterogeneous: The compute nodes in the system can vary architecturally with respect to CPU, memory, even storage, and whether they have GPUs or not. Usually, the nodes are grouped in partitions. Beartooth is a heterogeneous cluster and our partitions are described on the Beartooth Hardware Summary Table on our ARCC Wiki.

Beartooth Cluster: Heterogeneous: Partitions

Beartooth Hardware and Partitions

See Beartooth Hardware Summary Table on the ARCC Wiki.

Reservation

A reservation can be considered a temporary partition.

It is a set of compute nodes reserved for a period of time for a set of users/projects, who get priority use.

For this workshop we will be using the following: biocompworkshop:

ReservationName = biocompworkshop
StartTime = 06.09-09:00:00
EndTime   = 06.17-17:00:00 
Duration  = 8-08:00:00
Nodes     = mdgx01,t[402-421],tdgx01 NodeCnt=22 CoreCnt=720
Users     = Groups=biocompworkshop

Important Dates:

After the 17th of June this reservation will stop and you will drop down to general usage if you have another Beartooth project.
The project itself will be removed after the 24th of June. You will not be able to use/access it. Anything you require please copy out of the project.

02 Using Southpass to access the Beartooth HPC Cluster

Southpass is our Open OnDemand resource allowing users to access Beartooth over a web-based portal. Learn more about Southpass here.

Goals:

Demonstrate how users log into Southpass
Demonstrate requesting and using a XFCE Desktop Session
Introduce the Linux File System and how it compares to common workstation environments
- Introduce HPC specific directories and how they’re used
- Introduce Beartooth specific directories and how they’re used
Demonstrate how to access files using the Beartooth File Browsing Application
Demonstrate the use of emacs, available as a GUI based text-editor

Based on: Web Access to Beartooth: SouthPass

Log in and Access the Cluster

Login to Southpass

If you haven’t yet:

Open a browser of your choice.
Go to https://southpass.arcc.uwyo.edu
Log in with your UWYO username and password, or the username and password to the training account you’ve been provided.
Once in you will be presented with the Southpass Dashboard:

Using Southpass

Interactive Applications in Southpass are requested by filling out a webform to specify hardware requirements while you use the application. Other applications can be accessed without filling out a webform: Job Composer (To create batch scripts) Active Jobs (To view your active jobs) Home Directory (File Explorer/Upload/Download) Beartooth System Status (View cluster status)

Interactive Applications in Southpass are requested by filling out a webform to specify hardware requirements while you use the application. Other applications can be accessed without filling out a webform: Job Composer (To create batch scripts) Active Jobs (To view your active jobs) Home Directory (File Explorer/Upload/Download) Beartooth System Status (View cluster status)

Exercise: Beartooth XFCE Desktop

Requests are made through a webform in which you specifically request certain hardware or software to use on Beartooth.

Click on Beartooth XFCE Desktop
You will be presented with a form asking for specific information.
1. Project/Account: specifies the project you have access to on the HPC Cluster
2. Reservation: not usually used for our general cluster use, but set up to access specific hardware that has been reserved for this workshop.
3. Number of Hours: How long you plan to use the Remote Desktop Connection to the Beartooth HPC.
4. Desktop Configuration: How many CPUs and Memory you require to perform your computations within this remote desktop session.
5. GPU Type: GPU Hardware you want to access, specific to your use case. This may be set to “None - No GPU" if your computations do not require a GPU. Note: you can select DGX GPUs (Listed as V100s from the GPU Type drop down)
You should see an interactive session starting. When it’s ready, it will turn green.
1. Note the Host: field. Your Interactive session has been allocated to a specific host on the cluster. This is the node you are working on when you’re using your remote desktop session.
2. Click Launch Beartooth XFCE Desktop to open your Remote Desktop session
You should now see a Linux Desktop in your browser window
1. Beartooth runs Red Hat Enterprise Linux. If you’ve worked on a Red Hat System, it will probably look familiar.
2. If not, hopefully it looks similar enough to a Windows or Mac Graphical OS Interface.
  1. Apps dock at the bottom (Similar to Mac OS, or Pinned apps in taskbar on Windows OS)
  2. Desktop icons provide links to specific folder locations and files, like Mac and PC).

Note: While we use a webform to request Beartooth resources on Southpass, later training will show how resource configurations can be requested through command line via salloc or sbatch commands.

Structure of the Linux File System and HPC Directories

Linux File Structure

We are now remote logged into a Linux Desktop.

To take a look at the top level of the file structure, click on “Filesystem”.

This is specific to the Beartooth HPC but most Linux environments will look very similar

A screenshot of a computer

Description automatically generated

Linux Operating Systems (Generally)

Compare and Contrast: Linux, HPC Specific, Beartooth Specific

Based on: Beartooth Filesystem

HPC Specific Folders:

/home (Common across most shared HPC Resources)
1. What is it for? Similar to on a PC, and Macintosh HD → Users on a Mac
2. Permissions: It should have files specific to you, personally, as the HPC user. By default no one else has access to your files in your home.
3. Director Path: Every HPC user on Beartooth has a folder in on Beartooth under /home/<your_username> or $HOME
4. Default Quota: 25GB
/project (Common across most shared HPC Resources)
1. What is it for? Think of it as a shared folder for you and all your project members. Similar to /glade/campaign on NCAR HPC.
2. Permissions: All project members have access to the folder. By default, all project members can read any files or folders within, and can write in the main project directory.
3. Directory path: get to it at /project/biocompworkshop/
4. Subfolders in /project/biocompworkshop/ for each user are added to project when a user gets added to the project, but only that user can write to their folder.
5. Default Quota: 1TB which is for the project folder itself and includes all it’s contents and subfolders.
/gscratch (Scratch folder, common across most HPC resources but sometimes just called "scratch")
1. What is it for? It’s “scratch space”, so it’s storage dedicated for you to store temporary data you need access to.
2. Permissions: Like /home, contents is specific to you, personally, as the HPC user. By default no one else has access to your files in your /gscratch.
3. Director Path: Every HPC user on Beartooth has a gscratch directory in Beartooth under /gscratch/<your_username> or $SCRATCH
4. Default Quota: 5TB
  1. Don’t store anything in /gscratch that you need or don't have backed up elsewhere. It's not meant to store anything long term.
  2. Everyone’s /gscratch directory is subject to ARCC's purge policy.

Beartooth Specific

/apps (Specific to ARCC HPC) is like on Windows or on a Mac.
1. Where applications are installed and where modules are loaded from. (More on that later).
/alcova (Specific to ARCC HPC).
1. Additional research storage for research projects that may not require HPC but is accessible from beartooth.
2. You won’t have access to it unless you were added to an alcova project by the PI.

Exercise: File Browsing in Southpass GUI

Users can access their files using the south pass file browser app.

Demonstration opening emacs GUI based text editor

Once you’re in a XFCE Desktop Session:

Open the Applications Menu in the top right corner of your desktop
Choose Run Program
An Application Finder window will pop up. In the text box, type in: emacs
Click Launch
This will open a new window with the emacs text editor
Users can click on the File menu and select Visit New File to create a new file, or Open file to continue working on one they’ve already started.

03 Using Linux and the Command Line

Goals:

Introduce the shell terminal and command line interface
- Demonstrate starting a Beartooth SSH shell using Southpass
- Demonstrate information provided in a command prompt
Introduce Policy for HPC Login Nodes
Demonstrate how to navigate the file system to create and remove files and folders using command line interface (CLI)
- mkdir, cd, ls, mv, cp
Demonstrate the use of man, --help and identify when these should be used
Demonstrate using a command-line text editor, vi

Based on: The Command Line Interface

Exercise: Shell Terminal Introducing Command Line

Click the following Icon on the Beartooth Dashboard
This opens up a Beartooth SSH session in a web-based terminal:
Login will display:
1. Cluster you’ve logged into
2. How to get help
3. Important message(s) of the day
4. A printout of arccquota
Anatomy of Command Line Prompt:
1. Who (am I?):
2. What (system am I talking to/working on?):
3. Where (am I on the system?):

What am I Using?

Remember:

The Beartooth Shell Access opens up a new browser tab that is running on a login node. Do not run any computation on these.
[<username>@blog2 ~]$
The SouthPass Interactive Desktop (terminal) is already running on a compute node.
[<username>@t402 ~]$

Login Node Policy

As a courtesy to your colleagues, please do not run the following on any login nodes:

Anything compute-intensive (tasks using significant computational/hardware resources - Ex: using 100% cluster CPU)
Long running tasks (over 10 min)
Any collection of a large # of tasks resulting in a similar hardware footprint to actions mentioned previously.
Not sure? Use salloc to be on the safe side. This will be covered later.
Ex: salloc –-account=arccanetrain -–time 40:00
See more on ARCC’s Login Node Policy here

Demonstrating how to get help in CLI

man - Short for the manual page. This is an interface to view the reference manual for the application or command.
man pages are only available on the login nodes.

[arcc-t10@blog2 ~]$ man pwd
NAME
       pwd - print name of current/working directory
SYNOPSIS
       pwd [OPTION]...
DESCRIPTION
       Print the full filename of the current working directory.
       -L, --logical
              use PWD from environment, even if it contains symlinks
       -P, --physical
              avoid all symlinks
       --help display this help and exit
       --version
              output version information and exit
       If no option is specified, -P is assumed.
       NOTE:  your  shell  may have its own version of pwd, which usually supersedes the version described here.  Please refer to your shell's documentation
       for details about the options it supports.

--help - a built-in command in shell. It accepts a text string as the command line argument and searches the supplied string in the shell's documents.

[arcc-t10@blog1 ~]$ cp --help
Usage: cp [OPTION]... [-T] SOURCE DEST
  or:  cp [OPTION]... SOURCE... DIRECTORY
  or:  cp [OPTION]... -t DIRECTORY SOURCE...
Copy SOURCE to DEST, or multiple SOURCE(s) to DIRECTORY.

Demonstrating file navigation in CLI

File Navigation demonstrating the use of:

pwd (Print Working Directory)
ls (“List” lists information about directories and any type of files in the working directory)
ls flags
- -l (tells the mode, # of links, owner, group, size (in bytes), and time of last modification for each file)
- -a (Lists all entries in the directory, including the entries that begin with a . which are hidden)
cd (Change Directory)
cd .. (Change Directory - up one level)

[arcc-t10@blog2 ~]$ pwd
/home/arcc-t10
[arcc-t10@blog2 ~]$ ls
Desktop  Documents  Downloads  ondemand  R
[arcc-t10@blog2 ~]$ cd /project/biocompworkshop
[arcc-t10@blog2 biocompworkshop]$ pwd
/project/biocompworkshop
[arcc-t10@blog2 biocompworkshop]$ cd arcc-t10
[arcc-t10@blog2 arcc-t10]$ ls -la
total 2.0K
drwxr-sr-x  2 arcc-t10 biocompworkshop 4.0K May 23 11:05 .
drwxrws--- 80 root     biocompworkshop 4.0K Jun  4 14:39 ..
[arcc-t10@blog2 arcc-t10]$ pwd
/project/biocompworkshop/arcc-t10
[arcc-t10@blog2 arcc-t10]$ cd ..
[arcc-t10@blog2 biocompworkshop]$ pwd
/project/biocompworkshop

Demonstrating how to create and remove files and folders using CLI

Creating, moving and copying files and folders:

touch (Used to create a file without content. The file created using the touch command is empty)
mkdir (Make Directory - to create an empty directory)
mv (Move - moves a file or directory from one location to another)
cd.. (Change Directory - up one level)
cp (Copy - copies a file or directory from one location to another)
- -r flag (Recursive)
~ (Alias for /home/user)
rm (Remove - removes a file or if used with -r, removes directory and recursively removes files in directory)

[arcc-t10@blog2 arcc-t10]$ touch testfile
[arcc-t10@blog2 arcc-t10]$ mkdir testdirectory
[arcc-t10@blog2 arcc-t10]$ ls
testdirectory  testfile
[arcc-t10@blog2 arcc-t10]$ mv testfile testdirectory
[arcc-t10@blog2 arcc-t10]$ cd testdirectory
[arcc-t10@blog2 testdirectory]$ ls
testfile
[arcc-t10@blog2 testdirectory]$ cd .. 
[arcc-t10@blog2 arcc-t10]$ cp -r testdirectory ~
[arcc-t10@blog2 arcc-t10]$ cd ~
[arcc-t10@blog2 ~]$ ls
Desktop  Documents  Downloads  ondemand  R  testdirectory 
[arcc-t10@blog2 ~]$ cd testdirectory
[arcc-t10@blog2 ~]$ ls
testfile
[arcc-t10@blog2 ~]$ rm testfile
[arcc-t10@blog2 ~]$ ls

Text Editor Cheatsheets

Vi/Vim Cheatsheet	Nano Cheatsheet

Vi/Vim Cheatsheet	Nano Cheatsheet
https://phoenixnap.com/kb/vim-commands-cheat-sheet	https://geek-university.com/nano-text-editor/

Note: On Beartooth, vi maps to vim i.e. if you open vi, you're actually starting vim.

Demonstrating vi/vim text editor

VI/Vim is one of several text editors available for Linux Command Line. (vi filename or vim filename)

i - to start insert mode (allows you to enter text)
<esc> key - to exit out of insert mode
dd - when not in insert mode, to delete a whole line
:q - outside of insert mode to quit
:wq - outside of insert mode to write the contents to the file, and then quit

[arcc-t10@blog2 arcc-t10]$ vi testfile

stuff and things
~                                                                                                                                
~                                                                                                                                
~                                                                                                                                
~                                                                                                                                
:wq            

[arcc-t10@blog2 arcc-t10]$ cat testfile
stuff and things

Try the vim tutor

Vim Tutor is a walkthrough for new users to get used to Vim.

Run vimtutor in the command line to begin learning interactively.

[arc-t10@blog2 ~]$ vimtutor
===============================================================================
=    W e l c o m e   t o   t h e   V I M   T u t o r    -    Version 1.7      =
=============================================================================== 
     Vim is a very powerful editor that has many commands, too many to 
     explain in a tutor such as this. This tutor is designed to describe 
     enough of the commands that you will be able to easily use Vim as 
     an all-purpose editor. 
     ...

* Break *

04 Using Linux to Search/Parse Text Files

Goals:

Using the command-line, demonstrate how to search and parse text files.
Show how export can be used to setup environment variables and echo to see what values they store.
Linux Commands:
- find
- cat / head / tail / grep
- sort / uniq
- Pipe | output from one command to the input of another, and redirect to a file using >, >>.

Based on: Intro to Linux Command-Line: View Find and Search Files

Your Environment: Echo and Export

# View the settings configured within your environment.
[]$ env
# View a particular environment variable
# PATH: Where you environment will look for execuatables/commands.
[]$ echo $PATH
# Create an environment variable that points to the workshop data folder.
[] export WS_DATA=/project/biocompworkshop/Data_Vault
# Check it has been correctly set.
[]$ echo $WS_DATA
/project/biocompworkshop/Data_Vault

Use Our Environment Variable

# Lets use it.
# Navigate to your home.
[]$ cd
# Navigate to the workshop data folder.
[~]$ cd $WS_DATA
[]$ pwd
/project/biocompworkshop/Data_Vault
# These are only available within this particular terminal/session.
# Once you close this terminal, they are gone.
# They are not available across other terminals.
# Advanced: To make 'permanent' you can update your ~/.bashrc

Search for a File

Based on: Search for a File

[]$ cd /project/biocompworkshop/salexan5/test_data
# Find a file using its full name.
[]$ find . -name "epithelial_overrep_gene_list.tsv"
./scRNASeq_Results/epithelial_overrep_gene_list.tsv
# Remember, Linux is case sensitive
# Returned to command prompt with no output.
[]$ find . -name "Epithelial_overrep_gene_list.tsv"
[]$
# Use case-insensitive option:
[]$ find . -iname "Epithelial_overrep_gene_list.tsv"
./test_data/scRNASeq_Results/epithelial_overrep_gene_list.tsv

Use Wildcards *

# Use Wildcards:
[]$ find . -name "epithelial*"
./scRNASeq_Results/epithelial_overrep_gene_list.tsv
./scRNASeq_Results/epithelial_de_gsea.tsv
[]$ find . -name "*.tsv"
./Grch38/Hisat2/exons.tsv
./Grch38/Hisat2/splicesites.tsv
./DE_Results/DE_sig_genes_DESeq2.tsv
./DE_Results/DE_all_genes_DESeq2.tsv
./scRNASeq_Results/epithelial_overrep_gene_list.tsv
./scRNASeq_Results/epithelial_de_gsea.tsv
./Pathway_Results/fc.go.cc.p.down.tsv
./Pathway_Results/fc.go.cc.p.up.tsv
./BatchCorrection_Results/DE_genes_uhr_vs_hbr_corrected.tsv

View the Contents of a File

Based on: View/Search a File

[]$ cd /project/biocompworkshop/salexan5/test_data/scRNASeq_Results
# View the contents of a TEXT based file:
# Prints everything.
[]$ cat epithelial_overrep_gene_list.tsv
# View 'page-by-page'
# Press 'q' to exit and return to the command-line prompt.
[]$ more epithelial_overrep_gene_list.tsv

View the Start and End of a File

# View the first 10 items.
[]$ head epithelial_overrep_gene_list.tsv
# View the first 15 items.
[]$ head -n 15 epithelial_overrep_gene_list.tsv
# View the last 10 items.
[]$ tail epithelial_overrep_gene_list.tsv
# View the last 15 items.
[]$ tail -n 15 epithelial_overrep_gene_list.tsv
# On a login node, remember you can use 'man head' 
# or tail --help to look up all the options for a command.

Search the Contents of a Text File

[]$ cd /project/biocompworkshop/salexan5/test_data/scRNASeq_Results
# Find rows containing "Zfp1"
# Remember: Linux is case-sensitive
# Searching for all lower case: zfp1
[]$ grep zfp1 epithelial_overrep_gene_list.tsv
[]$ 
# Searching with correct upper/lower case combination: Zfp1
# Returns all the lines that contain this piece of text.
[]$ grep Zfp1 epithelial_overrep_gene_list.tsv
Zfp106
Zfp146
Zfp185
Zfp1

Grep-ing with Case-Insensitive and Line Numbers

# Grep ignoring case.
[]$ grep -i zfp1 epithelial_overrep_gene_list.tsv
Zfp106
Zfp146
Zfp185
Zfp1
# What line numbers are the elements on?
[]$ grep -n -i zfp1 epithelial_overrep_gene_list.tsv
696:Zfp106
1998:Zfp146
2041:Zfp185
2113:Zfp1

Pipe: Count, Sort

Based on: Output Redirection and Pipes

[]$ cd /project/biocompworkshop/salexan5/test_data/scRNASeq_Results
# Pipe: direct the output of one command to the input of another.
# Count how many lines/rows are in a file.
[]$ cat epithelial_overrep_gene_list.tsv | wc -l
2254
# Alphabetically soft a file:
[] sort epithelial_overrep_gene_list.tsv
...
Zswim4
Zyx
Zzz3
Zzz3
# Count lines after sorting.
[]$ sort epithelial_overrep_gene_list.tsv | wc -l
2254

Uniq

# Find and list the unique elements within a file.
# You need to sort your elements first.
[] sort epithelial_overrep_gene_list.tsv | uniq
...
Zswim4
Zyx
Zzz3
# You can pipe multiple commands together.
# Find, list and count the unique elements within a file:
[] sort epithelial_overrep_gene_list.tsv | uniq  | wc -l
2253

Redirect Output into a File

# Redirect an output into a file.
# > : Over writes a file : >> : Appends to a file.
[] sort epithelial_overrep_gene_list.tsv > sorted.tsv
# This will fail for anyone else.
-bash: sorted.tsv: Permission denied
# You do not have write permission within this folder.
[]$ cd ..
[]$ ls -al
drwxr-sr-x  2 salexan5 biocompworkshop 4096 May 31 13:50 scRNASeq_Results
# Redirect to a location where you do have write permission - you home folder.
[]$ cd scRNASeq_Results/
[]$ sort epithelial_overrep_gene_list.tsv > ~/sorted.tsv
[]$ ls ~
... sorted.tsv ...
[]$ head ~/sorted.tsv

05 Lets start using R(/Python) and RStudio

Goals:

Using a terminal (via an Interactive Desktop), demonstrate how to load modules to setup an environment that uses R/RStudio and how to start the GUI.
Mention how the module system will be used, in later workshops, to load other software applications.
(Indicate how this relates to setting up environment variables behind the scenes.)
Further explain the differences between using a login node that requires an salloc to access a compute node, and that you're already running on a compute node (with limited resources) via an interactive desktop.
- Confirm arguments for partition, gres/gpu, reservation.
- Note that can confirm a GPU device is available by running nvidia-smi -L from the command-line.
Show how the resources from the Interactive Desktop configuration start mapping to those used by salloc (including defining reservations, and maybe partitions).

Based on Intro to Accessing the Cluster and the Module System

Open a Terminal

You can access a Linux terminal from SouthPass by:

Opening up an Interactive Desktop (reservation is biocompworkshop) and opening a terminal.
- Running on a compute node: Command prompt: [<username>@t402 ~]$
- The reservation is only available for this workshop: StartTime=06.09-09:00:00 EndTime=06.17-17:00:00 Duration=8-08:00:0
- Only select what you require:
  - How many hours? Your session will NOT run any longer that the amount of hours you requested.
  - Some Desktop Configurations will NOT work with some GPU Types.
  - Do you actually need a GPU?
    - Unless you software/library/package has been developed to utilize a GPU, simply selected one will NOT make any difference - this won’t make you code magically run faster.
Selecting a Beartooth Shell Access which opens up a new browser tab.
- Running on the login node: [<username>@blog1/2 ~]$

To run any GUI application, you must use SouthPass and an Interactive Desktop.

Setting Up a Session Environment

Across the week, you’ll be using a number of different environments.

Running specific software applications.
Programming with R and using various R libraries.
Programming with Python and using various Python packages.
Environments build with Miniconda - a package/environment manager.

Since the cluster has to cater for everyone we can not provide a simple desktop environment that provides everything.

Instead we provide modules that a user will load that configures their environment for their particular needs within a session.

Loading a module configures various environment variables within that Session.

What is Available?

We have environments available based on compilers, Singularity containers, Conda, Linux Binaries

[]$ module avail
[]$ gcc --version
gcc (GCC) 8.5.0 20210514 (Red Hat 8.5.0-20)
[]$ which gcc
/usr/bin/gcc
[]$ echo $PATH
/home/salexan5/bin:/apps/s/projects/core_hour_usage/bin:/apps/s/arcc/1.0/bin:/apps/s/slurm/latest/bin:
/apps/s/turbovnc/turbovnc-2.2.6/bin:/home/salexan5/bin:/usr/local/bin:/usr/bin:/usr/local/sbin:
/usr/sbin:/home/salexan5/.local/bin:/home/salexan5/bin:/home/salexan5/.local/bin:/home/salexan5/bin
[]$ module spider rstudio
----------------------------------------------------------------------------
  rstudio: rstudio/2023.9.0

Is Python and/or R available?

# An old version of Python is available on the System.
# Systems are updated! Do NOT rely on them for you environment regards versions/reproducability.
[]$ which python
/usr/bin/python
[]$ python --version
Python 3.8.17
# R is NOT available.
[]$ which R
/usr/bin/which: no R in (/home/salexan5/bin:/apps/s/projects/core_hour_usage/bin:
/apps/s/arcc/1.0/bin:/apps/s/slurm/latest/bin:/apps/s/turbovnc/turbovnc-2.2.6/bin:
/home/salexan5/bin:/usr/local/bin:/usr/bin:/usr/local/sbin:/usr/sbin:/home/salexan5/.local/bin:
/home/salexan5/bin:/home/salexan5/.local/bin:/home/salexan5/bin)
# Nothing returned.
[]$ echo $R_HOME
[]$

Load a Compiler

# What's avail for a compiler?
[]$ module load gcc/12.2.0
[]$ module avail
# Notice there are a lot more applications available under this loaded compiler.
[]$ gcc --version
gcc (Spack GCC) 12.2.0
[]$ which gcc
/apps/u/spack/gcc/8.5.0/gcc/12.2.0-orvuxnl/bin/gcc
# Notice that the environment variables have been extended.
[]$ echo $PATH
/apps/u/spack/gcc/8.5.0/gcc/12.2.0-orvuxnl/bin:/apps/u/spack/gcc/12.2.0/zstd/1.5.2-5gdwnny/bin:
/home/salexan5/bin:/apps/s/projects/core_hour_usage/bin:/apps/s/arcc/1.0/bin:
/apps/s/slurm/latest/bin:/apps/s/turbovnc/turbovnc-2.2.6/bin:/home/salexan5/bin:
/usr/local/bin:/usr/bin:/usr/local/sbin:/usr/sbin:/home/salexan5/.local/bin:
/home/salexan5/bin:/home/salexan5/.local/bin:/home/salexan5/bin
# Notice R is now available and newer versions of Python are available under gcc/12.2.0

Load a Newer Version of Python

[]$ module load python/3.10.6
[]$ which python
/apps/u/spack/gcc/12.2.0/python/3.10.6-7ginwsd/bin/python
[]$ python --version
Python 3.10.6

Typically Loading R

[]$ module load r/4.4.0
# Notice the environment variable has now been set.
[]$ echo $R_HOME
/apps/u/spack/gcc/12.2.0/r/4.4.0-7i7afpk/rlib/R
[]$ which R
/apps/u/spack/gcc/12.2.0/r/4.4.0-7i7afpk/bin/R
[]$ R --version
R version 4.4.0 (2024-04-24) -- "Puppy Cup"

You then perform: install.packages and manage these yourself.

Same with Python: You perform the pip install to install which ever Python packages you require.

Load R/RStudio for this Workshop

You can use module purge to reset your environment, or start a new terminal

[]$ module use /project/biocompworkshop/software/modules/
[]$ module avail
-------------------------------------------------- /project/biocompworkshop/software/modules --------------------------------------------------
   bam-readcount/0.8.0    fastp/0.23.4    r/4.4.0-biocomp    regtools/1.0.0    rseqc_hawsh/1.0.0    subread/2.0.6    tophat/2.1.1
[]$ module load r/4.4.0-biocomp
[]$ module load rstudio/2023.9.0
[]$ rstudio

Configure your R Environment for this Workshop

# Within the R Terminal:
> library(Seurat)
Error in library(Suerat) : there is no package called 'Suerat'
> .libPaths(c('/project/biocompworkshop/software/r/libraries/4.4.0', '/apps/u/spack/gcc/12.2.0/r/4.4.0-7i7afpk/rlib/R/library'))
# Notice how the list of System Library packages listed in RStudio has changed.
> library(Seurat)
Loading required package: SeuratObject
Loading required package: sp
Attaching package: 'SeuratObject'
The following objects are masked from 'package:base':
    intersect, t

To use the pre-installed libraries within an R script you will need to add the .libPaths(...) line to the start of your scripts.

Request Interactive Session (Compute Node) from a Login Node

# Short form:
# Notice we can request more memory.
[@blog1 ~]$ salloc -A biocompworkshop -t 4:00:00 --mem=4G -c 1 --reservation=biocompworkshop 
# Long form
# MUST define account/A and time/t
[@blog1 ~]$ salloc --account=biocompworkshop --time=4:00:00 --mem=4G --cpus-per-task=1 --reservation=biocompworkshop
salloc: Granted job allocation 16053847
salloc: Nodes t402 are ready for job
# Notice how the node name has changed in the command prompt.
[@t402 ~]$
[@t402 ~]$ exit
exit
salloc: Relinquishing job allocation 16053847
# Returns to the login node. Lots more options.
[@blog1 ~]$ salloc --help

# Short form:
# Notice we can request more memory.
[@blog1 ~]$ salloc -A biocompworkshop -t 4:00:00 --mem=4G -c 1 --reservation=biocompworkshop 
# Long form
# MUST define account/A and time/t
[@blog1 ~]$ salloc --account=biocompworkshop --time=4:00:00 --mem=4G --cpus-per-task=1 --reservation=biocompworkshop
salloc: Granted job allocation 16053847
salloc: Nodes t402 are ready for job
# Notice how the node name has changed in the command prompt.
[@t402 ~]$
[@t402 ~]$ exit
exit
salloc: Relinquishing job allocation 16053847
# Returns to the login node. Lots more options.
[@blog1 ~]$ salloc --help

Request Interactive Session (Compute Node) with a GPU

[@blog1 ~]$ salloc -A biocompworkshop -t 8:00:00 --mem=8G -c 2 -p dgx --gres=gpu:1 --reservation=biocompworkshop
salloc: Granted job allocation 16053855
salloc: Nodes mdgx01 are ready for job
[@mdgx01 ~]$ / [@tdgx01 ~]$
# Check you have a GPU allocated:
[@mdgx01 ~]$ nvidia-smi -L
GPU 0: Tesla V100-SXM2-16GB (UUID: GPU-454b58ba-2ea3-c2d3-ca97-bfac4265c0b1)
# You get what you ask for!
# No GPU requested.
[@blog1 ~]$ salloc -A biocompworkshop -t 8:00:00 --mem=8G -c 2 --partition=dgx
salloc: Granted job allocation 16053857
salloc: Nodes mdgx01 are ready for job
[@mdgx01 ~]$ nvidia-smi -L
No devices found.

[@blog1 ~]$ salloc -A biocompworkshop -t 8:00:00 --mem=8G -c 2 -p dgx --gres=gpu:1 --reservation=biocompworkshop
salloc: Granted job allocation 16053855
salloc: Nodes mdgx01 are ready for job
[@mdgx01 ~]$ / [@tdgx01 ~]$
# Check you have a GPU allocated:
[@mdgx01 ~]$ nvidia-smi -L
GPU 0: Tesla V100-SXM2-16GB (UUID: GPU-454b58ba-2ea3-c2d3-ca97-bfac4265c0b1)
# You get what you ask for!
# No GPU requested.
[@blog1 ~]$ salloc -A biocompworkshop -t 8:00:00 --mem=8G -c 2 --partition=dgx
salloc: Granted job allocation 16053857
salloc: Nodes mdgx01 are ready for job
[@mdgx01 ~]$ nvidia-smi -L
No devices found.

Request what you Need!

# You're telling this command to use 4 threads - 4 cores
[1]$ hisat2-build -p 4 ...
[@blog1]$ salloc --account=biocompworkshop --time=30:00 --reservation=biocompworkshop 
# Setup the Environment
[@t402]$ hisat2-build -p 4 --ss $INDEX/splicesites.tsv --exon $INDEX/exons.tsv $REFERENCE/chr22_with_ERCC92.fa $INDEX/chr22
...
Joining reference sequences
  Time to join reference sequences: 00:00:00
  Time to read SNPs and splice sites: 00:00:00
Killed
[@blog1]$ salloc --account=biocompworkshop --time=30:00 --reservation=biocompworkshop -c 4
# Setup the Environment
[@t402]$ hisat2-build -p 4 --ss $INDEX/splicesites.tsv --exon $INDEX/exons.tsv $REFERENCE/chr22_with_ERCC92.fa $INDEX/chr22
...
Total time for call to driver() for forward index: 00:01:27

06 Create a basic workflow and submitting jobs.

Since RStudio is a GUI, demonstrate moving from running a script within RStudio to running using Rscript from the command-line.
Put the various elements of loading modules, moving into a folder, running an R file, that make up a basic workflow, into a script that can be submitted using sbatch to Slurm.
Map the salloc arguments to #SBATCH.
Show how to monitor a jobs using squeue as well as using the email related Slurm options.
Show how to request the DGX nodes and defining gres to specifically request a GPU.
Provide a basic template.

Based on:

Why Submit a Job

A single computation can take, minutes, hours, days, weeks, months. An interactive session quickly becomes impractical.

Submit a job to the Slurm queue - Slurm manages everything for you.

Everything you do on the command-line, working out your workflow, is put into a script.

Workflow:

What resources you require? (Interactive desktop configuration, salloc options)
What modules are loaded.
Which folder you’re running you computation within. Where the data is stored. Where you want the results.
Command-line calls being called.
Software applications being run.

Submit a Job to the Cluster

Convert salloc command-line options to an sbatch related script.

Options have defaults if not defined.

# salloc
[@blog1 ~]$ salloc -A biocompworkshop -t 8:00:00 --mem=8G -c 2 -p dgx --gres=gpu:1 --reservation=biocompworkshop
# sbatch
# Options within your bash script.
#SBATCH --account=biocompworkshop       # Account. MUST be defined.
#SBATCH --time=8:00:00                  # Time.    MUST be defined.    
#SBATCH --mem=8G                        # Memory.
##SBATCH --mem-per-cpu=1G               # Commented out. Default is 1G if no memory values defined.
#SBATCH --cpus-per-task=2               # CPUs per Task - default is 1 if not defined.
#SBATCH --partition=dgx                 # Partition - If not defined, Slurm will select.
#SBATCH --gres=gpu:1                    # Generic Resources
#SBATCH --reservation=biocompworkshop   # Reservation

Additional `sbatch` Options

#SBATCH --job-name=<job-name>
#SBATCH --nodes=<#nodes>                # Default is 1 if not defined.                
#SBATCH --ntasks-per-node=<#tasks/node> # Default is 1 if not defined.
#SBATCH --mail-type=ALL
#SBATCH --mail-user=<email-addr>
#SBATCH --output=<filename>_%A.out      # Postfix the job id to <filename>
                                        # If not defined: slurm-<job-id>.out

Example Script: What Goes into It?

The bash script can contain:

Linux/bash commands and script.
Module loads.
Application command-line calls.

Lets consider our R workflow. I have:

R scripts copied into my /gscratch folder.
R related modules to load.
R scripts to run.
to track the time the job starts and ends.

Example Script: Running R Script

#!/bin/bash
# Comment: The first line 'shebang' is followed by the interpreter or the command that should be used to execute the script.
#SBATCH --job-name=r_job
#SBATCH --account=biocompworkshop
#SBATCH --time=10:00
#SBATCH --reservation=biocompworkshop
#SBATCH --mail-type=ALL
#SBATCH --mail-user=<email-addr>
#SBATCH --output=r_%A.out
export R_FILES=/gscratch/$USER
echo "R Workflow Example"
START=$(date +'%D %T')
echo "Start:" $START
echo "SLURM_JOB_ID:" $SLURM_JOB_ID
echo "SLURM_JOB_NAME:" $SLURM_JOB_NAME
echo "SLURM_JOB_NODELIST:" $SLURM_JOB_NODELIST
module use /project/biocompworkshop/software/modules
module load r/4.4.0-biocomp
cd $R_FILES
Rscript test_r_libraries.R
END=$(date +'%D %T')
echo "End:" $END

Submit your Job

# From your Working Directory - the folder you are currently in.
[@blog2]$ ls
run_r.sh  test_data
# You can submit the job from the login node.
# Make a note of the job id.
[@blog2]$ sbatch run_r.sh
Submitted batch job 16054193
# ST Column: Status of P means Pending / R means Running.
[@blog2]$ squeue -u salexan5
             JOBID PARTITION     NAME     USER ST       TIME  NODES NODELIST(REASON)
          16054193     teton    r_job salexan5  R       0:06      1 t402
# Once the job is running, the defined output file will be generated.
[@blog2]$ ls
r_16054193.out  run_r.sh  test_data

Monitor your Job

# You can view the contents of your output file:
[@blog2]$ cat r_16054193.out
R Workflow Example
Start: 06/05/24 14:02:01
SLURM_JOB_ID: 16054193
SLURM_JOB_NAME: r_job
SLURM_JOB_NODELIST: m221
Sleeping...
[@blog1]$ squeue -u salexan5
             JOBID PARTITION     NAME     USER ST       TIME  NODES NODELIST(REASON)
          16054193     teton    r_job salexan5  R       0:18      1 t402
# If the job id is nolonger in the queue then it means the job is no longer running.
# It might have completed, or failed and exited.
[@blog1]$ squeue -u salexan5
             JOBID PARTITION     NAME     USER ST       TIME  NODES NODELIST(REASON)

Why is my Job Not Running?

Previously we explained: The two GPU compute nodes part of this reservation each have 8 GPU devices. We can have different, individual jobs run on each of these compute nodes, without effecting each other.

So, we can have 16 concurrent jobs all running with a single GPU each.

But, what if a 17th person submitted a similar job?

Slurm will add this job to the queue, but it will be PENDING (P) while it waits for the necessary resources to become available.

As soon as there are, this 17th job will start, and it’s status will update to RUNNING (R).

Slurm manages this for you.

Monitor your Job: Continued…

# You can monitor the queue and/or log file to check if running.
[salexan5@blog2 salexan5]$ cat r_16054193.out
R Workflow Example
Start: 06/05/24 14:02:01
SLURM_JOB_ID: 16054193
SLURM_JOB_NAME: r_job
SLURM_JOB_NODELIST: t402
Sleeping...
Loading required package: SeuratObject
Loading required package: sp
Attaching package: ‘SeuratObject’
The following objects are masked from ‘package:base’:
    intersect, t
End: 06/05/24 14:02:29
# OR...

Alternative Monitoring of Job via Email: Job Efficiency

# Monitor your email:
Email 1:
Subject: beartooth Slurm Job_id=16054193 Name=r_job Began, Queued time 00:00:01
Email 2: Job Efficieny:
Subject: beartooth Slurm Job_id=16054193 Name=r_job Ended, Run time 00:00:28, COMPLETED, ExitCode 0
Job ID: 16054193
Cluster: beartooth
User/Group: salexan5/salexan5
State: COMPLETED (exit code 0)
Cores: 1
CPU Utilized: 00:00:07
CPU Efficiency: 25.00% of 00:00:28 core-walltime
Job Wall-clock time: 00:00:28
Memory Utilized: 0.00 MB (estimated maximum)
Memory Efficiency: 0.00% of 1000.00 MB (1000.00 MB/core)

Example Script 2

This might look like something your cover in later sessions:

#!/bin/bash
#SBATCH --job-name=hisat2
#SBATCH --account=biocompworkshop       
#SBATCH --time=8:00:00                  
#SBATCH --cpus-per-task=4 
#SBATCH --mem=8G                        
#SBATCH --reservation=biocompworkshop   
#SBATCH --mail-type=ALL
#SBATCH --mail-user=<email-addr>
#SBATCH --output=hisat2_%A.out
START=$(date +'%D %T')
echo "Start:" $START
echo "SLURM_JOB_ID:" $SLURM_JOB_ID
echo "SLURM_JOB_NAME:" $SLURM_JOB_NAME
echo "SLURM_JOB_NODELIST:" $SLURM_JOB_NODELIST
module load gcc/12.2.0 hisat2/2.2.1
export REFERENCE=/project/biocompworkshop/rshukla/Grch38/fasta
export INDEX=/project/biocompworkshop/rshukla/Grch38/Hisat2
# Comment: Location of the splicesites.tsv file.
cd /gscratch/$USER
hisat2-build -p 4 --ss splicesites.tsv --exon $INDEX/exons.tsv $REFERENCE/chr22_with_ERCC92.fa $INDEX/chr22
END=$(date +'%D %T')
echo "End:" $END

Examples and Cheat Sheets

Can be copied from: /project/biocompworkshop/arcc_notes

07 Summary and Next Steps

We’ve covered the following high-level concepts, commands and processes:
What is HPC and what is a cluster - focusing on ARC’s Beartooth cluster.
An introduction to Linux and its File System, and how to navigate around using an Interactive Desktop and/or using the command-line.
Linux command-line commands to view, search, parse, sort text files.
How to pipe the output of one command to the input of another, and how to redirect output to a file.
Using vim as a command-line text editor and/or emacs as a GUI within an Interactive Desktop.
Setting up your environment (using modules) to provide R/Python environments, and other software applications.
Accessing compute nodes via a SouthPass Interactive Desktop, and requesting different resources (cores, memory, GPUs).
Requesting interactive sessions (from a login node) using salloc.
Setting up a workflow, within a script, that can then be submitted to the Slurm queue using sbatch, and how to monitor jobs.

Further Assistance:

Everything covered can be found in previous workshops and additional information can be found on our Wiki.
ARCC personnel will be around in-person for the first three days to assist with cluster/Linux related questions and issues.
We will provide virtual support over Thursday/Friday. Submit questions via the Slack channel and these will be passed onto us, and we will endeavor to set up a zoom via our Office Hours.

BioCompWorkshop: ARCC Presentation

0 Getting Started

00 Introduction and Setting the Scope:

01 About UW ARCC and HPC

About ARCC and how to reach us

What is HPC

What is a Compute Node?

Homogeneous vs Heterogeneous HPCs

Beartooth Cluster: Heterogeneous: Partitions

Reservation

02 Using Southpass to access the Beartooth HPC Cluster

Log in and Access the Cluster

Login to Southpass

Using Southpass

Exercise: Beartooth XFCE Desktop

Structure of the Linux File System and HPC Directories

Linux File Structure

Linux Operating Systems (Generally)

Compare and Contrast: Linux, HPC Specific, Beartooth Specific

HPC Specific Folders:

Beartooth Specific

Exercise: File Browsing in Southpass GUI

Demonstration opening emacs GUI based text editor

03 Using Linux and the Command Line

Exercise: Shell Terminal Introducing Command Line

What am I Using?

Login Node Policy

Demonstrating how to get help in CLI

Demonstrating file navigation in CLI

Demonstrating how to create and remove files and folders using CLI

Text Editor Cheatsheets

Vi/Vim Cheatsheet

Nano Cheatsheet

Vi/Vim Cheatsheet

Nano Cheatsheet

Demonstrating vi/vim text editor

Try the vim tutor

*** Break ***

04 Using Linux to Search/Parse Text Files

Your Environment: Echo and Export

Use Our Environment Variable

Search for a File

Use Wildcards *

View the Contents of a File

View the Start and End of a File

Search the Contents of a Text File

Grep-ing with Case-Insensitive and Line Numbers

Pipe: Count, Sort

Uniq

Redirect Output into a File

05 Lets start using R(/Python) and RStudio

Open a Terminal

Setting Up a Session Environment

What is Available?

Is Python and/or R available?

Load a Compiler

Load a Newer Version of Python

Typically Loading R

Load R/RStudio for this Workshop

Configure your R Environment for this Workshop

Request Interactive Session (Compute Node) from a Login Node

Request Interactive Session (Compute Node) with a GPU

Request what you Need!

06 Create a basic workflow and submitting jobs.

Why Submit a Job

Submit a Job to the Cluster

Additional sbatch Options

Example Script: What Goes into It?

Example Script: Running R Script

Submit your Job

Monitor your Job

Why is my Job Not Running?

Monitor your Job: Continued…

Alternative Monitoring of Job via Email: Job Efficiency

Example Script 2

Examples and Cheat Sheets

07 Summary and Next Steps

Related content

* Break *

Additional `sbatch` Options