
Put it into Practice Ex01: Conda and Job Submission

Goal: Work through the steps of pulling some data from the Internet, creating a Conda environment for analyzing that data, and performing the analysis via a job submission.


This exercise is structured into three parts:

  1. Description: This details what to do and the results to check your work against. Try to perform this as-is and see how far you get - this tests your current knowledge and highlights areas to look back on and review.

  2. Pointers and Guides: Use these sections to assist you if you’re unsure and would like some hints and suggestions.

  3. Answer: This will lay out one (of potentially many) approaches to perform this exercise.

    1. Please do not just jump to this section and cut-n-paste - what have you actually learned from doing this?

    2. To become a good HPC user you need to engage with this exercise: work through it, apply and verify what you know, and learn from problem solving and resolving mistakes.

The Exercise Extensions section provides questions to consider that can make your workflow more advanced, and introduces circumstances we have encountered with existing users.



Description

High Level:

  • Create a self-contained Conda environment that provides the HTSeq application, and use it to submit a job that utilizes a single node with four cores to perform some guided analysis.

  • The Conda environment needs to be created under a project and share-able with others within the project.

  • You will be directed to where data for the analysis can be retrieved from the Internet. This data will need to be downloaded to the cluster.

  • Scripts, data and resulting analysis will need to be stored within the /project/<project-name> and share-able.

Data: Retrieve data from the HTSeq example data folder. Specifically you will be using the following two files:

  1. bamfile_no_qualities.bam

  2. bamfile_no_qualities.gtf

Once downloaded, the two files should have size:

966147 bamfile_no_qualities.bam
282781 bamfile_no_qualities.gtf

Analysis: Perform the following analysis using the htseq-count command, of the form:

htseq-count <bam-file>.bam <gtf-file>.gtf

The resulting analysis should be saved in a file called count.txt

The analysis should complete and write out:

2358 GFF lines processed.
9997 alignment records processed.

Output: After performing the analysis, your count.txt file should start and finish with:

16S_rRNA                2
23S_rRNA                9
5S_rRNA-1               0
5S_rRNA-2               0
TK0001                  8
TK0002                  0
TK0003                  0
TK0004                  0
TK0005                  0
TK0006                  0
...
tRNA-Tyr                0
tRNA-Val-1              0
tRNA-Val-2              0
tRNA-Val-3              0
tRNA-Val-4              0
__no_feature            290
__ambiguous             270
__too_low_aQual         0
__not_aligned           0
__alignment_not_unique  0

Pointers and Guides: Initial Consideration

Carefully read through the description and take note of key points. Some things (but not everything) that you might want to consider are:

  • Data, scripts and output should be within a project and share-able. Where and how will this be structured and organized? How will you manage your data?

  • The Conda environment should be self-contained i.e. all installed libraries/packages should be within the Conda environment location. Nothing should be installed locally within your /home.

  • The analysis should be performed by submitting a job to the cluster. What resources do you need to request? How will you translate your workflow into a submission script?


Data Management: Structure and Organize the Work:

  • Where should you create a share-able Conda environment? Suggestion:

/project/<project-name>/software/
  • How will you organize scripts, data and results? Suggestion:

/project/<project-name>/
    exercise01/
        scripts/
        data/
        results/

Where would you locate a README.txt file describing what you’ve done?

These folders will not already exist; you will need to create them. Review The Linux File System.
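For example, a minimal sketch using bash brace expansion (placing README.txt at the top of exercise01/ is just one reasonable choice):

[]$ cd /project/<project-name>
[]$ mkdir -p exercise01/{scripts,data,results}
[]$ touch exercise01/README.txt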


Getting the Data

Considerations:

  • Navigating to the example data page, what happens if you click on either of the two required files?

  • If these are downloaded to desktop, how will you transfer them across to the cluster?

  • Can you use wget to download the raw files?

    • How can you find help on wget? Review Getting Help.

    • Take note that you require the raw data.

    • Can you check that you’ve downloaded the actual raw data and not just some HTML page that refers to it?

  • How do you check the size of a file? Review the Linux ls command.
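As a sketch of the kind of steps involved (the URL shown is a placeholder for the raw file link you identify on the example data page):

[]$ man wget
[]$ wget <raw-file-url>
[]$ ls -l

Compare the byte counts reported by ls -l against the sizes listed in the Description.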


Creating the Conda Environment

Considerations:

  • Read the Documentation! Read through the installation page.

  • This talks about using pip install.

  • Is it available as a Conda package? Under which channels?

  • How are you going to deal with required dependencies?

  • It has been suggested to locate the Conda environment under /project/<project-name>/software/. How will you create it under there?

  • The environment has to be self-contained - what do you need to configure so nothing is installed locally under your /home?
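As a sketch of the kind of commands involved (the module name and <env-name> are placeholders; the Answer section below walks through a worked example):

[]$ module load miniconda3
[]$ cd /project/<project-name>/software/conda-envs
[]$ conda create -p <env-name>
[]$ conda activate /project/<project-name>/software/conda-envs/<env-name>
[]$ export PYTHONUSERBASE=$CONDA_PREFIX

The -p (prefix) option places the environment at the given path rather than under your /home, and pointing PYTHONUSERBASE into the environment stops Python picking up packages from ~/.local.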


Plan Your Workflow

Considerations:

  • Start understanding how you’re going to perform the analysis.

  • Where are you calling scripts from?

  • How do you activate your Conda environment and use the htseq-count command?

  • You’re asked to use four cores - how does the htseq-count command use these? Is there an option to set? How would you find this out?

  • Since you shouldn’t run computation on the login nodes, how would you use an interactive session to start testing this?
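One way to start testing interactively (a minimal sketch; salloc options and defaults vary by site, and the time value is a placeholder):

[]$ salloc --account=<project-name> --nodes=1 --cpus-per-task=4 --time=30:00
[]$ module load miniconda3
[]$ conda activate /project/<project-name>/software/conda-envs/<env-name>
[]$ htseq-count --help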


Submit the Job

Once you have your workflow planned and tested, transfer it into a submission script so that it can be submitted to the cluster.

Review: Submit Jobs.


Analyze the Results

Considerations:

  • How can you view the start and end of a text file?
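For example (head and tail print the first and last 10 lines by default; -n changes how many):

[]$ head count.txt
[]$ tail -n 5 count.txt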


Answer

The following answer is just one (of potentially many) possible approaches to implement this exercise.

If you’d like to review anything and/or discuss alternatives, please contact us and we’ll happily arrange to have a conversation.


Setup Structure Under a Project

  • This needs to be share-able across a project, so you should not create anything within your /home.

[]$ cd /project/<project-name>/
[]$ mkdir exercise01
[]$ cd exercise01/
[]$ mkdir scripts data results
[]$ pwd
/project/<project-name>/exercise01
[]$ ls
data  results  scripts

Get the Data

  • In this example we’re using the wget command to download the raw data.

[]$ cd data/
[]$ wget https://raw.githubusercontent.com/htseq/htseq/main/example_data/bamfile_no_qualities.gtf
[]$ wget https://github.com/htseq/htseq/raw/main/example_data/bamfile_no_qualities.bam
[]$ ls -al
...
-rw-r--r-- 1 <username> <project-name> 966147 Aug 29 09:46 bamfile_no_qualities.bam
-rw-r--r-- 1 <username> <project-name> 282781 Aug 29 09:45 bamfile_no_qualities.gtf

Notice how the URLs are different. Depending on the type of data and how it is displayed from the repository, you might need to click around to find the raw version.
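If in doubt, one quick check is the file command: a raw GTF is plain text, while an accidental save of the rendered web page would typically report as an HTML document:

[]$ file bamfile_no_qualities.gtf
[]$ head -n 2 bamfile_no_qualities.gtf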


Create the Conda Environment

  • This needs to be share-able, so the suggestion is to install under /project/<project-name>/software, and then under a child folder called conda-envs, to organize Conda environments under the same location.

  • Although the installation documentation talks about using pip install and lists various dependencies, you can always check if a Conda package is available. Also, try Googling: bioconda / packages / htseq.

  • Take a note of the channel where this is found. Do you have this configured? Review Conda Channels.

[]$ cd /project/<project-name>/software
[]$ mkdir conda-envs
[]$ cd conda-envs/
[]$ pwd
/project/<project-name>/software/conda-envs
[]$ module load miniconda3/24.3.0
[]$ conda search htseq
Loading channels: done
No match found for: htseq. Search: *htseq*
[]$ conda search bioconda::htseq
Loading channels: done
# Name                       Version           Build  Channel
...
htseq                          2.0.5  py39hd5189a5_1  bioconda
[]$ conda create -p htseq_2.0.5
...
# To activate this environment, use
#
#     $ conda activate /cluster/medbow/project/<project-name>/software/conda-envs/htseq_2.0.5
[]$ conda activate /cluster/medbow/project/<project-name>/software/conda-envs/htseq_2.0.5
(/cluster/medbow/project/<project-name>/software/conda-envs/htseq_2.0.5) []$ conda install bioconda::htseq
  • Since there is a Conda package available, this should make installation of dependencies easier.

  • But is this the latest version? When this exercise was created the Conda package was 2.0.5, while the pip version was 2.0.8 - is this significant?

  • Once installed, run the htseq-count command’s help and check its output to look for parallel related options.

() []$ conda install bioconda::htseq
() []$ export PYTHONUSERBASE=$CONDA_PREFIX
() []$ htseq-count -h
...
  -n NPROCESSES, --nprocesses NPROCESSES
                        Number of parallel CPU processes to use (default: 1).
                        This option is useful to process several input files at
                        once. Each file will use only 1 CPU. It is possible, of
                        course, to split a very large input SAM/BAM files into
                        smaller chunks upstream to make use of this option.
...
() []$ conda deactivate
[]$

Note: The use of:

export PYTHONUSERBASE=$CONDA_PREFIX

This Conda environment needs to be self-contained and not use/install anything under a user’s local /home.

For example, if you have a later version of numpy (say numpy/2.0.0) under ~/.local/lib/pythonX.Y/ this will clash with the version within the Conda environment and you’ll see an error of the form:

() []$ htseq-count --help

A module that was compiled using NumPy 1.x cannot be run in
NumPy 2.0.0 as it may crash. To support both 1.x and 2.x
versions of NumPy, modules must be compiled with NumPy 2.0.
Some module may need to rebuild instead e.g. with 'pybind11>=2.12'.

If you are a user of the module, the easiest solution will be to
downgrade to 'numpy<2' or try to upgrade the affected module.
We expect that some modules will need time to support NumPy 2.
...
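With the environment activated, one way to confirm that packages resolve from the environment rather than from ~/.local is to print where Python imports a package from (numpy here is just an example package):

() []$ python -c "import numpy; print(numpy.__version__, numpy.__file__)"

The printed path should sit under the environment’s location ($CONDA_PREFIX), not under ~/.local/lib.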

Submit the Job

  • Once testing has been performed and the workflow finalized, the next step is to translate this into a submission script.

[]$ cd /project/<project-name>/exercise01/
[]$ cd scripts/
[]$ vim run.sh
#!/bin/bash

#SBATCH --job-name=htseq_analysis
#SBATCH --nodes=1
#SBATCH --cpus-per-task=4
#SBATCH --account=<project-name>
#SBATCH --time=10:00
#SBATCH --mail-type=ALL
#SBATCH --mail-user=<email-address>
#SBATCH --output=output/htseq_output_%A.out

echo "SLURM_JOB_ID:" $SLURM_JOB_ID
echo "SLURM_JOB_NAME:" $SLURM_JOB_NAME
echo "SLURM_JOB_PARTITION" $SLURM_JOB_PARTITION
echo "SLURM_JOB_NUM_NODES:" $SLURM_JOB_NUM_NODES
echo "SLURM_JOB_NODELIST:" $SLURM_JOB_NODELIST
echo "SLURM_CPUS_PER_TASK:" $SLURM_CPUS_PER_TASK

module purge
module load miniconda3/24.3.0

conda activate /cluster/medbow/project/<project-name>/software/conda-envs/htseq_2.0.5
export PYTHONUSERBASE=$CONDA_PREFIX

echo "Starting Analysis:"
htseq-count ../data/bamfile_no_qualities.bam ../data/bamfile_no_qualities.gtf -n 4 > ../results/count.txt
echo "Done."

conda deactivate
  • Submit this script and then notice the various outputs that relate to the relative paths defined. Note that the output/ folder referenced by --output must exist before you submit; Slurm will not create it for you:

[]$ mkdir output
[]$ sbatch run.sh
Submitted batch job <job-id>
[]$ ls output/
htseq_output_<job-id>.out
[]$ ls ../results/
count.txt

Look at the Results

  • To confirm your results are correct, look at the Slurm job output, and the head/tail of the count.txt file:

[]$ cat output/htseq_output_<job-id>.out
SLURM_JOB_ID: <job-id>
SLURM_JOB_NAME: htseq_analysis
SLURM_JOB_PARTITION inv-arcc
SLURM_JOB_NUM_NODES: 1
SLURM_JOB_NODELIST: mbcpu-025
SLURM_CPUS_PER_TASK: 4
The following modules were not unloaded:
  (Use "module --force purge" to unload all):

  1) slurm/latest   2) arcc/1.0

Starting Analysis:
2358 GFF lines processed.
9997 alignment records processed.
Done.
[]$ head ../results/count.txt
16S_rRNA                2
23S_rRNA                9
5S_rRNA-1               0
5S_rRNA-2               0
TK0001                  8
TK0002                  0
TK0003                  0
TK0004                  0
TK0005                  0
TK0006                  0
[]$ tail ../results/count.txt
tRNA-Tyr                0
tRNA-Val-1              0
tRNA-Val-2              0
tRNA-Val-3              0
tRNA-Val-4              0
__no_feature            290
__ambiguous             270
__too_low_aQual         0
__not_aligned           0
__alignment_not_unique  0

Exercise Extensions

To expand on this exercise:

  1. How would you create a module file to replace having to activate your Conda environment?

  2. How would you modify your workflow so that all analysis was performed under /gscratch - what does your workflow look like to move data from and to the project?

  3. If you had a number of htseq-count commands to perform, how would you update your overall workflow? (See the job-array sketch after this list.)

    • What if you had 5 to perform?

    • What if you had 50 to perform?

    • What if you had 500 to perform?

    • What if you had 5000 to perform?

  4. How would you update the Conda environment so that it can be used as a kernel within Jupyter?
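For extension 3, one common pattern is a Slurm job array driven by a list of inputs. A minimal sketch, where jobs.txt and its one-triple-per-line format are hypothetical:

#!/bin/bash

#SBATCH --job-name=htseq_array
#SBATCH --account=<project-name>
#SBATCH --cpus-per-task=4
#SBATCH --time=10:00
#SBATCH --array=1-50
#SBATCH --output=output/htseq_%A_%a.out

# jobs.txt holds one "<bam> <gtf> <output>" triple per line (hypothetical input list).
# Each array task reads the line matching its task ID.
read -r BAM GTF OUT <<< "$(sed -n "${SLURM_ARRAY_TASK_ID}p" jobs.txt)"

module purge
module load miniconda3/24.3.0

conda activate /cluster/medbow/project/<project-name>/software/conda-envs/htseq_2.0.5
export PYTHONUSERBASE=$CONDA_PREFIX

htseq-count "$BAM" "$GTF" > "$OUT"

conda deactivate

Each array task processes one line of jobs.txt; for 500 or 5000 commands the same script scales by changing --array, possibly with a throttle (e.g. --array=1-5000%50 to limit how many tasks run at once).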


 
