
Put it into Practice Ex01: Conda and Job Submission

Goal: Work through the steps of pulling some data from the Internet, creating a Conda environment for analyzing that data, and performing the analysis via a job submission.


This exercise is structured into three parts:

  1. Description: This details what to do and the results to check your work against. Try to perform this as-is and see how far you get - this tests your current knowledge and highlights areas to look back on and review.

  2. Pointers and Guides: Use these sections to assist you if you’re unsure and would like some hints and suggestions.

  3. Answer: This will lay out one (of potentially many) approaches to perform this exercise.

    1. Please do not just jump to this section and cut-n-paste - what have you actually learned from doing this?

    2. To become a good HPC user you need to engage with this exercise: work through it, apply and verify what you know, and learn from problem solving and resolving mistakes.

The Exercise Extensions section provides questions to consider that can make your workflow more advanced, and introduces circumstances we have encountered with existing users.



Description

High Level:

  • Create a self-contained Conda environment that provides the HTSeq application, and use it to submit a job that utilizes a single node with four cores to perform some guided analysis.

  • The Conda environment needs to be created under a project and share-able with others within the project.

  • You will be directed to where data for the analysis can be retrieved from the Internet. This data will need to be downloaded to the cluster.

  • Scripts, data and resulting analysis will need to be stored within the /project/<project-name> and share-able.

Data: Retrieve data from the HTSeq example data folder. Specifically you will be using the following two files:

  1. bamfile_no_qualities.bam

  2. bamfile_no_qualities.gtf

Once downloaded, the two files should have size:

966147 bamfile_no_qualities.bam
282781 bamfile_no_qualities.gtf

Analysis: Perform the following analysis using the htseq-count command, of the form:

htseq-count <bam-file>.bam <gtf-file>.gtf

The resulting analysis should be saved in a file called count.txt

The analysis should complete and write out:

2358 GFF lines processed.
9997 alignment records processed.

Output: After performing the analysis, your count.txt file should start and finish with:

16S_rRNA                2
23S_rRNA                9
5S_rRNA-1               0
5S_rRNA-2               0
TK0001                  8
TK0002                  0
TK0003                  0
TK0004                  0
TK0005                  0
TK0006                  0
...
tRNA-Tyr                0
tRNA-Val-1              0
tRNA-Val-2              0
tRNA-Val-3              0
tRNA-Val-4              0
__no_feature            290
__ambiguous             270
__too_low_aQual         0
__not_aligned           0
__alignment_not_unique  0

Pointers and Guides: Initial Consideration

Carefully read through the description and take note of key points. Some things (but not everything) that you might want to consider are:

  • Data, scripts and output should be within a project and share-able. Where and how will this be structured and organized? How will you manage your data?

  • The Conda environment should be self-contained i.e. all installed libraries/packages should be within the Conda environment location. Nothing should be installed locally within your /home.

  • The analysis should be performed by submitting a job to the cluster. What resources do you need to request? How will you translate your workflow into a submission script?


Data Management: Structure and Organize the Work:

  • Where should you create a share-able Conda environment? Suggestion:

/project/<project-name>/software/
  • How will you organize scripts, data and results? Suggestion:

/project/<project-name>/
    exercise01/
        scripts/
        data/
        results/

Where would you locate a README.txt file describing what you’ve done?

These folders will not already exist; you will need to create them. Review The Linux File System.
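For example, a minimal sketch using bash brace expansion (placing README.txt at the top of exercise01/ is just one reasonable choice):

[]$ cd /project/<project-name>
[]$ mkdir -p exercise01/{scripts,data,results}
[]$ touch exercise01/README.txt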


Getting the Data

Considerations:

  • Navigating to the example data page, what happens if you click on either of the two required files?

  • If these are downloaded to desktop, how will you transfer them across to the cluster?

  • Can you use wget to download the raw files?

    • How can you find help on wget? Review Getting Help.

    • Take note that you require the raw data.

    • Can you check that you’ve downloaded the actual raw data and not just some HTML page that refers to it?

  • How do you check the size of a file? Review the Linux ls command.
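As a sketch of the kind of steps involved (the URL shown is a placeholder for the raw file link you identify on the example data page):

[]$ man wget
[]$ wget <raw-file-url>
[]$ ls -l

Compare the byte counts reported by ls -l against the sizes listed in the Description.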


Creating the Conda Environment

Considerations:

  • Read the Documentation! Read through the installation page.

  • This talks about using pip install.

  • Is it available as a Conda package? Under which channels?

  • How are you going to deal with required dependencies?

  • It has been suggested to locate the Conda environment under /project/<project-name>/software/. How will you create it under there?

  • The environment has to be self-contained - what do you need to configure so nothing is installed locally under your /home?
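As a sketch of the kind of commands involved (the module name and <env-name> are placeholders; the Answer section below walks through a worked example):

[]$ module load miniconda3
[]$ cd /project/<project-name>/software/conda-envs
[]$ conda create -p <env-name>
[]$ conda activate /project/<project-name>/software/conda-envs/<env-name>
[]$ export PYTHONUSERBASE=$CONDA_PREFIX

The -p (prefix) option places the environment at the given path rather than under your /home, and pointing PYTHONUSERBASE into the environment stops Python picking up packages from ~/.local.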


Plan Your Workflow

Considerations:

  • Start understanding how you’re going to perform the analysis.

  • Where are you calling scripts from?

  • How do you activate your Conda environment and use the htseq-count command?

  • You’re asked to use four cores - how does the htseq-count command use these? Is there an option to set? How would you find this out?

  • Since you shouldn’t run computation on the login nodes, how would you use an interactive session to start testing this?
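One way to start testing interactively (a minimal sketch; salloc options and defaults vary by site, and the time value is a placeholder):

[]$ salloc --account=<project-name> --nodes=1 --cpus-per-task=4 --time=30:00
[]$ module load miniconda3
[]$ conda activate /project/<project-name>/software/conda-envs/<env-name>
[]$ htseq-count --help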


Submit the Job

Once you have your workflow planned and tested, transfer it into a submission script so that it can be submitted to the cluster.

Review: Submit Jobs.


Analyze the Results

Considerations:

  • How can you view the start and end of a text file?
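For example (head and tail print the first and last 10 lines by default; -n changes how many):

[]$ head count.txt
[]$ tail -n 5 count.txt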


Answer

The following answer is just one (of potentially many) possible approaches to implement this exercise.

If you’d like to review anything and/or discuss alternatives, please contact us and we’ll happily arrange to have a conversation.


Setup Structure Under a Project

  • This needs to be share-able across a project, so you should not create anything within your /home.

[]$ cd /project/<project-name>/
[]$ mkdir exercise01
[]$ cd exercise01/
[]$ mkdir scripts data results
[]$ pwd
/project/<project-name>/exercise01
[]$ ls
data  results  scripts

Get the Data

  • In this example we’re using the wget command to download the raw data.

[]$ cd data/
[]$ wget https://raw.githubusercontent.com/htseq/htseq/main/example_data/bamfile_no_qualities.gtf
[]$ wget https://github.com/htseq/htseq/raw/main/example_data/bamfile_no_qualities.bam
[]$ ls -al
...
-rw-r--r-- 1 <username> <project-name> 966147 Aug 29 09:46 bamfile_no_qualities.bam
-rw-r--r-- 1 <username> <project-name> 282781 Aug 29 09:45 bamfile_no_qualities.gtf

Notice how the URLs are different. Depending on the type of data and how it is displayed from the repository, you might need to click around to find the raw version.
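If in doubt, one quick check is the file command: a raw GTF is plain text, while an accidental save of the rendered web page would typically report as an HTML document:

[]$ file bamfile_no_qualities.gtf
[]$ head -n 2 bamfile_no_qualities.gtf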


Create the Conda Environment

  • This needs to be share-able, so the suggestion is to install under /project/<project-name>/software, and then under a child folder called conda-envs, to organize Conda environments under the same location.

  • Although the installation documentation talks about using pip install and lists various dependencies, you can always check if a Conda package is available. Also, try Googling: bioconda / packages / htseq.

  • Take a note of the channel where this is found. Do you have this configured? Review Conda Channels.

[]$ cd /project/<project-name>/software
[]$ mkdir conda-envs
[]$ cd conda-envs/
[]$ pwd
/project/<project-name>/software/conda-envs
[]$ module load miniconda3/24.3.0
[]$ conda search htseq
Loading channels: done
No match found for: htseq. Search: *htseq*
[]$ conda search bioconda::htseq
Loading channels: done
# Name                       Version           Build  Channel
...
htseq                          2.0.5  py39hd5189a5_1  bioconda
[]$ conda create -p htseq_2.0.5
...
# To activate this environment, use
#
#     $ conda activate /cluster/medbow/project/<project-name>/software/conda-envs/htseq_2.0.5
[]$ conda activate /cluster/medbow/project/<project-name>/software/conda-envs/htseq_2.0.5
(/cluster/medbow/project/<project-name>/software/conda-envs/htseq_2.0.5) []$ conda install bioconda::htseq
  • Since there is a Conda package available, this should make installation of dependencies easier.

  • But is this the latest version? When this exercise was created the Conda package was 2.0.5, while the pip version was 2.0.8 - is this significant?

  • Once installed, run the htseq-count command’s help and check its output to look for parallel related options.

() []$ conda install bioconda::htseq
() []$ export PYTHONUSERBASE=$CONDA_PREFIX
() []$ htseq-count -h
...
  -n NPROCESSES, --nprocesses NPROCESSES
                        Number of parallel CPU processes to use (default: 1).
                        This option is useful to process several input files at
                        once. Each file will use only 1 CPU. It is possible, of
                        course, to split a very large input SAM/BAM files into
                        smaller chunks upstream to make use of this option.
...
() []$ conda deactivate
[]$

Note: The use of:

export PYTHONUSERBASE=$CONDA_PREFIX

This Conda environment needs to be self-contained and not use/install anything under a user’s local /home.

For example, if you have a later version of numpy (say numpy/2.0.0) under ~/.local/lib/pythonX.Y/ this will clash with the version within the Conda environment and you’ll see an error of the form:

() []$ htseq-count --help

A module that was compiled using NumPy 1.x cannot be run in
NumPy 2.0.0 as it may crash. To support both 1.x and 2.x
versions of NumPy, modules must be compiled with NumPy 2.0.
Some module may need to rebuild instead e.g. with 'pybind11>=2.12'.

If you are a user of the module, the easiest solution will be to
downgrade to 'numpy<2' or try to upgrade the affected module.
We expect that some modules will need time to support NumPy 2.
...
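With the environment activated, one way to confirm that packages resolve from the environment rather than from ~/.local is to print where Python imports a package from (numpy here is just an example package):

() []$ python -c "import numpy; print(numpy.__version__, numpy.__file__)"

The printed path should sit under the environment’s location ($CONDA_PREFIX), not under ~/.local/lib.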

Submit the Job

  • Once testing has been performed and the workflow finalized, the next step is to translate this into a submission script.

[]$ cd /project/<project-name>/exercise01/
[]$ cd scripts/
[]$ vim run.sh
#!/bin/bash

#SBATCH --job-name=htseq_analysis
#SBATCH --nodes=1
#SBATCH --cpus-per-task=4
#SBATCH --account=<project-name>
#SBATCH --time=10:00
#SBATCH --mail-type=ALL
#SBATCH --mail-user=<email-address>
#SBATCH --output=output/htseq_output_%A.out

echo "SLURM_JOB_ID:" $SLURM_JOB_ID
echo "SLURM_JOB_NAME:" $SLURM_JOB_NAME
echo "SLURM_JOB_PARTITION" $SLURM_JOB_PARTITION
echo "SLURM_JOB_NUM_NODES:" $SLURM_JOB_NUM_NODES
echo "SLURM_JOB_NODELIST:" $SLURM_JOB_NODELIST
echo "SLURM_CPUS_PER_TASK:" $SLURM_CPUS_PER_TASK

module purge
module load miniconda3/24.3.0

conda activate /cluster/medbow/project/<project-name>/software/conda-envs/htseq_2.0.5
export PYTHONUSERBASE=$CONDA_PREFIX

echo "Starting Analysis:"
htseq-count ../data/bamfile_no_qualities.bam ../data/bamfile_no_qualities.gtf -n 4 > ../results/count.txt
echo "Done."

conda deactivate
  • Submit this script and then notice the various outputs that relate to the relative paths defined. Note that the output/ folder referenced by --output must exist before you submit; Slurm will not create it for you:

[]$ mkdir output
[]$ sbatch run.sh
Submitted batch job <job-id>
[]$ ls output/
htseq_output_<job-id>.out
[]$ ls ../results/
count.txt

Look at the Results

  • To confirm your results are correct, look at the Slurm job output, and the head/tail of the count.txt file:

[]$ cat output/htseq_output_<job-id>.out
SLURM_JOB_ID: <job-id>
SLURM_JOB_NAME: htseq_analysis
SLURM_JOB_PARTITION inv-arcc
SLURM_JOB_NUM_NODES: 1
SLURM_JOB_NODELIST: mbcpu-025
SLURM_CPUS_PER_TASK: 4
The following modules were not unloaded:
  (Use "module --force purge" to unload all):

  1) slurm/latest   2) arcc/1.0

Starting Analysis:
2358 GFF lines processed.
9997 alignment records processed.
Done.
[]$ head ../results/count.txt
16S_rRNA                2
23S_rRNA                9
5S_rRNA-1               0
5S_rRNA-2               0
TK0001                  8
TK0002                  0
TK0003                  0
TK0004                  0
TK0005                  0
TK0006                  0
[]$ tail ../results/count.txt
tRNA-Tyr                0
tRNA-Val-1              0
tRNA-Val-2              0
tRNA-Val-3              0
tRNA-Val-4              0
__no_feature            290
__ambiguous             270
__too_low_aQual         0
__not_aligned           0
__alignment_not_unique  0

Exercise Extensions

To expand on this exercise:

  1. How would you create a module file to replace having to activate your Conda environment?

  2. How would you modify your workflow so that all analysis was performed under /gscratch - what does your workflow look like to move data from and to the project?

  3. If you had a number of htseq-count commands to perform, how would you update your overall workflow? (See the job-array sketch after this list.)

    • What if you had 5 to perform?

    • What if you had 50 to perform?

    • What if you had 500 to perform?

    • What if you had 5000 to perform?

  4. How would you update the Conda environment so that it can be used as a kernel within Jupyter?
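For extension 3, one common pattern is a Slurm job array driven by a list of inputs. A minimal sketch, where jobs.txt and its one-triple-per-line format are hypothetical:

#!/bin/bash

#SBATCH --job-name=htseq_array
#SBATCH --account=<project-name>
#SBATCH --cpus-per-task=4
#SBATCH --time=10:00
#SBATCH --array=1-50
#SBATCH --output=output/htseq_%A_%a.out

# jobs.txt holds one "<bam> <gtf> <output>" triple per line (hypothetical input list).
# Each array task reads the line matching its task ID.
read -r BAM GTF OUT <<< "$(sed -n "${SLURM_ARRAY_TASK_ID}p" jobs.txt)"

module purge
module load miniconda3/24.3.0

conda activate /cluster/medbow/project/<project-name>/software/conda-envs/htseq_2.0.5
export PYTHONUSERBASE=$CONDA_PREFIX

htseq-count "$BAM" "$GTF" > "$OUT"

conda deactivate

Each array task processes one line of jobs.txt; for 500 or 5000 commands the same script scales by changing --array, possibly with a throttle (e.g. --array=1-5000%50 to limit how many tasks run at once).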


 
