Gentools

Overview

As part of a July 2022 workshop delivered on the WildIris cluster, the following group of packages has been made available as a single module environment.

Application / Package

Version

Notes:

blobtools

1.1.1

A modular command-line solution for visualisation, quality control and taxonomic partitioning of genome datasets.

blobtools -h

bedtools

2.30.0

Collectively, the bedtools utilities are a swiss-army knife of tools for a wide-range of genomics analysis tasks. The most widely-used tools enable genome arithmetic: that is, set theory on the genome. For example, bedtools allows one to intersect, merge, count, complement, and shuffle genomic intervals from multiple files in widely-used genomic file formats such as BAM, BED, GFF/GTF, VCF. While each individual tool is designed to do a relatively simple task (e.g., intersect two interval files), quite sophisticated analyses can be conducted by combining multiple bedtools operations on the UNIX command line.

bedtools --help

canu

1.4

A fork of the Celera Assembler designed for high-noise single-molecule sequencing (such as the PacBio RSII or Oxford Nanopore MinION).

canu

filtlong

0.2.1

A tool for filtering long reads by quality. It can take a set of long reads and produce a smaller, better subset. It uses both read length (longer is better) and read identity (higher is better) when choosing which reads pass the filter.

filtlong -h

LINKS

2.0.1

A genomics application for scaffolding genome assemblies with long reads, such as those produced by Oxford Nanopore Technologies Ltd. It can be used to scaffold high-quality draft genome assemblies with any long sequences (e.g., ONT reads, PacBio reads, other draft genomes). It is also used to scaffold contig pairs linked by ARCS/ARKS.

LINKS

masurca

3.3.0

The MaSuRCA (Maryland Super Read Cabog Assembler) genome assembly and analysis toolkit consists of the MaSuRCA genome assembler, the QuORUM error corrector for Illumina data, the POLCA genome polishing software, a chromosome scaffolder, the jellyfish mer counter, and the MUMmer aligner.

masurca -h

medaka

1.6.1

A tool to create consensus sequences and variant calls from nanopore sequencing data.

medaka_consensus -h

Notes:

1: Because medaka uses TensorFlow, it requires a physical node. Use partition=wildiris-phys.

If you do try running it on one of the virtual nodes, you will see the following:

    [salexan5@wi001 ~]$ medaka_consensus
    The TensorFlow library was compiled to use AVX instructions, but these aren't available on your machine.
    /apps/u/opt/gentools/1.0.0/bin/medaka_consensus: line 16: 17197 Aborted (core dumped) medaka tools list_models
    ...
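Before starting medaka on a node, you can check whether that node exposes AVX. The following is a minimal sketch, assuming a Linux node where CPU flags are listed in /proc/cpuinfo. The function name has_avx is hypothetical and not part of gentools.

```shell
#!/bin/bash
# Sketch: report whether the CPU flags in a cpuinfo-style file include AVX.
# has_avx is a hypothetical helper; it defaults to /proc/cpuinfo (Linux only).
has_avx() {
    local cpuinfo="${1:-/proc/cpuinfo}"
    # -w matches "avx" as a whole word, so "avx2" on its own does not count.
    grep -qw 'avx' "$cpuinfo" 2>/dev/null
}

if has_avx; then
    echo "AVX is available on $(uname -n)"
else
    echo "AVX is not available on $(uname -n); use partition=wildiris-phys"
fi
```

If the check reports that AVX is missing, you are most likely on a virtual node and medaka will abort as shown above.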

2: Although medaka (via TensorFlow) can use GPUs, the WildIris cluster does not have any. When running medaka you will see the following warnings, which can be ignored:

    [salexan5@wi005 ~]$ medaka_consensus
    2022-07-06 09:18:37.339581: W tensorflow/stream_executor/platform/default/dso_loader.cc:64] Could not load dynamic library 'libcudart.so.11.0'; dlerror: libcudart.so.11.0: cannot open shared object file: No such file or directory; LD_LIBRARY_PATH: /apps/s/slurm/21.08/lib64:/apps/s/slurm/21.08/lib
    2022-07-06 09:18:37.339601: I tensorflow/stream_executor/cuda/cudart_stub.cc:29] Ignore above cudart dlerror if you do not have a GPU set up on your machine.

    medaka 1.6.1
    ------------
    Assembly polishing via neural networks. Medaka is optimized to work with the Flye assembler.
    ...

3: The medaka documentation notes that samtools/bgzip/tabix version 1.14 and minimap2 version 2.17 are recommended, as these are the versions used in the development of medaka. You'll also need bcftools for various commands.

  • Version 1.15 of samtools, tabix, and bgzip is available via the samtools module.

  • Version 1.15 of bcftools is available via the bcftools module.

  • Version 2.17 of minimap2 is available within this collection.
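The following sketch loads the modules named above and then lists which of the helper programs are on your PATH. It assumes the standard module command is available in your shell. The function name check_tools is hypothetical, and the sketch only reports presence, not versions.

```shell
#!/bin/bash
# Sketch: load the helper modules, then list which tools medaka looks for
# are on PATH. check_tools is a hypothetical helper name.
check_tools() {
    local tool
    for tool in "$@"; do
        if command -v "$tool" >/dev/null 2>&1; then
            printf '%-10s found\n' "$tool"
        else
            printf '%-10s Not found\n' "$tool"
        fi
    done
}

# The module command only exists on the cluster; skip loading elsewhere.
if type module >/dev/null 2>&1; then
    module load gentools samtools bcftools
fi

check_tools bcftools bgzip minimap2 samtools tabix
```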

For example, if you try running something like the following without the modules loaded, you'll see:

    [salexan5@wi005 ~]$ medaka_haploid_variant -i <fastx_file> -r <fasta_file>
    ...
    Checking program versions
    This is medaka 1.6.1
    [main] unrecognized command '--version'
    Program    Version    Required  Pass
    bcftools   Not found  1.11      False
    bgzip      Not found  1.11      False
    minimap2   2.17       2.11      True
    samtools   Not found  1.11      False
    tabix      Not found  1.11      False

Once loaded, you’ll see:

 

miniasm

0.3-r179

A very fast OLC-based de novo assembler for noisy long reads. It takes all-vs-all read self-mappings (typically by minimap) as input and outputs an assembly graph in the GFA format.

miniasm -h

minimap2

2.17

A versatile sequence alignment program that aligns DNA or mRNA sequences against a large reference database.

minimap2 -h

NanoPlot

1.40.0

Plotting tool for long read sequencing data and alignments.

NanoPlot -h

pilon

1.24

A software tool which can be used to:

  • Automatically improve draft assemblies

  • Find variation among strains, including large event detection

pilon --help

Note: According to the pilon requirements documentation, the tool requires a minimum of 8G of memory to run. To accommodate this, when submitting a job using sbatch or creating an interactive session with salloc, please use --mem=8G.

If your data requires more than 8G, then you’ll need to use an alternative command-line. The example below demonstrates using --mem=16G:

Note how the --mem=16G matches -Xmx16G.
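One way to keep the two values in step is to derive both from a single variable. This is a minimal sketch, not the cluster's documented command line: it assumes pilon is launched through java -jar, and PILON_JAR and the input file names are hypothetical placeholders that you must replace.

```shell
#!/bin/bash
# Sketch: derive the Slurm memory request and the JVM heap size from one value.
# PILON_JAR and the input file names are hypothetical placeholders.
MEM_GB=16
PILON_JAR="/path/to/pilon.jar"

SLURM_OPTS="--mem=${MEM_GB}G"
PILON_CMD="java -Xmx${MEM_GB}G -jar ${PILON_JAR} --genome draft.fasta --frags reads.bam --output polished"

# Print the commands rather than running them, so they can be reviewed first.
echo "salloc ${SLURM_OPTS}"
echo "${PILON_CMD}"
```

Changing MEM_GB updates both the Slurm request and the -Xmx setting, so the two cannot drift apart.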

 

porechop

0.2.4

A tool for finding and removing adapters from Oxford Nanopore reads. Adapters on the ends of reads are trimmed off, and when a read has an adapter in its middle, it is treated as chimeric and chopped into separate reads. Porechop performs thorough alignments to effectively find adapters, even at low sequence identity.

porechop -h

racon

1.4.20

A standalone consensus module intended to correct raw contigs generated by rapid assembly methods which do not include a consensus step.

racon -h

tabview

1.4.3

A curses command-line CSV and list (tabular data) viewer.

tabview -h

Using

Use the module name gentools to discover versions available and to load the application.
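A minimal sketch of that workflow, assuming the cluster provides the standard module command (as Lmod and Environment Modules both do). The guard simply prevents errors if the snippet is run somewhere without it.

```shell
#!/bin/bash
# Sketch: discover and load the gentools collection.
MODULE_NAME="gentools"

if type module >/dev/null 2>&1; then
    module avail "$MODULE_NAME"   # list the versions that are installed
    module load "$MODULE_NAME"    # load the default version
else
    echo "module command not found; run this on the cluster" >&2
fi
```

Once the module is loaded, the commands listed in the table above (for example blobtools -h) are available on your PATH.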

Additional Standalone Versions

Additionally, the following applications have their own standalone module versions, because their newer versions could not be installed in the collective environment above.

Application / Package

Version

Notes:

canu

2.2

Use the module name canu to discover versions available and to load the application.

masurca

4.0.9

Use the module name masurca to discover versions available and to load the application.