Kraken
Overview
Kraken 2 is a system for assigning taxonomic labels to short DNA sequences, usually obtained through metagenomic studies. Previous attempts by other bioinformatics software to accomplish this task have often used sequence alignment or machine learning techniques that were quite slow, leading to the development of less sensitive but much faster abundance estimation programs. Kraken aims to achieve high sensitivity and high speed by utilizing exact alignments of k-mers and a novel classification algorithm.
Kraken 2 is the newest version of Kraken, a taxonomic classification system using exact k-mer matches to achieve high accuracy and fast classification speeds. This classifier matches each k-mer within a query sequence to the lowest common ancestor (LCA) of all genomes containing the given k-mer. The k-mer assignments inform the classification algorithm. In its fastest mode of operation, for a simulated metagenome of 100 bp reads, Kraken processed over 4 million reads per minute on a single core, over 900 times faster than Megablast and over 11 times faster than the abundance estimation program MetaPhlAn. Kraken's accuracy is comparable with Megablast, with slightly lower sensitivity and very high precision.
Using
Use the module name kraken to discover versions available and to load the application.
Note: The Dependencies entry under System Requirements in the Kraken 2 documentation states: Multithreading is handled using OpenMP. ... Unlike Kraken 1, Kraken 2 does not use an external k-mer counter. However, by default, Kraken 2 will attempt to use the dustmasker or segmasker programs provided as part of NCBI's BLAST suite to mask low-complexity regions (see Masking of Low-complexity Sequences).
Multicore
The kraken2 application can run across multiple threads. Look at the command's usage, kraken2 -h, for further details on the --threads option.
Example
The example is based on the Standard Kraken 2 Database. In line with the note above, it also loads the gpu-blast/1.1 module.
The --threads option value must match the cpus-per-task value requested for the job.
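One way to keep the two values in step is to derive --threads from the environment variable that Slurm sets inside a job, rather than typing the number twice. The following is a minimal sketch under stated assumptions: the helper name kraken2_cmd, the report and output file names, and the paths are hypothetical placeholders, and the script only prints the command line so it can be inspected first.

```shell
#!/bin/bash
#SBATCH --cpus-per-task=8   # example value; --threads below follows it

# Inside a Slurm job, SLURM_CPUS_PER_TASK holds the --cpus-per-task value.
# Fall back to 1 so the script also runs outside of a job.
threads="${SLURM_CPUS_PER_TASK:-1}"

# Hypothetical helper: print the kraken2 command line for a database
# directory ($1) and a reads file ($2), using the thread count above.
kraken2_cmd() {
    printf 'kraken2 --db %s --threads %s --report report.txt --output output.txt %s\n' \
        "$1" "$threads" "$2"
}

# Placeholder paths; replace with your own database and reads.
kraken2_cmd /path/to/kraken2_db reads.fastq.gz
```

Once the printed command looks right, replace the printf in the helper with the kraken2 invocation itself (after loading the kraken module) to run the classification.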
While processing the kraken2-build command, Kraken 2 requires the dustmasker and segmasker programs, provided as part of NCBI's BLAST suite, to be in its path.
To enable this, the module load process automatically loads the following:
Teton:
gpu-blast/1.1
Beartooth:
blast-plus/2.13.0
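To confirm that the masking programs are reachable after loading the module, a quick check of the PATH can help. This is a minimal sketch; the function name check_maskers is hypothetical and not part of Kraken 2 or the module system.

```shell
# Hypothetical helper: report whether dustmasker and segmasker are on PATH.
# Returns 0 when both are found and 1 when at least one is missing.
check_maskers() {
    missing=0
    for tool in dustmasker segmasker; do
        if command -v "$tool" >/dev/null 2>&1; then
            echo "found $tool"
        else
            echo "missing $tool"
            missing=1
        fi
    done
    return "$missing"
}
```

Run check_maskers after loading the kraken module and before calling kraken2-build; a "missing" line suggests the BLAST module was not loaded in the current shell.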
Kraken Databases
There are a number of reference database resources available, and we are happy to extend this list as researchers find useful links. Please email useful resources to arcc-help@uwyo.edu.
You can download these locally to your home/project space, or approach ARCC and we can discuss whether we can make these a more global resource.
If downloaded locally then you’ll need to indicate to kraken2 where the database is in your home/project space. For example:
kraken2 --db ~/SupportingData/kraken2/k2_pluspfp_20210127 \
--threads 32 \
Starter_examples-246915672/SMS2_S2_L001_R1_001.fastq.gz Starter_examples-246915672/SMS2_S2_L001_R2_001.fastq.gz \
--use-names --minimum-base-quality 25 --confidence 0.5 \
--report 05SMS2ReportConf25.txt --output xxx.txt
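Before launching a long classification job, it can be worth checking that the directory passed to --db really is a complete Kraken 2 database. A Kraken 2 database directory contains at least the files hash.k2d, opts.k2d and taxo.k2d. The sketch below tests for them; the function name check_kraken2_db is a hypothetical helper, not part of Kraken 2.

```shell
# Hypothetical helper: verify that a directory ($1) holds the three files
# a Kraken 2 database needs. Returns 1 at the first missing file.
check_kraken2_db() {
    db_dir="$1"
    for f in hash.k2d opts.k2d taxo.k2d; do
        if [ ! -f "$db_dir/$f" ]; then
            echo "missing $db_dir/$f" >&2
            return 1
        fi
    done
    return 0
}
```

For example, check_kraken2_db ~/SupportingData/kraken2/k2_pluspfp_20210127 should return successfully for a fully downloaded and unpacked database.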
Issues
rsync vs ftp
Under the hood, kraken2 uses rsync to download data. During testing we have experienced various issues while downloading databases.
For example:
rsync: failed to connect to ftp.ncbi.nlm.nih.gov (130.14.250.10): Connection refused (111)
rsync: failed to connect to ftp.ncbi.nlm.nih.gov (130.14.250.12): Connection refused (111)
rsync: failed to connect to ftp.ncbi.nlm.nih.gov (2607:f220:41f:250::229): Network is unreachable (101)
rsync: failed to connect to ftp.ncbi.nlm.nih.gov (2607:f220:41e:250::11): Network is unreachable (101)
rsync error: error in socket IO (code 10) at clientserver.c(128) [Receiver=3.1.3]
There are various issue pages discussing the problem, with the suggestion to use the --use-ftp option with kraken2-build. See kraken2-build --help for further details.
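The suggestion can be sketched as a small wrapper that tries the default rsync download first and retries with --use-ftp only if that fails. This is an illustration under stated assumptions: the function name download_library and the KRAKEN2_BUILD override variable are hypothetical, while --download-library, --db and --use-ftp are kraken2-build options.

```shell
# Hypothetical wrapper around kraken2-build --download-library.
# $1 = library name (for example bacteria), $2 = database directory.
download_library() {
    # KRAKEN2_BUILD is an assumed override, useful for testing the wrapper.
    builder="${KRAKEN2_BUILD:-kraken2-build}"

    # First attempt: default behaviour, which downloads via rsync.
    if "$builder" --download-library "$1" --db "$2"; then
        return 0
    fi

    # Second attempt: same request, but over FTP.
    echo "rsync download failed; retrying with --use-ftp" >&2
    "$builder" --download-library "$1" --db "$2" --use-ftp
}

# Example call (placeholder database path):
# download_library bacteria "$HOME/kraken2/mydb"
```

If rsync is known to be blocked on your network, calling kraken2-build with --use-ftp directly avoids the failed first attempt.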