Kraken

1 Overview | 2 Using | 2.1 Multicore | 2.2 Example | 2.3 Kraken Databases | 2.4 Issues | 2.4.1 rsync vs ftp

Overview

Kraken 2 is a system for assigning taxonomic labels to short DNA sequences, usually obtained through metagenomic studies. Previous attempts by other bioinformatics software to accomplish this task have often used sequence alignment or machine learning techniques that were quite slow, leading to the development of less sensitive but much faster abundance estimation programs. Kraken aims to achieve high sensitivity and high speed by utilizing exact alignments of k-mers and a novel classification algorithm.

Kraken 2 is the newest version of Kraken, a taxonomic classification system using exact k-mer matches to achieve high accuracy and fast classification speeds. This classifier matches each k-mer within a query sequence to the lowest common ancestor (LCA) of all genomes containing the given k-mer. The k-mer assignments inform the classification algorithm. In its fastest mode of operation, for a simulated metagenome of 100 bp reads, Kraken processed over 4 million reads per minute on a single core, over 900 times faster than Megablast and over 11 times faster than the abundance estimation program MetaPhlAn. Kraken's accuracy is comparable with Megablast, with slightly lower sensitivity and very high precision.

Using

Use the module name kraken to discover versions available and to load the application.

Note: The Dependencies section under System Requirements in the Kraken 2 documentation states: Multithreading is handled using OpenMP. ... Unlike Kraken 1, Kraken 2 does not use an external k-mer counter. However, by default, Kraken 2 will attempt to use the dustmasker or segmasker programs provided as part of NCBI's BLAST suite to mask low-complexity regions (see Masking of Low-complexity Sequences).

Multicore

The kraken2 application can run across multiple threads. See the command's usage (kraken2 -h) for further details on the --threads option.

Example

This example builds the Standard Kraken 2 Database. With respect to the note above, the example also loads gpu-blast/1.1.

[]$ salloc -A <enter-your-project> --time=6:00:00 -N 1 --cpus-per-task=32 --mem=0
[]$ module load kraken/2.0
[]$ module load gpu-blast/1.1
[]$ srun kraken2-build --standard --threads 32 --db KDB
Step 1/2: Performing rsync file transfer of requested files
Rsync file transfer complete.
Step 2/2: Assigning taxonomic IDs to sequences
Processed 341 projects (530 sequences, 872.20 Mbp)... done.
All files processed, cleaning up extra sequence files... done, library complete.
Masking low-complexity regions of downloaded library... done.
Step 1/2: Performing rsync file transfer of requested files
Rsync file transfer complete.
Step 2/2: Assigning taxonomic IDs to sequences
Processed 17072 projects (36839 sequences, 68.61 Gbp)... done.
All files processed, cleaning up extra sequence files... done, library complete.
Masking low-complexity regions of downloaded library... done.
Step 1/2: Performing rsync file transfer of requested files
Rsync file transfer complete.
Step 2/2: Assigning taxonomic IDs to sequences
Processed 9331 projects (11953 sequences, 314.52 Mbp)... done.
All files processed, cleaning up extra sequence files... done, library complete.
Masking low-complexity regions of downloaded library... done.
mv: try to overwrite ‘assembly_summary.txt’, overriding mode 0444 (r--r--r--)? y
Step 1/2: Performing rsync file transfer of requested files
Rsync file transfer complete.
Step 2/2: Assigning taxonomic IDs to sequences
Processed 1 project (639 sequences, 3.27 Gbp)... done.
All files processed, cleaning up extra sequence files... done, library complete.
Downloading UniVec_Core data from server... done.
Adding taxonomy ID of 28384 to all sequences... done.
Masking low-complexity regions of downloaded library... done.
Creating sequence ID to taxonomy ID map (step 1)...
Sequence ID to taxonomy ID map complete. [0.049s]
Estimating required capacity (step 2)...
Estimated hash table requirement: 42273822720 bytes
Capacity estimation complete. [10m27.202s]
Building database files (step 3)...
Taxonomy parsed and converted.
CHT created with 15 bits reserved for taxid.
Completed processing of 53095 sequences, 73070206125 bp
Writing data to disk... complete.
Database files completed. [1h7m47.355s]
Database construction complete. [Total: 1h18m8.140s]

The --threads option value must match the --cpus-per-task value.
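One way to keep the two values in step is to read the thread count from the environment variable Slurm sets for the allocation, rather than typing the number twice. The following is a minimal sketch, assuming the job was requested with --cpus-per-task so that SLURM_CPUS_PER_TASK is defined. The helper name kraken_threads is hypothetical, and the fallback of 1 thread is an arbitrary choice for illustration.

```shell
# Hypothetical helper: print the thread count to pass to --threads.
# Uses SLURM_CPUS_PER_TASK when Slurm has set it, otherwise falls back to 1.
kraken_threads() {
  echo "${SLURM_CPUS_PER_TASK:-1}"
}

# Example use inside an allocation (not executed here):
#   srun kraken2-build --standard --threads "$(kraken_threads)" --db KDB
```

With this approach, changing --cpus-per-task on the salloc line is the only edit needed when resizing a job.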

While processing the kraken2-build command, Kraken 2 requires the dustmasker and segmasker programs, provided as part of NCBI's BLAST suite, to be on its path.

To enable this, as part of the module load process we automatically load the following:

  • Teton: gpu-blast/1.1

  • Beartooth: blast-plus/2.13.0
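If a build fails at the masking step, it may help to confirm that both masking programs are visible on the path after the modules are loaded. The sketch below is one possible check; the function name missing_tools is hypothetical and is not part of Kraken 2 or the module system.

```shell
# Hypothetical helper: print each named program that is not found on PATH.
missing_tools() {
  for tool in "$@"; do
    command -v "$tool" >/dev/null 2>&1 || echo "$tool"
  done
}

# Example use after loading the kraken module (not executed here):
#   missing_tools dustmasker segmasker
# An empty result means both programs were found.
```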

Kraken Databases

There are a number of reference database resources available, and we are happy to extend this list as researchers find useful links. Please email useful resources to arcc-help@uwyo.edu.

You can download these locally to your home/project space, or approach ARCC and we can discuss whether we can make these a more global resource.

If downloaded locally then you’ll need to indicate to kraken2 where the database is in your home/project space. For example:

kraken2 --db ~/SupportingData/kraken2/k2_pluspfp_20210127 \
  --threads 32 \
  Starter_examples-246915672/SMS2_S2_L001_R1_001.fastq.gz Starter_examples-246915672/SMS2_S2_L001_R2_001.fastq.gz \
  --use-names --minimum-base-quality 25 --confidence 0.5 \
  --report 05SMS2ReportConf25.txt --output xxx.txt
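Before starting a long classification run, it can be worth checking that the directory passed to kraken2 looks like a complete database. The sketch below assumes a built Kraken 2 database consists of the three files hash.k2d, opts.k2d and taxo.k2d, which is how downloaded and locally built databases are normally laid out. The function name is_kraken2_db is hypothetical.

```shell
# Hypothetical helper: succeed only if the directory holds the three files
# a built Kraken 2 database is assumed to contain.
is_kraken2_db() {
  db_dir="$1"
  for f in hash.k2d opts.k2d taxo.k2d; do
    # Report the first missing file on stderr and fail.
    [ -f "$db_dir/$f" ] || { echo "missing: $db_dir/$f" >&2; return 1; }
  done
}

# Example use (not executed here):
#   is_kraken2_db ~/SupportingData/kraken2/k2_pluspfp_20210127 && echo "database looks complete"
```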

Issues

rsync vs ftp

Under the hood, kraken2 uses rsync to download data. During testing we have experienced various issues while downloading databases.

For example:

rsync: failed to connect to ftp.ncbi.nlm.nih.gov (130.14.250.10): Connection refused (111)
rsync: failed to connect to ftp.ncbi.nlm.nih.gov (130.14.250.12): Connection refused (111)
rsync: failed to connect to ftp.ncbi.nlm.nih.gov (2607:f220:41f:250::229): Network is unreachable (101)
rsync: failed to connect to ftp.ncbi.nlm.nih.gov (2607:f220:41e:250::11): Network is unreachable (101)
rsync error: error in socket IO (code 10) at clientserver.c(128) [Receiver=3.1.3]

Various issue pages discuss the problem and suggest passing the --use-ftp option to kraken2-build. See kraken2-build --help for further details.
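The suggestion can be wrapped in a small retry so that FTP is used only when the first attempt fails. This is a minimal sketch, not an ARCC-provided tool: the wrapper name build_with_ftp_fallback is hypothetical, and it retries on any non-zero exit status, not only on the rsync connection errors shown above.

```shell
# Hypothetical wrapper: run a build command as given; if it fails,
# run it once more with --use-ftp appended.
build_with_ftp_fallback() {
  if "$@"; then
    return 0
  fi
  echo "first attempt failed; retrying with --use-ftp" >&2
  "$@" --use-ftp
}

# Example use (not executed here):
#   build_with_ftp_fallback kraken2-build --standard --threads 32 --db KDB
```

Because the retry is unconditional, check the output of the first attempt to confirm that the failure really was a download problem.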