QUAST
- Joshua Cook (Unlicensed)
Overview
QUAST is a genome assembly evaluation tool. QUAST evaluates genome assemblies by computing various metrics. It works both with and without reference genomes. The tool accepts multiple assemblies, thus is suitable for comparison.
Using
Use the module name quast
to discover versions available and to load the application.
Example
[]$ quast --help
QUAST: Quality Assessment Tool for Genome Assemblies
Version: 5.0.2
Usage: python /apps/u/opt/quast/5.0.2/bin/quast [options] <files_with_contigs>
Options:
-o --output-dir <dirname> Directory to store all result files [default: quast_results/results_<datetime>]
-r <filename> Reference genome file
-g --features [type:]<filename> File with genomic feature coordinates in the reference (GFF, BED, NCBI or TXT)
Optional 'type' can be specified for extracting only a specific feature type from GFF
-m --min-contig <int> Lower threshold for contig length [default: 500]
-t --threads <int> Maximum number of threads [default: 25% of CPUs]
Advanced options:
-s --split-scaffolds Split assemblies by continuous fragments of N's and add such "contigs" to the comparison
-l --labels "label, label, ..." Names of assemblies to use in reports, comma-separated. If contain spaces, use quotes
-L Take assembly names from their parent directory names
-e --eukaryote Genome is eukaryotic (primarily affects gene prediction)
--fungus Genome is fungal (primarily affects gene prediction)
--large Use optimal parameters for evaluation of large genomes
In particular, imposes '-e -m 3000 -i 500 -x 7000' (can be overridden manually)
-k --k-mer-stats Compute k-mer-based quality metrics (recommended for large genomes)
This may significantly increase memory and time consumption on large genomes
--k-mer-size Size of k used in --k-mer-stats [default: 101]
--circos Draw Circos plot
-f --gene-finding Predict genes using GeneMarkS (prokaryotes, default) or GeneMark-ES (eukaryotes, use --eukaryote)
--mgm Use MetaGeneMark for gene prediction (instead of the default finder, see above)
--glimmer Use GlimmerHMM for gene prediction (instead of the default finder, see above)
--gene-thresholds <int,int,...> Comma-separated list of threshold lengths of genes to search with Gene Finding module
[default: 0,300,1500,3000]
--rna-finding Predict ribosomal RNA genes using Barrnap
-b --conserved-genes-finding Count conserved orthologs using BUSCO (only on Linux)
--operons <filename> File with operon coordinates in the reference (GFF, BED, NCBI or TXT)
--est-ref-size <int> Estimated reference size (for computing NGx metrics without a reference)
--contig-thresholds <int,int,...> Comma-separated list of contig length thresholds [default: 0,1000,5000,10000,25000,50000]
-u --use-all-alignments Compute genome fraction, # genes, # operons in QUAST v1.* style.
By default, QUAST filters Minimap's alignments to keep only best ones
-i --min-alignment <int> The minimum alignment length [default: 65]
--min-identity <float> The minimum alignment identity (80.0, 100.0) [default: 95.0]
-a --ambiguity-usage <none|one|all> Use none, one, or all alignments of a contig when all of them
are almost equally good (see --ambiguity-score) [default: one]
--ambiguity-score <float> Score S for defining equally good alignments of a single contig. All alignments are sorted
by decreasing LEN * IDY% value. All alignments with LEN * IDY% < S * best(LEN * IDY%) are
discarded. S should be between 0.8 and 1.0 [default: 0.99]
--strict-NA Break contigs in any misassembly event when compute NAx and NGAx.
By default, QUAST breaks contigs only by extensive misassemblies (not local ones)
-x --extensive-mis-size <int> Lower threshold for extensive misassembly size. All relocations with inconsistency
less than extensive-mis-size are counted as local misassemblies [default: 1000]
--scaffold-gap-max-size <int> Max allowed scaffold gap length difference. All relocations with inconsistency
less than scaffold-gap-size are counted as scaffold gap misassemblies [default: 10000]
--unaligned-part-size <int> Lower threshold for detecting partially unaligned contigs. Such contig should have
at least one unaligned fragment >= the threshold [default: 500]
--skip-unaligned-mis-contigs Do not distinguish contigs with >= 50% unaligned bases as a separate group
By default, QUAST does not count misassemblies in them
--fragmented Reference genome may be fragmented into small pieces (e.g. scaffolded reference)
--fragmented-max-indent <int> Mark translocation as fake if both alignments are located no further than N bases
from the ends of the reference fragments [default: 85]
Requires --fragmented option
--upper-bound-assembly Simulate upper bound assembly based on the reference genome and reads
--upper-bound-min-con <int> Minimal number of 'connecting reads' needed for joining upper bound contigs into a scaffold
[default: 2 for mate-pairs and 1 for long reads]
--est-insert-size <int> Use provided insert size in upper bound assembly simulation [default: auto detect from reads or 255]
--plots-format <str> Save plots in specified format [default: pdf].
Supported formats: emf, eps, pdf, png, ps, raw, rgba, svg, svgz
--memory-efficient Run everything using one thread, separately per each assembly.
This may significantly reduce memory consumption on large genomes
--space-efficient Create only reports and plots files. Aux files including .stdout, .stderr, .coords will not be created.
This may significantly reduce space consumption on large genomes. Icarus viewers also will not be built
-1 --pe1 <filename> File with forward paired-end reads (in FASTQ format, may be gzipped)
-2 --pe2 <filename> File with reverse paired-end reads (in FASTQ format, may be gzipped)
--pe12 <filename> File with interlaced forward and reverse paired-end reads. (in FASTQ format, may be gzipped)
--mp1 <filename> File with forward mate-pair reads (in FASTQ format, may be gzipped)
--mp2 <filename> File with reverse mate-pair reads (in FASTQ format, may be gzipped)
--mp12 <filename> File with interlaced forward and reverse mate-pair reads (in FASTQ format, may be gzipped)
--single <filename> File with unpaired short reads (in FASTQ format, may be gzipped)
--pacbio <filename> File with PacBio reads (in FASTQ format, may be gzipped)
--nanopore <filename> File with Oxford Nanopore reads (in FASTQ format, may be gzipped)
--ref-sam <filename> SAM alignment file obtained by aligning reads to reference genome file
--ref-bam <filename> BAM alignment file obtained by aligning reads to reference genome file
--sam <filename,filename,...> Comma-separated list of SAM alignment files obtained by aligning reads to assemblies
(use the same order as for files with contigs)
--bam <filename,filename,...> Comma-separated list of BAM alignment files obtained by aligning reads to assemblies
(use the same order as for files with contigs)
Reads (or SAM/BAM file) are used for structural variation detection and
coverage histogram building in Icarus
--sv-bedpe <filename> File with structural variations (in BEDPE format)
Speedup options:
--no-check Do not check and correct input fasta files. Use at your own risk (see manual)
--no-plots Do not draw plots
--no-html Do not build html reports and Icarus viewers
--no-icarus Do not build Icarus viewers
--no-snps Do not report SNPs (may significantly reduce memory consumption on large genomes)
--no-gc Do not compute GC% and GC-distribution
--no-sv Do not run structural variation detection (make sense only if reads are specified)
--no-gzip Do not compress large output files
--no-read-stats Do not align reads to assemblies
Reads will be aligned to reference and used for coverage analysis,
upper bound assembly simulation, and structural variation detection.
Use this option if you do not need read statistics for assemblies.
--fast A combination of all speedup options except --no-check
Other:
--silent Do not print detailed information about each step to stdout (log file is not affected)
--test Run QUAST on the data from the test_data folder, output to quast_test_output
--test-sv Run QUAST with structural variants detection on the data from the test_data folder,
output to quast_test_output
-h --help Print full usage message
-v --version Print version
Online QUAST manual is available at http://quast.sf.net/manual
Using Multiple Cores/Threads
The quast application allows you define multiple cores/threads within calls using the '-t or --threads for number of threads allocated to the program
' options.
After downloading the example files, a batch script would take the following form where we are defining and using 8 cores (on a single node).
...
#SBATCH --cpus-per-task=8
...
quast --threads 8 ./test_data/contigs_1.fasta ./test_data/contigs_2.fasta -r ./test_data/reference.fasta.gz -g ./test_data/genes.gff
The best number of cores/threads and memory to use for a particular job will depend on the data and is up to the researcher to explore.
Please feel free to report/share any interesting observations and we can update this page.