Overview
SourceTracker2: Contamination is a critical issue in high-throughput metagenomic studies, yet progress toward a comprehensive solution has been limited. SourceTracker is a Bayesian approach to estimating the proportion of contaminants in a given community that come from possible source environments. The authors applied SourceTracker to microbial surveys from neonatal intensive care units (NICUs), offices and molecular biology laboratories, and provide a database of known contaminants for future testing.
Using
Use the module name sourcetracker2 to discover versions available and to load the application.
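For example (the module name shown is an assumption; confirm it with module spider on your cluster):

```shell
# Discover which versions are available, then load one.
module spider sourcetracker2
module load sourcetracker2
```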
Example
[salexan5@tlog1 ~]$ sourcetracker2 gibbs --help
Usage: sourcetracker2 gibbs [OPTIONS]

  Gibb's sampler for Bayesian estimation of microbial sample sources.

  For details, see the project README file.

Options:
  -i, --table_fp FILE             Path to input BIOM table.  [required]
  -m, --mapping_fp FILE           Path to sample metadata mapping file.
                                  [required]
  -o, --output_dir FILE           Path to the output directory to be created.
                                  [required]
  --loo                           Classify each sample in `sources` using a
                                  leave-one-out strategy. Replicates -s option
                                  in Knights et al. sourcetracker.
                                  [default: False]
  --jobs INTEGER                  Number of processes to launch.  [default: 1]
  --alpha1 FLOAT                  Prior counts of each species in the training
                                  environments. Higher values decrease the
                                  trust in the training environments, and make
                                  the source environment distributions over
                                  taxa smoother. By default, this is set to
                                  0.001, which indicates reasonably high trust
                                  in all source environments, even those with
                                  few training sequences. This is useful when
                                  only a small number of biological samples
                                  are available from a source environment. A
                                  more conservative value would be 0.01.
                                  [default: 0.001]
  --alpha2 FLOAT                  Prior counts of each species in the Unknown
                                  environment. Higher values make the Unknown
                                  environment smoother and less prone to
                                  overfitting given a training sample.
                                  [default: 0.001]
  --beta INTEGER                  Count to be added to each species in each
                                  environment, including `unknown`.
                                  [default: 10]
  --source_rarefaction_depth INTEGER
                                  Depth at which to rarefy sources. If 0, no
                                  rarefaction is performed.  [default: 1000]
  --sink_rarefaction_depth INTEGER
                                  Depth at which to rarefy sinks. If 0, no
                                  rarefaction is performed.  [default: 1000]
  --restarts INTEGER              Number of independent Markov chains to grow.
                                  `draws_per_restart` * `restarts` gives the
                                  number of samplings of the mixing
                                  proportions that will be generated.
                                  [default: 10]
  --draws_per_restart INTEGER     Number of times to sample the state of the
                                  Markov chain for each independent chain
                                  grown.  [default: 1]
  --burnin INTEGER                Number of passes (withdrawal and
                                  reassignment of every sequence in the sink)
                                  that will be made before a sample (draw) is
                                  taken. Higher values allow more convergence
                                  towards the true distribution before draws
                                  are taken.  [default: 100]
  --delay INTEGER                 Number of passes between each sampling
                                  (draw) of the Markov chain. Once the burn-in
                                  passes have been made, a sample will be
                                  taken every `delay` number of passes. This
                                  is also known as `thinning`. Thinning helps
                                  reduce the impact of correlation between
                                  adjacent states of the Markov chain.
                                  [default: 10]
  --cluster_start_delay INTEGER   When using multiple jobs, the script has to
                                  start an `ipcluster`. If ipcluster does not
                                  recognize that it has been successfully
                                  started, the jobs will not be successfully
                                  launched. If this is happening, increase
                                  this parameter.  [default: 25]
  --source_sink_column TEXT       Sample metadata column indicating which
                                  samples should be treated as sources and
                                  which as sinks.  [default: SourceSink]
  --source_column_value TEXT      Value in source_sink_column indicating which
                                  samples should be treated as sources.
                                  [default: source]
  --sink_column_value TEXT        Value in source_sink_column indicating which
                                  samples should be treated as sinks.
                                  [default: sink]
  --source_category_column TEXT   Sample metadata column indicating the type
                                  of each source sample.  [default: Env]
  --help                          Show this message and exit.
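The sampling options above combine in a simple way: `restarts` * `draws_per_restart` gives the total number of draws of the mixing proportions, and each chain makes `burnin` passes plus `delay` passes per draw. A small sketch using the CLI defaults (the pass count assumes each draw is taken `delay` passes after the previous one, with the first draw following burn-in):

```python
# Sketch of how SourceTracker2's Gibbs-sampling options combine.
# Values below are the CLI defaults shown in the --help output.
restarts = 10          # --restarts: independent Markov chains
draws_per_restart = 1  # --draws_per_restart: draws taken per chain
burnin = 100           # --burnin: passes before the first draw
delay = 10             # --delay: passes between draws (thinning)

# Total samplings of the mixing proportions across all chains.
total_draws = restarts * draws_per_restart

# Passes each chain makes, assuming one draw every `delay` passes
# after burn-in (an assumption about the exact draw schedule).
passes_per_chain = burnin + delay * draws_per_restart

print(total_draws)       # 10
print(passes_per_chain)  # 110
```

Increasing `--restarts` or `--draws_per_restart` therefore increases both the number of draws and the total amount of Gibbs sampling performed.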
Parallelization
The SourceTracker2 documentation indicates that jobs can be run in parallel using the --jobs option.
At this stage ARCC is unsure if this is actually working correctly.
Following the examples, we set the following in the submission script:
#SBATCH --nodes=1
#SBATCH --ntasks-per-node=1
#SBATCH --cpus-per-task=5
The example that uses:
sourcetracker2 gibbs -i otu_table.biom -m map.txt -o example6/ --jobs 5
actually runs significantly slower than it does with --jobs 1.
This may be an artifact of the small test dataset being used.
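For reference, a complete submission script for such a test might look like the following sketch. The account, time limit, and module name are placeholders; adjust them for your allocation:

```shell
#!/bin/bash
#SBATCH --account=<your-account>   # placeholder: your project account
#SBATCH --nodes=1
#SBATCH --ntasks-per-node=1
#SBATCH --cpus-per-task=5
#SBATCH --time=01:00:00            # placeholder: adjust to your data size

# Module name assumed; confirm with `module spider` on your cluster.
module load sourcetracker2

# Match --jobs to the CPUs requested above.
sourcetracker2 gibbs -i otu_table.biom -m map.txt -o example6/ \
    --jobs "$SLURM_CPUS_PER_TASK"
```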
Examples
We have also noticed that the example:
sourcetracker2 gibbs -i otu_table.biom -m map.txt -o example7/ --jobs 5 --per_sink_feature_assignments
fails with error message:
Error: no such option: --per_sink_feature_assignments
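This option appears to depend on the installed version. One generic way to check whether your installed version supports it is to search the help text:

```shell
# Print any help lines mentioning per_sink; no output suggests the
# installed version does not support the flag.
sourcetracker2 gibbs --help | grep -- per_sink
```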
We would welcome any comments or observations from users that we can use to update these pages.