Overview
SourceTracker2: Contamination is a critical issue in high-throughput metagenomic studies, yet progress toward a comprehensive solution has been limited. SourceTracker is a Bayesian approach to estimating the proportion of contaminants in a given community that come from possible source environments. The authors applied SourceTracker to microbial surveys from neonatal intensive care units (NICUs), offices and molecular biology laboratories, and provide a database of known contaminants for future testing.
Using
Use the module name sourcetracker2 to discover versions available and to load the application.
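For example (the module name shown is an assumption; confirm it with module spider on your cluster):

```shell
# Discover which versions are available, then load one.
module spider sourcetracker2
module load sourcetracker2
```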
Example
[salexan5@tlog1 ~]$ sourcetracker2 gibbs --help
Usage: sourcetracker2 gibbs [OPTIONS]

  Gibb's sampler for Bayesian estimation of microbial sample sources.

  For details, see the project README file.

Options:
  -i, --table_fp FILE             Path to input BIOM table.  [required]
  -m, --mapping_fp FILE           Path to sample metadata mapping file.
                                  [required]
  -o, --output_dir FILE           Path to the output directory to be created.
                                  [required]
  --loo                           Classify each sample in `sources` using a
                                  leave-one-out strategy. Replicates -s option
                                  in Knights et al. sourcetracker.
                                  [default: False]
  --jobs INTEGER                  Number of processes to launch.  [default: 1]
  --alpha1 FLOAT                  Prior counts of each species in the training
                                  environments. Higher values decrease the
                                  trust in the training environments, and make
                                  the source environment distributions over
                                  taxa smoother. By default, this is set to
                                  0.001, which indicates reasonably high trust
                                  in all source environments, even those with
                                  few training sequences. This is useful when
                                  only a small number of biological samples
                                  are available from a source environment. A
                                  more conservative value would be 0.01.
                                  [default: 0.001]
  --alpha2 FLOAT                  Prior counts of each species in the Unknown
                                  environment. Higher values make the Unknown
                                  environment smoother and less prone to
                                  overfitting given a training sample.
                                  [default: 0.001]
  --beta INTEGER                  Count to be added to each species in each
                                  environment, including `unknown`.
                                  [default: 10]
  --source_rarefaction_depth INTEGER
                                  Depth at which to rarefy sources. If 0, no
                                  rarefaction is performed.  [default: 1000]
  --sink_rarefaction_depth INTEGER
                                  Depth at which to rarefy sinks. If 0, no
                                  rarefaction is performed.  [default: 1000]
  --restarts INTEGER              Number of independent Markov chains to grow.
                                  `draws_per_restart` * `restarts` gives the
                                  number of samplings of the mixing
                                  proportions that will be generated.
                                  [default: 10]
  --draws_per_restart INTEGER     Number of times to sample the state of the
                                  Markov chain for each independent chain
                                  grown.  [default: 1]
  --burnin INTEGER                Number of passes (withdrawal and
                                  reassignment of every sequence in the sink)
                                  that will be made before a sample (draw) is
                                  taken. Higher values allow more convergence
                                  towards the true distribution before draws
                                  are taken.  [default: 100]
  --delay INTEGER                 Number of passes between each sampling
                                  (draw) of the Markov chain. Once the burn-in
                                  passes have been made, a sample will be
                                  taken every `delay` number of passes. This
                                  is also known as `thinning`. Thinning helps
                                  reduce the impact of correlation between
                                  adjacent states of the Markov chain.
                                  [default: 10]
  --cluster_start_delay INTEGER   When using multiple jobs, the script has to
                                  start an `ipcluster`. If ipcluster does not
                                  recognize that it has been successfully
                                  started, the jobs will not be successfully
                                  launched. If this is happening, increase
                                  this parameter.  [default: 25]
  --source_sink_column TEXT       Sample metadata column indicating which
                                  samples should be treated as sources and
                                  which as sinks.  [default: SourceSink]
  --source_column_value TEXT      Value in source_sink_column indicating which
                                  samples should be treated as sources.
                                  [default: source]
  --sink_column_value TEXT        Value in source_sink_column indicating which
                                  samples should be treated as sinks.
                                  [default: sink]
  --source_category_column TEXT   Sample metadata column indicating the type
                                  of each source sample.  [default: Env]
  --help                          Show this message and exit.
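The sampling options above combine in a simple way: `restarts` * `draws_per_restart` gives the total number of draws of the mixing proportions, and each chain makes `burnin` passes plus `delay` passes per draw. A small sketch using the CLI defaults (the pass count assumes each draw is taken `delay` passes after the previous one, with the first draw following burn-in):

```python
# Sketch of how SourceTracker2's Gibbs-sampling options combine.
# Values below are the CLI defaults shown in the --help output.
restarts = 10          # --restarts: independent Markov chains
draws_per_restart = 1  # --draws_per_restart: draws taken per chain
burnin = 100           # --burnin: passes before the first draw
delay = 10             # --delay: passes between draws (thinning)

# Total samplings of the mixing proportions across all chains.
total_draws = restarts * draws_per_restart

# Passes each chain makes, assuming one draw every `delay` passes
# after burn-in (an assumption about the exact draw schedule).
passes_per_chain = burnin + delay * draws_per_restart

print(total_draws)       # 10
print(passes_per_chain)  # 110
```

Increasing `--restarts` or `--draws_per_restart` therefore increases both the number of draws and the total amount of Gibbs sampling performed.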
Parallelization
The SourceTracker2 documentation indicates that jobs can be run in parallel using the --jobs option.
At this stage ARCC is unsure if this is actually working correctly.
Following the examples, we set the following in the submission script:
#SBATCH --nodes=1
#SBATCH --ntasks-per-node=1
#SBATCH --cpus-per-task=5
The example that uses:
sourcetracker2 gibbs -i otu_table.biom -m map.txt -o example6/ --jobs 5
actually runs significantly slower than it does with --jobs 1.
This may be an artifact of the small test dataset being used.
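For reference, a complete submission script for such a test might look like the following sketch. The account, time limit, and module name are placeholders; adjust them for your allocation:

```shell
#!/bin/bash
#SBATCH --account=<your-account>   # placeholder: your project account
#SBATCH --nodes=1
#SBATCH --ntasks-per-node=1
#SBATCH --cpus-per-task=5
#SBATCH --time=01:00:00            # placeholder: adjust to your data size

# Module name assumed; confirm with `module spider` on your cluster.
module load sourcetracker2

# Match --jobs to the CPUs requested above.
sourcetracker2 gibbs -i otu_table.biom -m map.txt -o example6/ \
    --jobs "$SLURM_CPUS_PER_TASK"
```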
Examples
We have also noticed that the example:
sourcetracker2 gibbs -i otu_table.biom -m map.txt -o example7/ --jobs 5 --per_sink_feature_assignments
fails with error message:
Error: no such option: --per_sink_feature_assignments
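This option appears to depend on the installed version. One generic way to check whether your installed version supports it is to search the help text:

```shell
# Print any help lines mentioning per_sink; no output suggests the
# installed version does not support the flag.
sourcetracker2 gibbs --help | grep -- per_sink
```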
We would welcome any comments or observations from users that we can use to update these pages.