...

Use the module name sourcetracker2 to discover versions available and to load the application.
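
As a minimal sketch, assuming the cluster provides an Lmod-style `module` command (`module spider` is specific to Lmod and may not exist on other module systems), discovering and loading the application could look like this:

```shell
# Assumes an Lmod-style `module` command; `module spider` is Lmod-specific.
MODULE_NAME=sourcetracker2

if type module >/dev/null 2>&1; then
    module spider "$MODULE_NAME"   # list the versions that are installed
    module load "$MODULE_NAME"     # load the default version
else
    echo "No 'module' command in this shell; run this on a cluster node." >&2
fi
```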

...

Beartooth

Compared to earlier versions on Teton, the gibbs command is NOT required:

Code Block
[]$ sourcetracker2 gibbs --help
Usage: sourcetracker2 gibbs [OPTIONS]

  Gibb's sampler for Bayesian estimation of microbial sample sources.

  For details, see the project README file.

Options:
  -i, --table_fp FILE             Path to input BIOM table.  [required]
  -m, --mapping_fp FILE           Path to sample metadata mapping file.
                                  [required]

  -o, --output_dir FILE           Path to the output directory to be created.
                                  [required]

  --loo                           Classify each sample in `sources` using a
                                  leave-one-out strategy. Replicates -s option
                                  in Knights et al. sourcetracker.  [default:
                                  False]

  --jobs INTEGER                  Number of processes to launch.  [default: 1]
  --alpha1 FLOAT                  Prior counts of each species in the training
                                  environments. Higher values decrease the
                                  trust in the training environments, and make
                                  the source environment distrubitons over
                                  taxa smoother. By default, this is set to
                                  0.001, which indicates reasonably high trust
                                  in all source environments, even those with
                                  few training sequences. This is useful when
                                  only a small number of biological samples
                                  are available from a source environment. A
                                  more conservative value would be 0.01.
                                  [default: 0.001]

  --alpha2 FLOAT                  Prior counts of each species in Unknown
                                  environment. Higher values make the Unknown
                                  environment smoother and less prone to
                                  overfitting given a training sample.
                                  [default: 0.001]

  --beta INTEGER                  Count to be added to each species in each
                                  environment, including `unknown`.  [default:
                                  10]

  --source_rarefaction_depth INTEGER
                                  Depth at which to rarify sources. If 0, no
                                  rarefaction performed.  [default: 1000]

  --sink_rarefaction_depth INTEGER
                                  Depth at which to rarify sinks. If 0, no
                                  rarefaction performed.  [default: 1000]

  --restarts INTEGER              Number of independent Markov chains to grow.
                                  `draws_per_restart` * `restarts` gives the
                                  number of samplings of the mixing
                                  proportions that will be generated.
                                  [default: 10]

  --draws_per_restart INTEGER     Number of times to sample the state of the
                                  Markov chain for each independent chain
                                  grown.  [default: 1]

  --burnin INTEGER                Number of passes (withdarawal and
                                  reassignment of every sequence in the sink)
                                  that will be made before a sample (draw)
                                  will be taken. Higher values allow more
                                  convergence towards the true distribtion
                                  before draws are taken.  [default: 100]

  --delay INTEGER                 Number passes between each sampling (draw)
                                  of the Markov chain. Once the burnin passes
                                  have been made, a sample will be taken every
                                  `delay` number of passes. This is also known
                                  as `thinning`. Thinning helps reduce the
                                  impact of correlation between adjacent
                                  states of the Markov chain.  [default: 10]

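  --cluster_start_delay INTEGER   When using multiple jobs, the script has to
                                  start an `ipcluster`. If ipcluster does not
                                  recognize that it has been successfully
                                  started, the jobs will not be successfully
                                  launched. If this is happening, increase
                                  this parameter.  [default: 25]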
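
The help text states that `draws_per_restart` * `restarts` gives the number of samplings of the mixing proportions. The short sketch below works that arithmetic through for the default values listed in the help output; the variable names are only illustrative.

```shell
# Values copied from the defaults shown in the help output.
restarts=10
draws_per_restart=1

# Per the help text: draws_per_restart * restarts = number of samplings.
total_draws=$((restarts * draws_per_restart))
echo "total draws of the mixing proportions: $total_draws"
```

By the same rule, passing `--restarts 10 --draws_per_restart 5` would give 50 draws.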

Teton

On Teton, the gibbs command has to be used:

Code Block
[]$ sourcetracker2 --help
Usage: sourcetracker2 [OPTIONS] COMMAND [ARGS]...
Options:
  --version  Show the version and exit.
  --help     Show this message and exit.
Commands:
  gibbs  Gibb's sampler for Bayesian estimation of microbial sample sources.

[]$ sourcetracker2 gibbs --help
Usage: sourcetracker2 gibbs [OPTIONS]
  Gibb's sampler for Bayesian estimation of microbial sample sources.
  For details, see the project README file.
Options:
  -i, --table_fp FILE             Path to input BIOM table.  [required]
  ...
  --cluster_start_delay INTEGER   When using multiple jobs, the script has to
                                  start an `ipcluster`. If ipcluster does not
                                  recognize that it has been successfully
                                  started, the jobs will not be successfully
                                  launched. If this is happening, increase
                                  this parameter.  [default: 25]

  --source_sink_column TEXT       Sample metadata column indicating which
                                  samples should be treated as sources and
                                  which as sinks.  [default: SourceSink]

  --source_column_value TEXT      Value in source_sink_column indicating which
                                  samples should be treated as sources.
                                  [default: source]

  --sink_column_value TEXT        Value in source_sink_column indicating which
                                  samples should be treated as sinks.
                                  [default: sink]

  --source_category_column TEXT   Sample metadata column indicating the type
                                  of each source sample.  [default: Env]

  --help                          Show this message and exit.
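
The metadata options in the help output default to a `SourceSink` column with the values `source` and `sink`, and an `Env` column naming the type of each source. The sketch below builds a tiny mapping file with those defaults and counts the sources and sinks. The sample IDs, environment names, the `#SampleID` header, and the file name `map_example.txt` are all hypothetical and only serve as illustration.

```shell
# Hypothetical mapping file: sample IDs and environment names are invented.
# Column names and values follow the defaults in the help text
# (SourceSink = source|sink, Env = source type). Columns are tab-separated.
printf '#SampleID\tEnv\tSourceSink\n'  > map_example.txt
printf 's1\tsoil\tsource\n'           >> map_example.txt
printf 's2\tsoil\tsource\n'           >> map_example.txt
printf 's3\twater\tsource\n'          >> map_example.txt
printf 's4\tunlabelled\tsink\n'       >> map_example.txt

# Count sources and sinks as a quick sanity check before submitting a job.
n_sources=$(awk -F'\t' 'NR > 1 && $3 == "source"' map_example.txt | wc -l | tr -d ' ')
n_sinks=$(awk -F'\t' 'NR > 1 && $3 == "sink"' map_example.txt | wc -l | tr -d ' ')
echo "sources: $n_sources, sinks: $n_sinks"
```

If a mapping file uses different column names or values, the `--source_sink_column`, `--source_column_value`, `--sink_column_value` and `--source_category_column` options listed in the help output can be set to match.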

Parallelization

The SourceTracker2 documentation indicates that jobs can be run in parallel using the --jobs option.

At this stage, ARCC is unsure whether this is actually working correctly.

Following the examples, we set the following in the submission script:

Code Block
#SBATCH --nodes=1
#SBATCH --ntasks-per-node=1
#SBATCH --cpus-per-task=5

The example that uses:

Code Block
# Example
[]$ sourcetracker2 gibbs -i otu_table.biom -m map.txt -o example6/ --jobs 5

actually runs slower than with --jobs 1, because the individual jobs take significantly longer to run.

This could be because of the test data being used.
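
One way to check this on other data is to time the same command with both settings. The following is a rough sketch, not a benchmark: `time_run` is a hypothetical helper, the output directory names are invented, and the comparison only runs where `sourcetracker2` is on the PATH. Since the help text says the output directory is created by the command, the chosen directories are assumed not to exist yet.

```shell
# Run a command and print its wall-clock time in whole seconds.
time_run() {
    start=$(date +%s)
    "$@" >/dev/null 2>&1 || echo "command failed: $*" >&2
    end=$(date +%s)
    echo $((end - start))
}

# Only attempt the comparison where sourcetracker2 is actually available.
if command -v sourcetracker2 >/dev/null 2>&1; then
    t1=$(time_run sourcetracker2 gibbs -i otu_table.biom -m map.txt -o timing_jobs1/ --jobs 1)
    t5=$(time_run sourcetracker2 gibbs -i otu_table.biom -m map.txt -o timing_jobs5/ --jobs 5)
    echo "--jobs 1: ${t1}s, --jobs 5: ${t5}s"
fi
```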

Examples

...

Error: No --per_sink_feature_assignments option

Following the examples on the website, running the following on Teton:

Code Block
sourcetracker2 gibbs -i otu_table.biom -m map.txt -o example7/ --jobs 5 --per_sink_feature_assignments

...

Code Block
Error: no such option: --per_sink_feature_assignments
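
Before relying on an option from the website examples, the installed version's help output can be searched for it. This is a small sketch; `has_option` is a hypothetical helper that does a plain substring search, so it would also match longer options that begin with the same text.

```shell
# Succeeds if the help text read from stdin mentions the given option.
has_option() {
    grep -q -F -e "$1"
}

opt="--per_sink_feature_assignments"
if command -v sourcetracker2 >/dev/null 2>&1; then
    if sourcetracker2 gibbs --help | has_option "$opt"; then
        echo "$opt is supported by this version"
    else
        echo "$opt is not supported by this version"
    fi
fi
```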

...

Multicore

The sourcetracker2 command can be run with multiple threads; see sourcetracker2 --help for more details on the --jobs option.

Example: Beartooth:

Code Block
#SBATCH --cpus-per-task=16

An example command:

Code Block
sourcetracker2 -i otu_table.biom -m map.txt -o example6/ --jobs 16
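
To keep the --jobs value in step with the cores requested, the submission script can read the count from Slurm instead of hard-coding it. The sketch below assumes the Beartooth form of the command shown above and relies on `SLURM_CPUS_PER_TASK`, which Slurm sets inside a job when `--cpus-per-task` is given. Site-specific settings such as account and time limits are left out, and the command is only executed where `sourcetracker2` is available.

```shell
#!/bin/bash
#SBATCH --nodes=1
#SBATCH --ntasks-per-node=1
#SBATCH --cpus-per-task=16

# Load the application if a module system is present (assumption: Lmod-style).
if type module >/dev/null 2>&1; then
    module load sourcetracker2
fi

# Slurm sets SLURM_CPUS_PER_TASK inside the job; fall back to 1 elsewhere.
JOBS="${SLURM_CPUS_PER_TASK:-1}"

# Print the command, and run it only where sourcetracker2 is on the PATH.
CMD="sourcetracker2 -i otu_table.biom -m map.txt -o example6/ --jobs $JOBS"
echo "$CMD"
if command -v sourcetracker2 >/dev/null 2>&1; then
    $CMD
fi
```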