Overview
Using
The RoseTTAFold environment is a combination of conda environments, python packages, scripts, commands and data.
Although the RoseTTAFold environment can be used across a cluster, it is not currently designed to run on one straightforwardly out of the box.
This page suggests how to use it from a local copy and describes what ARCC has provided to make the supplied pipelines more convenient to use.
Getting Started
Within your home or project folder, you will need to clone the main git repository and install the csblast and lddt dependencies:
# Clone python related scripts.
[@blog2 testing]$ git clone https://github.com/RosettaCommons/RoseTTAFold.git
[@blog2 testing]$ cd RoseTTAFold/
# Install csblast and lddt applications.
[@blog2 RoseTTAFold]$ ./install_dependencies.sh
Why Locally? During testing we noticed that some of the provided scripts appear to need read and write access to child folders inside the RoseTTAFold folder. This could cause problems if multiple users run concurrently from the same RoseTTAFold folder.
Module and sequence and structure database data
As detailed on the main RoseTTAFold GitHub page, you can download the sequence and structure database data. For convenience we have already downloaded this data, which currently occupies more than 2.2 TB.
To expose this central location, and to allow for future updates, use the module name rosettafold to discover the versions available. Loading the module sets the ROSETTA_DATA environment variable, which can be used to access this pre-downloaded data. For example:
data files available:
[salexan5@ttest01 rosettafold]$ module load rosettafold/1.1.0
[salexan5@ttest01 rosettafold]$ ls -R $ROSETTA_DATA
/pfs/tc1/udata/rosettafold/:
bfd pdb100_2021Mar03 UniRef30_2020_06 weights
/pfs/tc1/udata/rosettafold/bfd:
bfd_metaclust_clu_complete_id30_c90_final_seq.sorted_opt_a3m.ffdata bfd_metaclust_clu_complete_id30_c90_final_seq.sorted_opt_cs219.ffindex
bfd_metaclust_clu_complete_id30_c90_final_seq.sorted_opt_a3m.ffindex bfd_metaclust_clu_complete_id30_c90_final_seq.sorted_opt_hhm.ffdata
bfd_metaclust_clu_complete_id30_c90_final_seq.sorted_opt_cs219.ffdata bfd_metaclust_clu_complete_id30_c90_final_seq.sorted_opt_hhm.ffindex
/pfs/tc1/udata/rosettafold/pdb100_2021Mar03:
LICENSE pdb100_2021Mar03_a3m.ffindex pdb100_2021Mar03_cs219.ffindex pdb100_2021Mar03_hhm.ffindex pdb100_2021Mar03_pdb.ffindex
pdb100_2021Mar03_a3m.ffdata pdb100_2021Mar03_cs219.ffdata pdb100_2021Mar03_hhm.ffdata pdb100_2021Mar03_pdb.ffdata
/pfs/tc1/udata/rosettafold/UniRef30_2020_06:
UniRef30_2020_06_a3m.ffdata UniRef30_2020_06_cs219.ffdata UniRef30_2020_06_hhm.ffdata UniRef30_2020_06.md5sums
UniRef30_2020_06_a3m.ffindex UniRef30_2020_06_cs219.ffindex UniRef30_2020_06_hhm.ffindex
/pfs/tc1/udata/rosettafold/weights:
RF2t.pt Rosetta-DL_LICENSE.txt RoseTTAFold_e2e.pt RoseTTAFold_pyrosetta.pt
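Before starting a long run it can help to confirm that the files a pipeline expects are visible under $ROSETTA_DATA. The following is a minimal sketch, not part of RoseTTAFold or the rosettafold module: check_rosetta_data is a hypothetical helper name, and the file list is simply a sample taken from the listing above.

```shell
#!/bin/bash
# Sketch: verify that a few expected database and weight files exist under
# a data root. check_rosetta_data is a hypothetical helper, not an ARCC tool.
check_rosetta_data() {
    local root="$1"
    local missing=0
    local f
    for f in \
        "UniRef30_2020_06/UniRef30_2020_06_a3m.ffdata" \
        "bfd/bfd_metaclust_clu_complete_id30_c90_final_seq.sorted_opt_a3m.ffdata" \
        "pdb100_2021Mar03/pdb100_2021Mar03_a3m.ffdata" \
        "weights/RoseTTAFold_e2e.pt" \
        "weights/RoseTTAFold_pyrosetta.pt"
    do
        if [ ! -e "$root/$f" ]; then
            # Report each missing file on stderr and remember the failure.
            echo "missing: $root/$f" >&2
            missing=1
        fi
    done
    return "$missing"
}

# Usage, after `module load rosettafold/1.1.0` has set ROSETTA_DATA:
#   check_rosetta_data "$ROSETTA_DATA" && echo "data looks complete"
```

The function returns a non-zero status if any listed file is absent, so it can guard the start of a job script.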
Conda Environments
The RoseTTAFold pipeline appears to be based upon two conda environments, both of which ARCC has pre-built.
To use them, you will first need to module load miniconda3/23.1.0 and then activate one or the other:
conda activate /apps/u/opt/conda-envs/rosettafold/1.1.0/RoseTTAFold
Available RoseTTAFold Related Commands
2to3 chkparse ffindex_unpack infocmp nettle-hash psicc tabs webpmux
2to3-3.8 cjpeg ffmpeg infotocap nettle-lfib-stream psipass2 tclsh wheel
a3m_database_extract clear ffprobe instmodsh nettle-pbkdf2 psipred tclsh8.6 wish
a3m_database_filter community formatdb jpegtran ninja psktool tic wish8.6
a3m_database_reduce convert-caffe2-to-onnx formatrpsdb jpgicc ocsptool ptar tiff2bw wrjpgcom
a3m_extract convert-onnx-to-caffe2 freetype-config json_pp openssl ptardiff tiff2pdf x86_64-conda_cos6-linux-gnu-ld
a3m_reduce copymat gnutls-cli lame pal2rgb ptargrep tiff2ps x86_64-conda-linux-gnu-ld
asn1Coding corelist gnutls-cli-debug libnetcfg perl pydoc tiff2rgba xsubpp
asn1Decoding cpan gnutls-serv libpng16-config perl5.26.2 pydoc3 tiffcmp xz
asn1Parser c_rehash h264dec libpng-config perlbug pydoc3.8 tiffcp xzcat
bl2seq cstranslate h264enc linkicc perldoc python tiffcrop xzcmp
blastall djpeg h2ph lz4 perlivp python3 tiffdither xzdec
blastclust enc2xs h2xs lz4c perlthanks python3.8 tiffdump xzdiff
blastpgp encguess hhalign lz4cat piconv python3.8-config tiffinfo xzegrep
bunzip2 f2py hhalign_omp lzcat pip python3-config tiffmedian xzfgrep
bzcat f2py3 hhblits lzcmp pip3 raw2tiff tiffset xzgrep
bzcmp f2py3.8 hhblits_ca3m lzdiff pkcs1-conv rdjpgcom tiffsplit xzless
bzdiff fastacmd hhblits_omp lzegrep pl2pm reset tificc xzmore
bzegrep fax2ps hhconsensus lzfgrep pngfix rpsblast toe zipdetails
bzfgrep fax2tiff hhfilter lzgrep png-fix-itxt run_psipred.pl tput zstd
bzgrep ffindex_apply hhmake lzless pod2html seedtop tqdm zstdcat
bzip2 ffindex_build hhsearch lzma pod2man seq2mtx transicc zstdgrep
bzip2recover ffindex_from_fasta hhsearch_omp lzmadec pod2text sexp-conv tset zstdless
bzless ffindex_from_fasta_with_split iconv lzmainfo pod2usage shasum unlz4 zstdmt
bzmore ffindex_get idle3 lzmore podchecker splain unlzma
captoinfo ffindex_modify idle3.8 makemat podselect sqlite3 unxz
certtool ffindex_order idn2 megablast ppm2tiff sqlite3_analyzer unzstd
chardetect ffindex_reduce impala ncursesw6-config prove srptool webpinfo
conda activate /apps/u/opt/conda-envs/rosettafold/1.1.0/Folding
Available Folding Related Commands
2to3 env_parallel.ksh h2ph h5repack lzless perlivp pydoc3.7 streamzip x86_64-conda_cos7-linux-gnu-ld
2to3-3.7 env_parallel.mksh h2xs h5repart lzma perlthanks python tabs x86_64-conda-linux-gnu-ld
acountry env_parallel.pdksh h52gif h5stat lzmadec piconv python3 tclsh xsubpp
adig env_parallel.sh h5c++ h5unjam lzmainfo pip python3.7 tclsh8.6 xz
ahost env_parallel.tcsh h5cc h5watch lzmore pip3 python3.7-config tensorboard xzcat
captoinfo env_parallel.zsh h5clear idle3 markdown_py pl2pm python3.7m tflite_convert xzcmp
clear f2py h5copy idle3.7 matplotlib pod2html python3.7m-config tf_upgrade_v2 xzdec
corelist f2py3 h5debug infocmp ncursesw6-config pod2man python3-config tic xzdiff
cpan f2py3.7 h5diff infotocap niceload pod2text pyvenv toco xzegrep
c_rehash fftwf-wisdom h5dump instmodsh openssl pod2usage pyvenv-3.7 toco_from_protos xzfgrep
enc2xs fftwl-wisdom h5fc json_pp parallel podchecker reset toe xzgrep
encguess fftw-wisdom h5format_convert libnetcfg parcat protoc saved_model_cli tput xzless
env_parallel fftw-wisdom-to-conf h5import lzcat parset prove sem tset xzmore
env_parallel.ash freeze_graph h5jam lzcmp parsort ptar shasum unlzma zipdetails
env_parallel.bash gdbm_dump h5ls lzdiff perl ptardiff splain unxz
env_parallel.csh gdbm_load h5mkgrp lzegrep perl5.34.0 ptargrep sql wheel
env_parallel.dash gdbmtool h5perf_serial lzfgrep perlbug pydoc sqlite3 wish
env_parallel.fish gif2h5 h5redeploy lzgrep perldoc pydoc3 sqlite3_analyzer wish8.6
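Since each environment exposes a different set of commands, a quick check that the tools you need are on your PATH can catch an activation mistake early. This is a sketch under assumptions: require_cmds is a hypothetical helper, and the command names in the usage comments are picked from the two lists above.

```shell
#!/bin/bash
# Sketch: confirm that every named command can be found on PATH.
# require_cmds is a hypothetical helper, not provided by either environment.
require_cmds() {
    local missing=0
    local c
    for c in "$@"; do
        if ! command -v "$c" >/dev/null 2>&1; then
            echo "not on PATH: $c" >&2
            missing=1
        fi
    done
    return "$missing"
}

# Usage after activating the RoseTTAFold environment:
#   require_cmds hhblits hhsearch hhfilter psipred python
# Usage after activating the Folding environment:
#   require_cmds python parallel
```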
Provided pipeline script updates
As detailed on the main RoseTTAFold page “The modeling pipeline provided here (run_pyrosetta_ver.sh/run_e2e_ver.sh) is a kind of guidelines to show how RoseTTAFold works.”
If you wish to use the provided pipelines on the Beartooth cluster with the pre-built conda environments and centralized data, you will need to modify the pipeline scripts named above, as well as input_prep/make_msa.sh, as follows:
input_prep/make_msa.sh
Line 14:
From:
# sequence databases
declare -a DATABASES=( \
"$PIPEDIR/UniRef30_2020_06/UniRef30_2020_06" \
"$PIPEDIR/bfd/bfd_metaclust_clu_complete_id30_c90_final_seq.sorted_opt")
To:
# sequence databases
declare -a DATABASES=( \
"$ROSETTA_DATA/UniRef30_2020_06/UniRef30_2020_06" \
"$ROSETTA_DATA/bfd/bfd_metaclust_clu_complete_id30_c90_final_seq.sorted_opt")
run_e2e_ver_arcc.sh
Line 30
From: conda activate RoseTTAFold
To : conda activate /apps/u/opt/conda-envs/rosettafold/1.1.0/RoseTTAFold
Line 54
From: DB="$PIPEDIR/pdb100_2021Mar03/pdb100_2021Mar03"
To : DB="$ROSETTA_DATA/pdb100_2021Mar03/pdb100_2021Mar03"
Line 71:
From: -m $PIPEDIR/weights
To : -m $ROSETTA_DATA/weights
run_pyrosetta_ver_arcc.sh
Line 30
From: conda activate RoseTTAFold
To : conda activate /apps/u/opt/conda-envs/rosettafold/1.1.0/RoseTTAFold
Line 54
From: DB="$PIPEDIR/pdb100_2021Mar03/pdb100_2021Mar03"
To : DB="$ROSETTA_DATA/pdb100_2021Mar03/pdb100_2021Mar03"
Line 71:
From: -m $PIPEDIR/weights
To : -m $ROSETTA_DATA/weights
Line 85
From: conda activate folding
To : conda activate /apps/u/opt/conda-envs/rosettafold/1.1.0/Folding
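Because line numbers can shift between versions of the upstream scripts, the same edits can be applied by pattern rather than by position. The sketch below assumes the scripts contain the exact "From" strings shown above. patch_pipeline_script and ENV_ROOT are hypothetical names introduced for illustration, and you should review the result (a .bak copy is kept) before running anything.

```shell
#!/bin/bash
# Sketch: rewrite data paths and conda environment names in a pipeline script.
# ENV_ROOT and patch_pipeline_script are illustrative names, not ARCC tools.
ENV_ROOT="/apps/u/opt/conda-envs/rosettafold/1.1.0"

patch_pipeline_script() {
    local script="$1"
    # -i.bak edits in place and keeps the original as <script>.bak.
    # Only the four data folders are redirected; other $PIPEDIR uses
    # (for example calls to helper scripts) are left untouched.
    sed -i.bak \
        -e 's|\$PIPEDIR/UniRef30_2020_06|$ROSETTA_DATA/UniRef30_2020_06|' \
        -e 's|\$PIPEDIR/bfd|$ROSETTA_DATA/bfd|' \
        -e 's|\$PIPEDIR/pdb100_2021Mar03|$ROSETTA_DATA/pdb100_2021Mar03|' \
        -e 's|\$PIPEDIR/weights|$ROSETTA_DATA/weights|' \
        -e "s|^\([[:space:]]*conda activate \)RoseTTAFold[[:space:]]*\$|\1$ENV_ROOT/RoseTTAFold|" \
        -e "s|^\([[:space:]]*conda activate \)folding[[:space:]]*\$|\1$ENV_ROOT/Folding|" \
        "$script"
}

# Usage, from inside your RoseTTAFold clone:
#   patch_pipeline_script input_prep/make_msa.sh
#   patch_pipeline_script run_e2e_ver_arcc.sh
#   patch_pipeline_script run_pyrosetta_ver_arcc.sh
```

Running the function twice is harmless, since the patterns no longer match once a line has been rewritten.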
If you are creating your own pipelines, you will not be able to run the provided python scripts successfully without first activating the associated conda environment.
Multicore
Since the modeling pipeline is made up of a series of commands, you’ll need to inspect each of these commands to understand their multicore capabilities.
As a starting point, take a look through the run_[e2e/pyrosetta]_ver_arcc.sh scripts, within which you'll see that the following variables are defined at the top:
CPU="8" # number of CPUs to use
MEM="64" # max memory (in GB)
These are then passed as arguments to the commands called within the script.
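If you run the scripts inside a Slurm job, one option is to derive these two values from the allocation instead of hard-coding them. This is only a sketch under assumptions: pick_cpu and pick_mem_gb are hypothetical helper names, and it relies on Slurm setting SLURM_CPUS_PER_TASK (when --cpus-per-task is requested) and SLURM_MEM_PER_NODE (in MB, when --mem is requested). The fallbacks are the defaults shown above.

```shell
#!/bin/bash
# Sketch: choose CPU and MEM from the Slurm allocation when available.
# pick_cpu and pick_mem_gb are illustrative helpers, not part of RoseTTAFold.
pick_cpu() {
    # Fall back to 8 cores when not running under --cpus-per-task.
    echo "${SLURM_CPUS_PER_TASK:-8}"
}

pick_mem_gb() {
    if [ -n "${SLURM_MEM_PER_NODE:-}" ]; then
        # Slurm reports this value in MB; the scripts expect GB.
        echo $(( SLURM_MEM_PER_NODE / 1024 ))
    else
        # Fall back to 64 GB when --mem was not requested.
        echo 64
    fi
}

CPU="$(pick_cpu)"     # number of CPUs to use
MEM="$(pick_mem_gb)"  # max memory (in GB)
echo "CPU=$CPU MEM=$MEM"
```

Whether a larger CPU value helps still depends on the multicore capability of each individual command.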
Known Issue within miniconda3
There is a known issue when loading miniconda3 and then creating an interactive salloc session:
Known: undefined symbol: EVP_KDF_ctrl
[]$ module load miniconda3/23.1.0
[]$ salloc -A arcc -t 5:00
salloc: Granted job allocation 7174905
salloc: Waiting for resource configuration
salloc: Nodes ttest01 are ready for job
flatpak: symbol lookup error: /lib64/libk5crypto.so.3: undefined symbol: EVP_KDF_ctrl, version OPENSSL_1_1_1b
To resolve this, call salloc first, and then perform the module load miniconda3 from within the interactive session.
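The working order can be sketched as follows, reusing the account, time limit and module version from the failing example above. The conda activate line is an optional extra step, shown with one of the environment paths from this page.

```shell
# 1. Request the interactive allocation first.
salloc -A arcc -t 5:00

# 2. Once the allocation is granted, load miniconda3 inside the session.
module load miniconda3/23.1.0

# 3. Optionally activate one of the pre-built environments.
conda activate /apps/u/opt/conda-envs/rosettafold/1.1.0/RoseTTAFold
```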