RoseTTAFold
Overview
This package contains the deep learning models and related scripts needed to run RoseTTAFold. The repository is the official implementation of RoseTTAFold: Accurate prediction of protein structures and interactions using a 3-track network.
Using
The RoseTTAFold environment is a combination of conda environments, Python packages, scripts, commands, and data.
Although the RoseTTAFold environment can be used across a cluster, it is not currently designed to run on a cluster straightforwardly out of the box.
This page suggests how to use RoseTTAFold locally and describes what ARCC infrastructure has provided to make the supplied pipelines more convenient to use.
Why Locally?
During testing we have noticed that some of the provided scripts appear to require read and write access to internal child folders. This could cause problems when multiple users run concurrently from the same RoseTTAFold folder.
The provided pipeline scripts assume they are being run from within the RoseTTAFold install location.
Getting Started
Within your home or project folder, you will need to clone the main Git repository and install the csblast and lddt dependencies:
# Clone python related scripts.
[@blog2 testing]$ git clone https://github.com/RosettaCommons/RoseTTAFold.git
[@blog2 testing]$ cd RoseTTAFold/
# Install csblast and lddt applications.
[@blog2 RoseTTAFold]$ ./install_dependencies.sh
Module and sequence/structure database data
As detailed on the main RoseTTAFold GitHub page, you can download the sequence and structure database data. For convenience, ARCC has already downloaded this data, which is currently over 2.2 TB.
To expose this central location, and to allow future updates, use the module name rosettafold to discover the versions available. Loading the module sets the ROSETTA_DATA environment variable, which can be used to access this pre-downloaded data.
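For example, a session might look like the following (a sketch only; check which versions of the rosettafold module are actually available on the system):

```
# Load the rosettafold module to expose the shared data location.
[@blog2 ~]$ module load rosettafold
# ROSETTA_DATA now points at the centrally pre-downloaded databases.
[@blog2 ~]$ echo $ROSETTA_DATA
```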
Conda Environments
The RoseTTAFold pipeline is based upon two conda environments that ARCC infrastructure has pre-built.
To use them, you will first need to module load miniconda3/23.1.0 and then activate one, for example:
conda activate /apps/u/opt/conda-envs/rosettafold/1.1.0/RoseTTAFold
This conda environment provides access to the pyrosetta Python package.
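As a quick sanity check of the activated environment (a sketch; the exact prompt and path depend on your session):

```
[@blog2 ~]$ conda activate /apps/u/opt/conda-envs/rosettafold/1.1.0/RoseTTAFold
(RoseTTAFold) [@blog2 ~]$ python -c "import pyrosetta"
```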
Provided pipeline script updates
As detailed on the main RoseTTAFold page “The modeling pipeline provided here (run_pyrosetta_ver.sh/run_e2e_ver.sh) is a kind of guidelines to show how RoseTTAFold works.”
If you wish to use the provided pipelines on the Beartooth cluster, using the pre-built conda environments and centralized data, you will need to modify these scripts.
If you are creating your own pipelines, you will not be able to run the provided Python scripts successfully without activating the associated conda environment.
Multicore
Since the modeling pipeline is made up of a series of commands, you’ll need to inspect each of these commands to understand their multicore capabilities.
As a starting point, take a look through the run_[e2e/pyrosetta]_ver_arcc.sh scripts, within which you'll see that a number of variables are defined at the top.
These are then passed as arguments into the commands called within the script.
The values you use within your own scripts must match what you request via your salloc/sbatch calls.
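As a sketch, the idea is that the core/memory values defined in the script and the resources you request from Slurm stay in sync. The variable names and values below are illustrative only; check the top of the run_[e2e/pyrosetta]_ver_arcc.sh scripts for what they actually define:

```shell
# Illustrative values only -- use whatever the ARCC scripts define.
CPU=8    # number of cores the pipeline commands are told to use
MEM=64   # memory (in GB) the pipeline commands are told to use

# Your Slurm request should then match, e.g.:
echo "salloc --cpus-per-task=${CPU} --mem=${MEM}G"
```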
Known Issue with miniconda3
There is a known issue with loading miniconda3 and then creating an interactive salloc session.
To resolve this, call salloc first, and then perform the module load miniconda3.
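In other words, the working order looks like this (the salloc options shown are placeholders; use your own account and time settings):

```
# 1. Request the interactive session first.
[@blog2 ~]$ salloc --account=<project> --time=01:00:00
# 2. Then, inside the allocated session, load miniconda3.
[@blog2 ~]$ module load miniconda3/23.1.0
```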