RoseTTAFold

Overview

  • RoseTTAFold: This package contains the deep learning models and related scripts needed to run RoseTTAFold. The repository is the official implementation of "RoseTTAFold: Accurate prediction of protein structures and interactions using a three-track neural network."

    • GitHub: https://github.com/RosettaCommons/RoseTTAFold

    • PyRosetta: PyRosetta is an interactive Python-based interface to the powerful Rosetta molecular modeling suite. It enables users to design their own custom molecular modeling algorithms using Rosetta sampling methods and energy functions.

Using

The RoseTTAFold environment is a combination of conda environments, Python packages, scripts, commands, and data.

Although the RoseTTAFold environment can be used across a cluster, it is not currently designed to run straightforwardly on a cluster out of the box.

This page suggests how to use RoseTTAFold locally and describes what ARCC infrastructure has provided to make using the provided pipelines more convenient.

Why Locally?

  • During testing we noticed that some of the provided scripts appear to require read and write access to internal child folders. This could affect multiple users running concurrently from the same RoseTTAFold folder.

  • The provided pipeline scripts assume they are being run from within the RoseTTAFold install location.

Getting Started

Within your home or project folder, clone the main Git repository and install the csblast and lddt dependencies:

# Clone python related scripts.
[@blog2 testing]$ git clone https://github.com/RosettaCommons/RoseTTAFold.git
[@blog2 testing]$ cd RoseTTAFold/
# Install csblast and lddt applications.
[@blog2 RoseTTAFold]$ ./install_dependencies.sh

Module and sequence/structure database data

As detailed on the main RoseTTAFold GitHub page, you can download the sequence and structure database data. For convenience, we have already downloaded this data, which is currently over 2.2 TB.

To expose this central location, and to allow future updates, use the module name rosettafold to discover the versions available. Loading the module sets up the ROSETTA_DATA environment variable, which can be used to access this pre-downloaded data - for example:

[salexan5@ttest01 rosettafold]$ module load rosettafold/1.1.0
[salexan5@ttest01 rosettafold]$ ls -R $ROSETTA_DATA
/pfs/tc1/udata/rosettafold/:
bfd  pdb100_2021Mar03  UniRef30_2020_06  weights

/pfs/tc1/udata/rosettafold/bfd:
bfd_metaclust_clu_complete_id30_c90_final_seq.sorted_opt_a3m.ffdata
bfd_metaclust_clu_complete_id30_c90_final_seq.sorted_opt_a3m.ffindex
bfd_metaclust_clu_complete_id30_c90_final_seq.sorted_opt_cs219.ffdata
bfd_metaclust_clu_complete_id30_c90_final_seq.sorted_opt_cs219.ffindex
bfd_metaclust_clu_complete_id30_c90_final_seq.sorted_opt_hhm.ffdata
bfd_metaclust_clu_complete_id30_c90_final_seq.sorted_opt_hhm.ffindex

/pfs/tc1/udata/rosettafold/pdb100_2021Mar03:
LICENSE
pdb100_2021Mar03_a3m.ffdata
pdb100_2021Mar03_a3m.ffindex
pdb100_2021Mar03_cs219.ffdata
pdb100_2021Mar03_cs219.ffindex
pdb100_2021Mar03_hhm.ffdata
pdb100_2021Mar03_hhm.ffindex
pdb100_2021Mar03_pdb.ffdata
pdb100_2021Mar03_pdb.ffindex

/pfs/tc1/udata/rosettafold/UniRef30_2020_06:
UniRef30_2020_06.md5sums
UniRef30_2020_06_a3m.ffdata
UniRef30_2020_06_a3m.ffindex
UniRef30_2020_06_cs219.ffdata
UniRef30_2020_06_cs219.ffindex
UniRef30_2020_06_hhm.ffdata
UniRef30_2020_06_hhm.ffindex

/pfs/tc1/udata/rosettafold/weights:
RF2t.pt  Rosetta-DL_LICENSE.txt  RoseTTAFold_e2e.pt  RoseTTAFold_pyrosetta.pt
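One way to use these central copies from your own clone - assuming the pipeline scripts look for these folder names relative to the install location, which you should verify against your checkout - is to symlink them in, rather than re-downloading 2.2 TB of data:

```shell
# From inside your RoseTTAFold clone, point the data folders the pipeline
# scripts expect at the centrally pre-downloaded copies. ROSETTA_DATA is set
# by `module load rosettafold/1.1.0`; fall back to the documented path if unset.
ROSETTA_DATA="${ROSETTA_DATA:-/pfs/tc1/udata/rosettafold}"
for d in bfd pdb100_2021Mar03 UniRef30_2020_06 weights; do
    ln -sfn "$ROSETTA_DATA/$d" "$d"
done
```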

Conda Environments

The RoseTTAFold pipeline appears to be based upon two conda environments, which ARCC infrastructure has pre-built.

To use them, first module load miniconda3/23.1.0 and then activate the environment you require - for example:

conda activate /apps/u/opt/conda-envs/rosettafold/1.1.0/RoseTTAFold

This conda environment provides access to the pyrosetta Python package.
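A quick sanity check that the activated environment, rather than a system Python, is supplying the package - the snippet below only reports, it does not fail, so it is safe to run anywhere:

```shell
python - <<'EOF'
# Report whether pyrosetta resolves from the currently active environment.
try:
    import pyrosetta
    print("pyrosetta OK:", pyrosetta.__file__)
except ImportError:
    print("pyrosetta NOT found - is the conda environment activated?")
EOF
```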

Provided pipeline script updates

As detailed on the main RoseTTAFold page: “The modeling pipeline provided here (run_pyrosetta_ver.sh/run_e2e_ver.sh) is a kind of guidelines to show how RoseTTAFold works.”

If you wish to use the provided pipelines on the Beartooth cluster using the pre-built conda environments and centralized data, you will need to modify the above scripts accordingly.
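As a sketch of what a batch submission might look like - the account placeholder, resource numbers, and paths below are illustrative assumptions, not prescribed values; adjust them to your project and to the ARCC-modified scripts:

```shell
#!/bin/bash
#SBATCH --account=<your-project>     # placeholder: your ARCC project name
#SBATCH --time=08:00:00
#SBATCH --cpus-per-task=8
#SBATCH --mem=64G

# Load site modules; the rosettafold module sets ROSETTA_DATA.
module load miniconda3/23.1.0 rosettafold/1.1.0

# Run from within your own RoseTTAFold clone, as the scripts assume.
cd "$HOME/RoseTTAFold"
./run_e2e_ver_arcc.sh input.fa output_dir
```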

If you are creating your own pipelines, you will not be able to successfully run the provided Python scripts without activating the associated conda environment.

Multicore

Since the modeling pipeline is made up of a series of commands, you will need to inspect each command to understand its multicore capabilities.

As a starting point, take a look through the run_[e2e/pyrosetta]_ver_arcc.sh scripts, within which you'll see that the following variables are defined at the top:

These are then passed as arguments into the commands called within the script.

The values you use within your own scripts must match what you request via your salloc/sbatch calls.
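For example - using hypothetical variable names, since the exact names in your copy of the scripts may differ - the Slurm request and the script values should agree:

```shell
# Slurm request (interactive): 8 cores and 64 GB, e.g.
#   salloc --cpus-per-task=8 --mem=64G
# Matching values at the top of your copy of run_e2e_ver_arcc.sh
# (variable names are illustrative; check your script):
CPU=8
MEM=64
echo "Requesting $CPU cores and ${MEM}G to match the script settings"
```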

Known Issue with miniconda3

There is a known issue loading miniconda3 and then creating an interactive salloc session:

To resolve this, call salloc first, and then perform the module load miniconda3:
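In other words, the working order is the following (prompts mirror those used elsewhere on this page; the account placeholder and time limit are illustrative):

```shell
# Request the interactive session first...
[salexan5@blog1 ~]$ salloc --account=<your-project> --time=01:00:00
# ...then load miniconda3 on the allocated node and activate the environment:
[salexan5@ttest01 ~]$ module load miniconda3/23.1.0
[salexan5@ttest01 ~]$ conda activate /apps/u/opt/conda-envs/rosettafold/1.1.0/RoseTTAFold
```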