Alphafold

Overview

  • DeepMind: AlphaFold can accurately predict 3D models of protein structures and has the potential to accelerate research in every field of biology.

Documentation

  • AlphaFold Protein Structure Database: Developed by DeepMind and EMBL-EBI

  • AlphaFold Colab: "This Colab notebook allows you to easily predict the structure of a protein using a slightly simplified version of AlphaFold v2.1.0."


  • ARCC are NOT domain experts on the science behind using AlphaFold. We can provide best-effort support for errors you come across, but not for the use of the flags and databases used by AlphaFold.

    • Please share any feedback you have, and we will develop this page for the wider community.

Using

Use the module name alphafold to discover versions available and to load the application.

Loading a particular alphafold module version will appropriately set the ALPHADB and ALPHABIN environment variables and load the associated singularity module version.
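For example (a minimal sketch; the version shown is an assumption, use whichever version the module system reports):

# Discover available versions (module avail or module spider, depending
# on your module system), then load one:
module spider alphafold
module load alphafold/2.3.0

# Confirm what the module set up:
echo $ALPHADB    # location of the sequence/structure databases
echo $ALPHABIN   # location of the Singularity image(s)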

Running Alphafold

AlphaFold is distributed as a Docker image, which we have converted to a Singularity image, so it must be run using Singularity.

Flag Help

As versions of AlphaFold update, the available options will change. After loading the alphafold module, a full list of flags can be found by running:

singularity run -B .:/etc $ALPHABIN/alphafold220.sif --help
singularity run -B .:/etc $ALPHABIN/alphafold220.sif --helpfull

Data Files and Examples

Version: 2.3.0

Data Tree:

├── bfd
│   ├── bfd_metaclust_clu_complete_id30_c90_final_seq.sorted_opt_a3m.ffdata
│   ├── bfd_metaclust_clu_complete_id30_c90_final_seq.sorted_opt_a3m.ffindex
│   ├── bfd_metaclust_clu_complete_id30_c90_final_seq.sorted_opt_cs219.ffdata
│   ├── bfd_metaclust_clu_complete_id30_c90_final_seq.sorted_opt_cs219.ffindex
│   ├── bfd_metaclust_clu_complete_id30_c90_final_seq.sorted_opt_hhm.ffdata
│   └── bfd_metaclust_clu_complete_id30_c90_final_seq.sorted_opt_hhm.ffindex
├── mgnify
│   └── mgy_clusters_2022_05.fa
├── params
│   ├── LICENSE
│   ├── params_model_1_multimer_v3.npz
│   ├── params_model_1.npz
│   ├── params_model_1_ptm.npz
│   ├── params_model_2_multimer_v3.npz
│   ├── params_model_2.npz
│   ├── params_model_2_ptm.npz
│   ├── params_model_3_multimer_v3.npz
│   ├── params_model_3.npz
│   ├── params_model_3_ptm.npz
│   ├── params_model_4_multimer_v3.npz
│   ├── params_model_4.npz
│   ├── params_model_4_ptm.npz
│   ├── params_model_5_multimer_v3.npz
│   ├── params_model_5.npz
│   └── params_model_5_ptm.npz
├── pdb70
│   ├── md5sum
│   ├── pdb70_a3m.ffdata
│   ├── pdb70_a3m.ffindex
│   ├── pdb70_clu.tsv
│   ├── pdb70_cs219.ffdata
│   ├── pdb70_cs219.ffindex
│   ├── pdb70_hhm.ffdata
│   ├── pdb70_hhm.ffindex
│   └── pdb_filter.dat
├── pdb_mmcif
│   ├── mmcif_files
│   └── obsolete.dat
├── pdb_seqres
│   └── pdb_seqres.txt
├── uniprot
│   └── uniprot.fasta
├── uniref30
│   ├── UniRef30_2021_03_a3m.ffdata
│   ├── UniRef30_2021_03_a3m.ffindex
│   ├── UniRef30_2021_03_cs219.ffdata
│   ├── UniRef30_2021_03_cs219.ffindex
│   ├── UniRef30_2021_03_hhm.ffdata
│   ├── UniRef30_2021_03_hhm.ffindex
│   └── UniRef30_2021_03.md5sums
└── uniref90
    └── uniref90.fasta

10 directories, 43 files

Example:

singularity run -B .:/etc --nv $ALPHABIN/alphafold.sif \
  --fasta_paths=T1050.fasta \
  --output_dir=./<output_folder> \
  --model_preset=monomer \
  --db_preset=full_dbs \
  --bfd_database_path=$ALPHADB/bfd/bfd_metaclust_clu_complete_id30_c90_final_seq.sorted_opt \
  --pdb70_database_path=$ALPHADB/pdb70/pdb70 \
  --uniref30_database_path=$ALPHADB/uniref30/UniRef30_2021_03 \
  --max_template_date=2020-05-14 \
  --use_gpu_relax=<False|True> \
  --data_dir=$ALPHADB \
  --uniref90_database_path=$ALPHADB/uniref90/uniref90.fasta \
  --mgnify_database_path=$ALPHADB/mgnify/mgy_clusters_2022_05.fa \
  --template_mmcif_dir=$ALPHADB/pdb_mmcif/mmcif_files \
  --obsolete_pdbs_path=$ALPHADB/pdb_mmcif/obsolete.dat

Version: 2.2.0

Data Tree:

├── bfd
│   ├── bfd_metaclust_clu_complete_id30_c90_final_seq.sorted_opt_a3m.ffdata
│   ├── bfd_metaclust_clu_complete_id30_c90_final_seq.sorted_opt_a3m.ffindex
│   ├── bfd_metaclust_clu_complete_id30_c90_final_seq.sorted_opt_cs219.ffdata
│   ├── bfd_metaclust_clu_complete_id30_c90_final_seq.sorted_opt_cs219.ffindex
│   ├── bfd_metaclust_clu_complete_id30_c90_final_seq.sorted_opt_hhm.ffdata
│   └── bfd_metaclust_clu_complete_id30_c90_final_seq.sorted_opt_hhm.ffindex
├── mgnify
│   └── mgy_clusters_2018_12.fa
├── params
│   ├── LICENSE
│   ├── params_model_1_multimer_v2.npz
│   ├── params_model_1.npz
│   ├── params_model_1_ptm.npz
│   ├── params_model_2_multimer_v2.npz
│   ├── params_model_2.npz
│   ├── params_model_2_ptm.npz
│   ├── params_model_3_multimer_v2.npz
│   ├── params_model_3.npz
│   ├── params_model_3_ptm.npz
│   ├── params_model_4_multimer_v2.npz
│   ├── params_model_4.npz
│   ├── params_model_4_ptm.npz
│   ├── params_model_5_multimer_v2.npz
│   ├── params_model_5.npz
│   └── params_model_5_ptm.npz
├── pdb70
│   ├── md5sum
│   ├── pdb70_a3m.ffdata
│   ├── pdb70_a3m.ffindex
│   ├── pdb70_clu.tsv
│   ├── pdb70_cs219.ffdata
│   ├── pdb70_cs219.ffindex
│   ├── pdb70_hhm.ffdata
│   ├── pdb70_hhm.ffindex
│   └── pdb_filter.dat
├── pdb_mmcif
│   ├── mmcif_files
│   └── obsolete.dat
├── pdb_seqres
│   └── pdb_seqres.txt
├── small_bfd
│   └── bfd-first_non_consensus_sequences.fasta
├── uniclust30
│   └── uniclust30_2018_08
├── uniprot
│   └── uniprot.fasta
└── uniref90
    └── uniref90.fasta

12 directories, 37 files

Example:

singularity run -B .:/etc --nv $ALPHABIN/alphafold.sif \
  --use_gpu_relax=<False|True> \
  --data_dir=$ALPHADB \
  --bfd_database_path=$ALPHADB/bfd/bfd_metaclust_clu_complete_id30_c90_final_seq.sorted_opt \
  --pdb70_database_path=$ALPHADB/pdb70/pdb70 \
  --uniclust30_database_path=$ALPHADB/uniclust30/uniclust30_2018_08/uniclust30_2018_08 \
  --uniref90_database_path=$ALPHADB/uniref90/uniref90.fasta \
  --mgnify_database_path=$ALPHADB/mgnify/mgy_clusters_2018_12.fa \
  --template_mmcif_dir=$ALPHADB/pdb_mmcif/mmcif_files \
  --obsolete_pdbs_path=$ALPHADB/pdb_mmcif/obsolete.dat \
  --max_template_date=2020-05-14 \
  --output_dir=./<output_folder> \
  --fasta_paths=T1050.fasta \
  --model_preset=monomer

Our test file T1050.fasta looks like this:

>T1050 A7LXT1, Bacteroides Ovatus, 779 residues|
MASQSYLFKHLEVSDGLSNNSVNTIYKDRDGFMWFGTTTGLNRYDGYTFKIYQHAENEPGSLPDNYITDIVEMPDGRFWINTARGYVLFDKERDYFITDVTGFMKNLESWGVPEQVFVDREGNTWLSVAGEGCYRYKEGGKRLFFSYTEHSLPEYGVTQMAECSDGILLIYNTGLLVCLDRATLAIKWQSDEIKKYIPGGKTIELSLFVDRDNCIWAYSLMGIWAYDCGTKSWRTDLTGIWSSRPDVIIHAVAQDIEGRIWVGKDYDGIDVLEKETGKVTSLVAHDDNGRSLPHNTIYDLYADRDGVMWVGTYKKGVSYYSESIFKFNMYEWGDITCIEQADEDRLWLGTNDHGILLWNRSTGKAEPFWRDAEGQLPNPVVSMLKSKDGKLWVGTFNGGLYCMNGSQVRSYKEGTGNALASNNVWALVEDDKGRIWIASLGGGLQCLEPLSGTFETYTSNNSALLENNVTSLCWVDDNTLFFGTASQGVGTMDMRTREIKKIQGQSDSMKLSNDAVNHVYKDSRGLVWIATREGLNVYDTRRHMFLDLFPVVEAKGNFIAAITEDQERNMWVSTSRKVIRVTVASDGKGSYLFDSRAYNSEDGLQNCDFNQRSIKTLHNGIIAIGGLYGVNIFAPDHIRYNKMLPNVMFTGLSLFDEAVKVGQSYGGRVLIEKELNDVENVEFDYKQNIFSVSFASDNYNLPEKTQYMYKLEGFNNDWLTLPVGVHNVTFTNLAPGKYVLRVKAINSDGYVGIKEATLGIVVNPPFKLAAALQHHHHHH
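As a quick sanity check (plain shell, nothing AlphaFold-specific), you can confirm the sequence length matches the 779 residues stated in the header:

grep -v '^>' T1050.fasta | tr -d '\n' | wc -c    # should print 779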

If you have alternative examples, please share.

TPU Warnings

TPUs are Google's specialized ASICs and are thus not available on our NVIDIA GPU nodes. The following warnings can be ignored:

I0927 02:45:29.788146 47769932949376 tpu_client.py:54] Starting the local TPU driver.
I0927 02:45:29.829137 47769932949376 xla_bridge.py:212] Unable to initialize backend 'tpu_driver': Not found: Unable to find driver in registry given worker: local://
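If you want to hide these known-harmless messages while scanning a job's output, a simple filter works (a sketch; slurm-*.out assumes Slurm's default output file naming):

grep -v -e "tpu_driver" -e "TpuPlatform" slurm-*.out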

CPU Mode:

Slurm parameters and alphafold flag:

#SBATCH --nodes=1
#SBATCH --ntasks-per-node=1
#SBATCH --cpus-per-task=8
#SBATCH --mem=64G

--use_gpu_relax=False

The --mem value will depend on your data; please share your findings/observations.
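Putting these pieces together, a minimal CPU-only batch script might look like the following sketch (the job name, walltime, and module version are assumptions; the AlphaFold flags follow the 2.2.0 example above):

#!/bin/bash
#SBATCH --job-name=alphafold-cpu   # assumption: any name you like
#SBATCH --nodes=1
#SBATCH --ntasks-per-node=1
#SBATCH --cpus-per-task=8
#SBATCH --mem=64G
#SBATCH --time=24:00:00            # assumption: adjust to your expected run time

module load alphafold/2.2.0        # assumption: use the version you need

# --nv is omitted because no GPU is requested in CPU mode.
singularity run -B .:/etc $ALPHABIN/alphafold.sif \
  --use_gpu_relax=False \
  --data_dir=$ALPHADB \
  --bfd_database_path=$ALPHADB/bfd/bfd_metaclust_clu_complete_id30_c90_final_seq.sorted_opt \
  --pdb70_database_path=$ALPHADB/pdb70/pdb70 \
  --uniclust30_database_path=$ALPHADB/uniclust30/uniclust30_2018_08/uniclust30_2018_08 \
  --uniref90_database_path=$ALPHADB/uniref90/uniref90.fasta \
  --mgnify_database_path=$ALPHADB/mgnify/mgy_clusters_2018_12.fa \
  --template_mmcif_dir=$ALPHADB/pdb_mmcif/mmcif_files \
  --obsolete_pdbs_path=$ALPHADB/pdb_mmcif/obsolete.dat \
  --max_template_date=2020-05-14 \
  --output_dir=./output \
  --fasta_paths=T1050.fasta \
  --model_preset=monomer

For the GPU mode described below, add #SBATCH --gres=gpu:p100:1 and #SBATCH --partition=teton-gpu, pass --nv to singularity run, and set --use_gpu_relax=True.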

Notice that neither GPUs nor TPUs are detected, so the job runs in CPU mode only. When running in CPU mode, your output will contain "Very slow compile?" messages:

I0927 02:45:29.393396 47769932949376 templates.py:857] Using precomputed obsolete pdbs /pfs/tc1/udata/alphafold/data//pdb_mmcif/obsolete.dat.
I0927 02:45:29.788146 47769932949376 tpu_client.py:54] Starting the local TPU driver.
I0927 02:45:29.829137 47769932949376 xla_bridge.py:212] Unable to initialize backend 'tpu_driver': Not found: Unable to find driver in registry given worker: local://
2022-09-27 02:45:29.888104: E external/org_tensorflow/tensorflow/stream_executor/cuda/cuda_driver.cc:271] failed call to cuInit: CUDA_ERROR_NO_DEVICE: no CUDA-capable device is detected
I0927 02:45:29.888254 47769932949376 xla_bridge.py:212] Unable to initialize backend 'gpu': Failed precondition: No visible GPU devices.
I0927 02:45:29.888404 47769932949376 xla_bridge.py:212] Unable to initialize backend 'tpu': Invalid argument: TpuPlatform is not available.
W0927 02:45:29.888490 47769932949376 xla_bridge.py:215] No GPU/TPU found, falling back to CPU. (Set TF_CPP_MIN_LOG_LEVEL=0 and rerun for more info.)
...
I0927 02:45:34.216590 47769932949376 run_alphafold.py:377] Have 5 models: ['model_1_pred_0', 'model_2_pred_0', 'model_3_pred_0', 'model_4_pred_0', 'model_5_pred_0']
I0927 02:45:34.216804 47769932949376 run_alphafold.py:393] Using random seed 755889359826789066 for the data pipeline
I0927 02:45:34.217036 47769932949376 run_alphafold.py:161] Predicting T1050
I0927 02:45:34.223562 47769932949376 jackhmmer.py:133] Launching subprocess "/usr/bin/jackhmmer -o /dev/null -A /tmp/tmpau6w5qyj/output.sto --noali --F1 0.0005 --F2 5e-05 --F3 5e-07 --incE 0.0001 -E 0.0001 --cpu 8 -N 1 T1050.fasta /pfs/tc1/udata/alphafold/data//uniref90/uniref90.fasta"
I0927 02:45:34.306553 47769932949376 utils.py:36] Started Jackhmmer (uniref90.fasta) query
I0927 02:53:07.632456 47769932949376 utils.py:40] Finished Jackhmmer (uniref90.fasta) query in 453.326 seconds
I0927 02:53:14.111809 47769932949376 jackhmmer.py:133] Launching subprocess "/usr/bin/jackhmmer -o /dev/null -A /tmp/tmpd_7zqsul/output.sto --noali --F1 0.0005 --F2 5e-05 --F3 5e-07 --incE 0.0001 -E 0.0001 --cpu 8 -N 1 T1050.fasta /pfs/tc1/udata/alphafold/data//mgnify/mgy_clusters_2018_12.fa"
I0927 02:53:14.180119 47769932949376 utils.py:36] Started Jackhmmer (mgy_clusters_2018_12.fa) query
I0927 03:00:22.259188 47769932949376 utils.py:40] Finished Jackhmmer (mgy_clusters_2018_12.fa) query in 428.079 seconds
I0927 03:00:45.552681 47769932949376 hhsearch.py:85] Launching subprocess "/usr/bin/hhsearch -i /tmp/tmp9ipsqk6_/query.a3m -o /tmp/tmp9ipsqk6_/output.hhr -maxseq 1000000 -d /pfs/tc1/udata/alphafold/data//pdb70/pdb70"
I0927 03:00:45.663152 47769932949376 utils.py:36] Started HHsearch query
I0927 03:02:16.288284 47769932949376 utils.py:40] Finished HHsearch query in 90.625 seconds
I0927 03:02:22.918217 47769932949376 hhblits.py:128] Launching subprocess "/usr/bin/hhblits -i T1050.fasta -cpu 4 -oa3m /tmp/tmp95ko2qyk/output.a3m -o /dev/null -n 3 -e 0.001 -maxseq 1000000 -realign_max 100000 -maxfilt 100000 -min_prefilter_hits 1000 -d /pfs/tc1/udata/alphafold/data//bfd/bfd_metaclust_clu_complete_id30_c90_final_seq.sorted_opt -d /pfs/tc1/udata/alphafold/data//uniclust30/uniclust30_2018_08/uniclust30_2018_08"
I0927 03:02:23.029443 47769932949376 utils.py:36] Started HHblits query
I0927 04:49:12.117560 47769932949376 utils.py:40] Finished HHblits query in 6409.088 seconds
I0927 04:49:13.281963 47769932949376 templates.py:878] Searching for template for:
MASQSYLFKHLEVSDGLSNNSVNTIYKDRDGFMWFGTTTGLNRYDGYTFKIYQHAENEPGSLPDNYITDIVEMPDGRFWINTARGYVLFDKERDYFITDVTGFMKNLESWGVPEQVFVDREGNTWLSVAGEGCYRYKEGGKRLFFSYTEHSLPEYGVTQMAECSDGILLIYNTGLLVCLDRATLAIKWQSDEIKKYIPGGKTIELSLFVDRDNCIWAYSLMGIWAYDCGTKSWRTDLTGIWSSRPDVIIHAVAQDIEGRIWVGKDYDGIDVLEKETGKVTSLVAHDDNGRSLPHNTIYDLYADRDGVMWVGTYKKGVSYYSESIFKFNMYEWGDITCIEQADEDRLWLGTNDHGILLWNRSTGKAEPFWRDAEGQLPNPVVSMLKSKDGKLWVGTFNGGLYCMNGSQVRSYKEGTGNALASNNVWALVEDDKGRIWIASLGGGLQCLEPLSGTFETYTSNNSALLENNVTSLCWVDDNTLFFGTASQGVGTMDMRTREIKKIQGQSDSMKLSNDAVNHVYKDSRGLVWIATREGLNVYDTRRHMFLDLFPVVEAKGNFIAAITEDQERNMWVSTSRKVIRVTVASDGKGSYLFDSRAYNSEDGLQNCDFNQRSIKTLHNGIIAIGGLYGVNIFAPDHIRYNKMLPNVMFTGLSLFDEAVKVGQSYGGRVLIEKELNDVENVEFDYKQNIFSVSFASDNYNLPEKTQYMYKLEGFNNDWLTLPVGVHNVTFTNLAPGKYVLRVKAINSDGYVGIKEATLGIVVNPPFKLAAALQHHHHHH
I0927 04:49:15.634872 47769932949376 templates.py:268] Found an exact template match 4a2m_B.
I0927 04:49:19.196634 47769932949376 templates.py:268] Found an exact template match 4a2l_F.
I0927 04:49:21.619388 47769932949376 templates.py:268] Found an exact template match 3v9f_B.
...
I0927 04:49:28.551563 47769932949376 pipeline.py:234] Uniref90 MSA size: 10000 sequences.
I0927 04:49:28.551765 47769932949376 pipeline.py:235] BFD MSA size: 4961 sequences.
I0927 04:49:28.551838 47769932949376 pipeline.py:236] MGnify MSA size: 501 sequences.
I0927 04:49:28.551913 47769932949376 pipeline.py:238] Final (deduplicated) MSA size: 15396 sequences.
I0927 04:49:28.552187 47769932949376 pipeline.py:241] Total number of templates (NB: this can include bad templates and is later filtered to top 4): 20.
I0927 04:49:28.709889 47769932949376 run_alphafold.py:190] Running model model_1_pred_0 on T1050
I0927 04:49:43.389027 47769932949376 model.py:166] Running predict with shape(feat) = {'aatype': (4, 779), 'residue_index': (4, 779), 'seq_length': (4,), 'template_aatype': (4, 4, 779), 'template_all_atom_masks': (4, 4, 779, 37), 'template_all_atom_positions': (4, 4, 779, 37, 3), 'template_sum_probs': (4, 4, 1), 'is_distillation': (4,), 'seq_mask': (4, 779), 'msa_mask': (4, 508, 779), 'msa_row_mask': (4, 508), 'random_crop_to_size_seed': (4, 2), 'template_mask': (4, 4), 'template_pseudo_beta': (4, 4, 779, 3), 'template_pseudo_beta_mask': (4, 4, 779), 'atom14_atom_exists': (4, 779, 14), 'residx_atom14_to_atom37': (4, 779, 14), 'residx_atom37_to_atom14': (4, 779, 37), 'atom37_atom_exists': (4, 779, 37), 'extra_msa': (4, 5120, 779), 'extra_msa_mask': (4, 5120, 779), 'extra_msa_row_mask': (4, 5120), 'bert_mask': (4, 508, 779), 'true_msa': (4, 508, 779), 'extra_has_deletion': (4, 5120, 779), 'extra_deletion_value': (4, 5120, 779), 'msa_feat': (4, 508, 779, 49), 'target_feat': (4, 779, 22)}
2022-09-27 04:53:47.880350: E external/org_tensorflow/tensorflow/compiler/xla/service/slow_operation_alarm.cc:55] ********************************
Very slow compile? If you want to file a bug, run with envvar XLA_FLAGS=--xla_dump_to=/tmp/foo and attach the results.
Compiling module jit_apply_fn.149867
********************************
I0927 08:25:45.766484 47769932949376 model.py:176] Output shape was {'distogram': {'bin_edges': (63,), 'logits': (779, 779, 64)}, 'experimentally_resolved': {'logits': (779, 37)}, 'masked_msa': {'logits': (508, 779, 23)}, 'predicted_lddt': {'logits': (779, 50)}, 'structure_module': {'final_atom_mask': (779, 37), 'final_atom_positions': (779, 37, 3)}, 'plddt': (779,), 'ranking_confidence': ()}
I0927 08:25:45.767058 47769932949376 run_alphafold.py:204] Total JAX model model_1_pred_0 on T1050 predict time (includes compilation time, see --benchmark): 12962.4s
I0927 08:26:04.640739 47769932949376 amber_minimize.py:177] alterations info: {'nonstandard_residues': [], 'removed_heterogens': set(), 'missing_residues': {}, 'missing_heavy_atoms': {}, 'missing_terminals': {<Residue 778 (HIS) of chain 0>: ['OXT']}, 'Se_in_MET': [], 'removed_chains': {0: []}}
I0927 08:26:05.683431 47769932949376 amber_minimize.py:408] Minimizing protein, attempt 1 of 100.
I0927 08:26:07.632116 47769932949376 amber_minimize.py:69] Restraining 6213 / 12189 particles.
I0927 08:29:42.808396 47769932949376 amber_minimize.py:177] alterations info: {'nonstandard_residues': [], 'removed_heterogens': set(), 'missing_residues': {}, 'missing_heavy_atoms': {}, 'missing_terminals': {}, 'Se_in_MET': [], 'removed_chains': {0: []}}
I0927 08:29:52.802619 47769932949376 amber_minimize.py:500] Iteration completed: Einit 27057.63 Efinal -16966.08 Time 205.71 s num residue violations 0 num residue exclusions 0
I0927 08:30:07.146561 47769932949376 amber_minimize.py:177] alterations info: {'nonstandard_residues': [], 'removed_heterogens': set(), 'missing_residues': {}, 'missing_heavy_atoms': {}, 'missing_terminals': {<Residue 778 (HIS) of chain 0>: ['OXT']}, 'Se_in_MET': [], 'removed_chains': {0: []}}
I0927 08:30:10.541722 47769932949376 run_alphafold.py:190] Running model model_2_pred_0 on T1050
...

GPU Mode:

With: use_gpu_relax=True

Slurm parameters and alphafold flag:

#SBATCH --nodes=1
#SBATCH --ntasks-per-node=1
#SBATCH --cpus-per-task=8
#SBATCH --mem=64G
#SBATCH --gres=gpu:p100:1
#SBATCH --partition=teton-gpu

--use_gpu_relax=True

The P100s are GPUs, not TPUs, so expect to still see the Unable to initialize backend 'tpu' message. The job will finish with a Final timings message that lists timings for each of the models reported at the start (five in this example).

I0927 03:04:41.707442 46916878432128 tpu_client.py:54] Starting the local TPU driver.
I0927 03:04:41.748176 46916878432128 xla_bridge.py:212] Unable to initialize backend 'tpu_driver': Not found: Unable to find driver in registry given worker: local://
I0927 03:04:41.885981 46916878432128 xla_bridge.py:212] Unable to initialize backend 'tpu': Invalid argument: TpuPlatform is not available.
I0927 03:04:45.971406 46916878432128 run_alphafold.py:377] Have 5 models: ['model_1_pred_0', 'model_2_pred_0', 'model_3_pred_0', 'model_4_pred_0', 'model_5_pred_0']
I0927 03:04:45.971634 46916878432128 run_alphafold.py:393] Using random seed 1618434774479156613 for the data pipeline
...
I0927 05:43:56.501951 46916878432128 run_alphafold.py:271] Final timings for T1050: {'features': 6624.717063903809, 'process_features_model_1_pred_0': 14.767481803894043, 'predict_and_compile_model_1_pred_0': 584.4838185310364, 'relax_model_1_pred_0': 66.8146915435791, 'process_features_model_2_pred_0': 10.930636644363403, 'predict_and_compile_model_2_pred_0': 521.780042886734, 'relax_model_2_pred_0': 50.50807785987854, 'process_features_model_3_pred_0': 10.355294227600098, 'predict_and_compile_model_3_pred_0': 504.99356603622437, 'relax_model_3_pred_0': 55.80493640899658, 'process_features_model_4_pred_0': 10.345772743225098, 'predict_and_compile_model_4_pred_0': 502.47756576538086, 'relax_model_4_pred_0': 50.57872271537781, 'process_features_model_5_pred_0': 10.470243215560913, 'predict_and_compile_model_5_pred_0': 461.3752865791321, 'relax_model_5_pred_0': 53.00025510787964}
With: use_gpu_relax=False
#SBATCH --nodes=1
#SBATCH --ntasks-per-node=1
#SBATCH --cpus-per-task=8
#SBATCH --mem=64G
#SBATCH --gres=gpu:p100:1
#SBATCH --partition=teton-gpu

--use_gpu_relax=False

At this stage, we have not noticed any observable difference between setting use_gpu_relax to True or False.

Looking at the flag help:

--[no]run_relax: Whether to run the final relaxation step on the predicted models. Turning relax off might result in predictions with distracting stereochemical violations but might help in case you are having issues with the relaxation stage.
  (default: 'true')
...
--[no]use_gpu_relax: Whether to relax on GPU. Relax on GPU can be much faster than CPU, so it is recommended to enable if possible. GPUs must be available if this setting is enabled.
  • The test that we are running might not be exercising the "final relaxation step on the predicted models", which is why we are not seeing any significant difference.

  • And/or this might be because GPU relaxation uses the tensor-core capabilities of the GPU, which we do not have on the P100s. One way to test this is sketched below.
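To see how much the relaxation stage actually contributes, you can rerun the same input with relaxation disabled via the --[no]run_relax flag shown above and compare the Final timings output against a run with relaxation enabled. A minimal sketch (all other flags as in the 2.2.0 example):

# Rerun with relaxation switched off to isolate the relax stage cost.
singularity run -B .:/etc --nv $ALPHABIN/alphafold.sif \
  --norun_relax \
  <remaining flags as in the 2.2.0 example above>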

Current Recommendations:

  • Our test file appears to require at least 64G of memory; with less, we have run into out-of-memory issues (see the sacct sketch after this list for checking actual usage).

  • If using a GPU, we have only had success using the P100s/A30s.

  • The P100s do NOT have tensor cores.

    • Using 2 GPUs showed no speed increase over one.

  • Although our V100s do have tensor cores, we have not had any successful tests. We believe this is due to the current NVIDIA driver / CUDA versions; we have ongoing testing for these.
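To check how much memory and time a completed run actually used (helpful when sharing findings with us), Slurm's accounting data can be queried. A minimal sketch, where <jobid> is the ID of your finished job:

sacct -j <jobid> --format=JobID,Partition,Elapsed,MaxRSS,State

MaxRSS shows the peak resident memory of each job step, which is a good guide for setting --mem.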

Run Times: Observations

The timings below all use the same dataset (the T1050 example above), but each run uses a different random seed, so the runs are not deterministic (i.e. they have an element of randomness); we cannot expect the same resource allocation to run in the same time.
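As the logs above show, the seed is generated per run for the data pipeline. If you want more repeatable comparisons, the flag help (--helpfull) lists a --random_seed option that lets you fix it, e.g.:

--random_seed=12345    # arbitrary example value

Even with a fixed seed, wall-clock times will still vary with node load.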

  • In CPU-only mode, 16 cores appears to run faster than 8 or 32.

  • The cascade nodes are generally faster than the teton nodes, which is to be expected as they have a newer chipset.

  • The P100s on teton-gpu are significantly faster. We only have 8 of these, so expect jobs to be queued.

These timings cover only a small subset of the resource dimensions that can be changed within a job submission, but they provide some basic insight into what to consider. Also consider that we have 180 teton nodes and 56 cascade nodes: if you have time to run simulations (e.g. over the weekend), you can submit many more jobs with a better chance of not being queued than by simply requesting the P100s and potentially having your jobs queued for hours/days. A sketch of submitting across partitions follows.
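For example, a minimal sketch that submits the same batch script (here the hypothetical run.sh from the CPU example above) to both CPU partitions:

# Submit one copy of the job to each CPU partition.
for p in teton teton-cascade; do
    sbatch --partition=$p run.sh
done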

Run times (hh:mm:ss); each cell lists the individual runs we observed for that core count.

Partition              cores:8                         cores:16                        cores:32
teton: cpu             20:28:1                         18:54:09, 13:15:21              19:50:55
teton-cascade: cpu     16:02:35                        10:56:07, 11:34:45              12:07:54, 12:06:52, 12:08:08
teton-gpu: 1 p100      02:38:10, 02:49:23, 03:04:28    02:43:51, 02:39:39, 02:53:48    02:53:40, 02:58:28
beartooth-gpu: 1 a30   -                               02:39:51                        -
