Machine learning for protein structure prediction
| Title: | Machine learning for protein structure prediction |
| DNr: | Berzelius-2026-197 |
| Project Type: | LiU Berzelius |
| Principal Investigator: | Arne Elofsson <arne@bioinfo.se> |
| Affiliation: | Stockholms universitet |
| Duration: | 2026-08-01 – 2027-02-01 |
| Classification: | 10203 |
| Homepage: | https://bioinfo.se/ |
| Keywords: | |
Abstract
Our research addresses fundamental challenges in computational protein science by developing and deploying large-scale machine-learning methods for biomolecular structure prediction, protein–protein interaction modelling, and protein design. Describing proteins and biomolecular complexes in their biological context is central to understanding molecular mechanisms and enabling applications in biotechnology, immunology, and synthetic biology. These problems remain computationally demanding, particularly when current models generate large structural ensembles, require extensive sampling, or fail to capture physical constraints. Continued access to Berzelius is therefore essential for both large-scale inference and method development.
A major focus is on improving model selection for AlphaFold-predicted antibody–antigen complexes. AlphaFold 3 can generate ensembles of plausible decoy structures, but its native confidence scores often fail to identify biologically correct binding modes, especially in antibody–antigen systems where recognition depends on flexible complementarity-determining region loops. We have developed a specialised machine-learning model that operates directly on AlphaFold decoy ensembles and uses inter-chain Predicted Aligned Error patterns to distinguish native-like from non-native interfaces. This work required large-scale generation of decoy ensembles across diverse antibody–antigen systems, followed by GPU-intensive training, hyperparameter optimisation, and cross-validation. The first manuscript has been submitted and is currently under major revision, while follow-up studies are ongoing.
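To illustrate the kind of signal that inter-chain Predicted Aligned Error can carry, the sketch below summarises the off-diagonal (chain-to-chain) blocks of a PAE matrix for a single decoy. It is a minimal sketch and not the model described above: the function name `interchain_pae_features`, the choice of summary statistics, and the 5 Å cutoff are all assumptions made for illustration.

```python
import numpy as np


def interchain_pae_features(pae, chain_ids, chain_a, chain_b, cutoff=5.0):
    """Summarise the PAE blocks between two chains of one decoy.

    pae       : (N, N) array of predicted aligned errors in angstrom.
    chain_ids : length-N sequence giving the chain label of each residue.
    chain_a/b : the two chain labels that form the interface of interest.
    cutoff    : hypothetical threshold (angstrom) for a "confident" pair.
    """
    pae = np.asarray(pae, dtype=float)
    chain_ids = np.asarray(chain_ids)
    if pae.ndim != 2 or pae.shape[0] != pae.shape[1]:
        raise ValueError("pae must be a square matrix")
    if pae.shape[0] != chain_ids.shape[0]:
        raise ValueError("chain_ids must have one label per residue")

    idx_a = np.flatnonzero(chain_ids == chain_a)
    idx_b = np.flatnonzero(chain_ids == chain_b)
    if idx_a.size == 0 or idx_b.size == 0:
        raise ValueError("both chains must be present in chain_ids")

    # PAE is not symmetric in general, so both off-diagonal blocks are used.
    block_ab = pae[np.ix_(idx_a, idx_b)]
    block_ba = pae[np.ix_(idx_b, idx_a)]
    values = np.concatenate([block_ab.ravel(), block_ba.ravel()])

    return {
        "mean": float(values.mean()),
        "min": float(values.min()),
        "frac_below_cutoff": float((values < cutoff).mean()),
    }
```

Computing such a feature dictionary for every decoy in an ensemble would give one row per decoy, which could then be passed to any ranking model. The features a trained model actually relies on may differ substantially from these hand-picked statistics.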
Beyond antibody–antigen modelling, we continue to develop and benchmark methods for protein–protein interaction prediction, including homodimer structure prediction, where Berzelius has already been instrumental. We are also generating large-scale RNA 3D-structure ensembles for thousands of Rfam families that lack experimentally resolved structures. Most inference for the first study has been completed, and the next step is to retrain Boltz-2, or potentially AlphaFold 3, depending on licensing, to better capture ion-dependent RNA structure.
A second major direction is the development of physically informed all-atom generative models for biomolecular complexes. Recent models such as AlphaFold 3, Chai-1 and Boltz can produce plausible protein–ligand and protein–nucleic acid structures, but predictions may remain physically invalid because of steric clashes, incorrect chirality, unrealistic bond geometry or high energetic penalties. Current approaches partly address this with expensive inference-time guidance, but the models themselves do not explicitly learn the system's physical energy. We therefore aim to build on Boltz by training all-atom generative models to predict physics-based energies and forces, inspired by machine-learned interatomic potentials. This will require modifying the structure-prediction modules, integrating invariant energy and force-prediction layers, jointly training on denoising and DFT-derived energy data, and retraining the full Boltz model, comprising approximately 650 million parameters, with extensive ablation studies.
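The relationship between an energy and its forces that machine-learned interatomic potentials rely on can be sketched with a toy example. The code below is an illustration under stated assumptions, not the planned Boltz modification: `toy_energy` is a hypothetical harmonic pair energy standing in for a learned energy head, and the reference distance `r0` is arbitrary. It shows that forces defined as the negative gradient of a distance-based energy agree with a finite-difference check, that the energy is unchanged when the structure is rotated, and that the forces rotate with the structure.

```python
import numpy as np


def toy_energy(coords, r0=1.5):
    """Hypothetical energy: sum of 0.5 * (d_ij - r0)^2 over all atom pairs."""
    diff = coords[:, None, :] - coords[None, :, :]
    dist = np.linalg.norm(diff, axis=-1)
    upper = np.triu_indices(len(coords), k=1)  # count each pair once
    return float(0.5 * np.sum((dist[upper] - r0) ** 2))


def toy_forces(coords, r0=1.5):
    """Analytic forces F_i = -dE/dx_i for toy_energy."""
    diff = coords[:, None, :] - coords[None, :, :]  # diff[i, j] = x_i - x_j
    dist = np.linalg.norm(diff, axis=-1)
    np.fill_diagonal(dist, 1.0)  # avoid 0/0 on the diagonal
    coeff = (dist - r0) / dist
    np.fill_diagonal(coeff, 0.0)  # an atom exerts no force on itself
    grad = np.sum(coeff[:, :, None] * diff, axis=1)
    return -grad


def finite_difference_forces(coords, eps=1e-5):
    """Central-difference estimate of -dE/dx, used only as a check."""
    forces = np.zeros_like(coords)
    for i in range(coords.shape[0]):
        for k in range(coords.shape[1]):
            plus, minus = coords.copy(), coords.copy()
            plus[i, k] += eps
            minus[i, k] -= eps
            forces[i, k] = -(toy_energy(plus) - toy_energy(minus)) / (2 * eps)
    return forces


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    x = 2.0 * rng.normal(size=(5, 3))
    rot, _ = np.linalg.qr(rng.normal(size=(3, 3)))  # random orthogonal matrix

    print(np.allclose(toy_forces(x), finite_difference_forces(x), atol=1e-6))
    print(np.isclose(toy_energy(x @ rot.T), toy_energy(x)))
    print(np.allclose(toy_forces(x @ rot.T), toy_forces(x) @ rot.T))
```

In a learned model the analytic gradient would typically be obtained by automatic differentiation of the predicted energy rather than derived by hand. The same three checks remain useful as sanity tests when adding energy and force outputs to a structure-prediction network.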
Methodologically, we also develop Hessian-Informed Flow Matching for molecular energy landscapes, SE(3)-equivariant flow matching for protein design, and efficient equivariant graph neural networks for predicting flexibility and conformational ensembles. Together with NBIS, KTH and international collaborators, we have optimised AlphaFold-related workflows on Berzelius. Continued access to Ampere and Hopper nodes is critical for sustaining this internationally competitive research programme.