Title: Machine learning for protein structure prediction
DNr: Berzelius-2026-12
Project Type: LiU Berzelius
Principal Investigator: Arne Elofsson <arne@bioinfo.se>
Affiliation: Stockholms universitet
Duration: 2026-02-01 – 2026-08-01
Classification: 10203
Homepage: https://bioinfo.se/
Keywords:

Abstract

Our research addresses fundamental challenges in computational protein science by developing and deploying large-scale machine-learning methods for protein structure prediction, protein–protein interaction (PPI) modelling, and protein design. These problems are central to understanding biological mechanisms and to enabling applications in biotechnology, immunology, and synthetic biology, yet they remain computationally demanding, and current confidence metrics and sampling strategies address them poorly. Access to Berzelius is therefore essential for both method development and large-scale inference.
A major focus of our work is a machine-learning framework to improve model selection for AlphaFold-predicted antibody–antigen (Ab–Ag) complexes. AlphaFold 3 generates large ensembles of decoy structures, but its native confidence scores often fail to identify biologically correct binding modes, particularly for Ab–Ag systems, where binding is governed by highly flexible complementarity-determining region (CDR) loops. We are developing a specialised model that operates directly on ensembles of AlphaFold decoys and exploits inter-chain Predicted Aligned Error (PAE) patterns to discriminate native-like interfaces from non-native ones. This requires large-scale generation of AlphaFold decoy ensembles across diverse Ab–Ag systems, followed by GPU-intensive training, hyperparameter optimisation, and cross-validation. Berzelius is uniquely suited to this workload, combining Ampere nodes for high-throughput structure generation with Hopper nodes for large-scale model training.
Beyond Ab–Ag modelling, we conduct extensive research on PPI prediction, including the development and benchmarking of methods for homodimer structure prediction, where Berzelius has already been instrumental. In parallel, we are generating large-scale RNA 3D structure ensembles for thousands of RNA families in the Rfam database that currently lack experimentally resolved structures. The RNA work requires sustained GPU inference and large-scale ensemble analysis, leading to substantial computational and storage demands.
Methodologically, we contribute new generative modelling approaches for biomolecular systems. We have developed Hessian-Informed Flow Matching (HI-FM) to better represent molecular energy landscapes in stochastic systems, with demonstrated success for equilibrium modelling and strong potential for molecular dynamics and docking applications. We have also introduced the first SE(3)-equivariant flow-matching framework for protein design, enabling controllable generation of novel protein structures. Ongoing work extends this framework to conditional generation, atomic-level side-chain flexibility, and the design of functional protein catalysts. In parallel, we are developing efficient equivariant graph neural networks that predict protein flexibility and conformational ensembles directly from structure, emulating costly molecular dynamics simulations.
Our work is strongly collaborative and infrastructure-driven. Together with NBIS, we have optimised AlphaFold pipelines on Berzelius, including a GPU-accelerated MMseqs2 implementation. We maintain active collaborations with KTH and the Max Planck Institute for Polymer Research, including visiting researchers who currently use Berzelius resources. Continued and expanded access to Berzelius is critical to overcoming current computational bottlenecks and sustaining our internationally competitive research programme in computational protein science.