Machine learning for protein structure prediction

Title:	Machine learning for protein structure prediction
DNr:	Berzelius-2025-214
Project Type:	LiU Berzelius
Principal Investigator:	Arne Elofsson <arne@bioinfo.se>
Affiliation:	Stockholms universitet
Duration:	2025-08-01 – 2026-02-01
Classification:	10203
Homepage:	https://bioinfo.se/
Keywords:

Abstract

Our research team works in computational protein science, focusing on protein structure prediction, protein-protein interaction (PPI) detection, and protein design. These areas are essential for understanding fundamental biological processes and for applications in biotechnology, medicine, and synthetic biology. Building on state-of-the-art tools such as AlphaFold and OpenFold, we aim to improve prediction accuracy, explore new protein architectures, and generate biological insights.

A cornerstone of our research is the development of Mamba, a structured state-space model designed to replace the transformer-based attention mechanism in AlphaFold. Mamba addresses the bottlenecks of long-sequence protein modeling by scaling linearly with sequence length, which substantially reduces both inference time and memory usage. Implementing this requires extensive architectural modifications and computationally intensive retraining cycles, often involving up to 32 GPUs for a week.

We also require CPU processing, for which we prefer the Tetralith system. Our Tetralith resources are used primarily for pre- and post-processing of data for the larger-scale analyses run on the Berzelius supercomputer, as these jobs are not optimized for GPU use. We are currently constrained by disk space on Tetralith, so an increase in our disk quota would greatly facilitate our ongoing projects.

Beyond structure prediction, we also work on improving PPI detection and prediction.
In collaboration with experimental partners, we are validating our computational predictions through native mass spectrometry (nMS) and cryo-electron tomography (cryo-ET). Our recent studies have focused on benchmarking methods to improve the accuracy of homomeric and heteromeric interaction predictions, and on extending our methods to RNA and other macromolecules. This work contributes to a comprehensive understanding of cellular machinery.

Our methodological contributions include Hessian-Informed Flow Matching (HI-FM), which refines the representation of molecular energy landscapes in stochastic systems. The approach has been successful in modeling equilibrium distributions and has potential for applications in molecular dynamics and small-molecule binding prediction.

Collaboration is central to our work. Our partnership with NBIS has enabled us to optimize AlphaFold pipelines on the Berzelius supercomputer, including the development of a GPU-accelerated MMseqs2 implementation. These developments, along with several publications set to emerge in 2024, reflect our commitment to advancing protein science. An increased resource allocation on both Berzelius and NAISS would help us overcome existing computational bottlenecks, accelerate discovery, and remain competitive in this rapidly evolving field.
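
The linear-scaling argument made above for state-space models can be illustrated with a small sketch. The code below is not Mamba and not the project's implementation: it is a plain, non-selective diagonal state-space recurrence, and the function names (ssm_scan, attention_score_entries) and parameters (a, b, c) are hypothetical, chosen only for illustration. It shows that a recurrent scan visits each sequence position once and carries a fixed-size state, whereas a full attention map needs one score per pair of positions.

```python
def ssm_scan(xs, a, b, c):
    """Run a diagonal linear state-space recurrence over a 1-D sequence.

    Per state channel i:  h_t[i] = a[i] * h_{t-1}[i] + b[i] * x_t
    Output per step:      y_t    = sum_i c[i] * h_t[i]

    Illustrative assumption: a, b, c are fixed lists of equal length.
    (Mamba makes such parameters input-dependent; that is omitted here.)
    """
    state = [0.0] * len(a)  # fixed-size state, independent of sequence length
    ys = []
    for x in xs:  # one pass over the sequence: O(L) steps
        state = [ai * hi + bi * x for ai, hi, bi in zip(a, state, b)]
        ys.append(sum(ci * hi for ci, hi in zip(c, state)))
    return ys


def attention_score_entries(seq_len):
    """Number of pairwise scores in one full self-attention map: L * L."""
    return seq_len * seq_len


if __name__ == "__main__":
    # An impulse input decays geometrically through the single state channel.
    print(ssm_scan([1.0, 0.0, 0.0], a=[0.5], b=[1.0], c=[2.0]))
    # Doubling the sequence length quadruples the attention map size,
    # while the scan above only takes twice as many steps.
    for length in (1000, 2000):
        print(length, attention_score_entries(length))
```

The state size here depends only on the number of channels, so memory for the recurrence stays constant as the sequence grows; this is the property that the abstract relies on for long protein sequences.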

National Supercomputer Centre at Linköping University
