Title: Retraining Boltz (Hopper allocation)
DNr: Berzelius-2026-48
Project Type: LiU Berzelius
Principal Investigator: Arne Elofsson <arne@bioinfo.se>
Affiliation: Stockholms universitet
Duration: 2026-03-05 – 2026-08-01
Classification: 10203
Homepage: https://bioinfo.se/
Keywords:

Abstract

This application is not written as a continuation project because we do not want to end the ongoing project; we would like to use the same directory as in our ongoing project.

Our research addresses fundamental challenges in computational protein science by developing and deploying large-scale deep learning methods for protein structure modelling and protein design. Describing and capturing protein structures as close as possible to their biological context is central to understanding mechanisms and to enabling applications in biotechnology, immunology, and synthetic biology. One of the most fundamental and still only partially solved open problems is predicting complexes between proteins and other biomolecules, including small molecules, metal ligands, DNA/RNA, and regulatory peptides. Over the past two years, a series of all-atom generative models has been developed to predict proteins in complex with their non-proteinogenic ligands. However, even state-of-the-art models such as AlphaFold3, Chai-1, and Boltz exhibit distinct failure modes. While they can produce plausible structural models of acceptable quality, their predictions are often invalid because they violate physics-based sanity checks. Commonly identified violations include interatomic clashes, incorrect chirality, large energetic penalties due to van der Waals volume overlaps, and invalid bond lengths or angles. To date, only one open-source model has attempted to address these shortcomings directly. The Boltz implementation circumvents physical violations with expensive inference-time guidance by a force-field-inspired potential, in which multiple independent prediction trajectories must be sampled and evaluated with a reward function. Boltz significantly improves physical validity over other baselines, including AlphaFold3, but can still produce incorrect and unphysical predictions because the model never explicitly learned the physical energy of the system.
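The inference-time guidance described above amounts to best-of-N selection under a reward: sample several independent trajectories, score each, and keep the best. A minimal sketch of that selection step, with a hypothetical `reward_fn` standing in for Boltz's actual force-field-inspired potential, is:

```python
def select_best(samples, reward_fn):
    """Best-of-N selection sketch: score each independently sampled
    prediction trajectory with a reward function and keep the
    highest-scoring one. `reward_fn` is a placeholder for a
    physics-based scoring potential, not Boltz's real implementation."""
    return max(samples, key=reward_fn)


# Illustrative usage: prefer structures with the fewest "violations"
# (here the samples are just numbers and the reward penalises magnitude).
best = select_best([3, -1, 2], lambda s: -abs(s))
```

The cost noted in the text comes from having to run the full sampler N times before this selection can be applied, which is what motivates learning the energy during training instead.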
In this project, we propose to train all-atom generative models for protein structure to predict physics-based (not purely statistical) energies and forces directly, inspired by machine-learned interatomic potentials (MLIPs), rather than correcting unphysical structures at inference time. We hypothesise that explicitly learning an energy potential will help the model generalise in the low-data regime, produce more physically consistent biomolecular complexes, and enable energy steering for the inverse problem of protein complex design. To this end, we plan to build on Boltz and develop a training framework that encourages the model to learn interaction forces. Starting from Boltz weights trained only on denoised atomic coordinates, we will modify the structure-prediction modules. Specifically, we aim to integrate:

- separate MLP layers with atom-index and translation invariance, inspired by SchNet [5], to learn (conserved) energies and forces of molecular interactions from intermediate structural representations;
- a joint training procedure on both denoising and DFT energy prediction, using the SPICE dipeptide and amino acid / small molecule datasets;
- integration of the energy regression at inference time.

Implementing these changes will require retraining the entire Boltz model (~650M parameters) on the complete dataset and benchmarking it through a series of ablations to investigate the contribution of energy-potential training to generalisation. If successful, the resulting model could become a new state of the art in biomolecular complex prediction.
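The invariance properties asked of the energy head can be illustrated with a toy module. This is a minimal sketch, not the planned Boltz modification: operating on interatomic distances gives translation (and rotation) invariance, summing per-pair terms gives invariance to atom indexing, and taking forces as the negative gradient of the energy makes them conserved by construction, as in SchNet-style potentials.

```python
import torch
import torch.nn as nn


class EnergyForceHead(nn.Module):
    """Toy MLP energy head over interatomic distances (illustrative only).

    Distances make the energy invariant to global translations and
    rotations; summing over atom pairs makes it invariant to atom
    indexing. Forces are F = -dE/dx, so they are conserved.
    """

    def __init__(self, hidden: int = 64):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(1, hidden), nn.SiLU(),
            nn.Linear(hidden, hidden), nn.SiLU(),
            nn.Linear(hidden, 1),
        )

    def forward(self, coords: torch.Tensor):
        # coords: (n_atoms, 3). Detach so this sketch owns its own
        # gradient graph; a real training setup would keep the graph.
        coords = coords.detach().clone().requires_grad_(True)
        n = coords.shape[0]
        i, j = torch.triu_indices(n, n, offset=1)   # unique atom pairs
        d = (coords[i] - coords[j]).norm(dim=-1)    # pairwise distances
        energy = self.mlp(d.unsqueeze(-1)).sum()    # permutation-invariant scalar
        # Conserved forces as the negative energy gradient; create_graph
        # would let a force-matching loss backpropagate through this.
        forces = -torch.autograd.grad(energy, coords, create_graph=True)[0]
        return energy, forces
```

Because the energy depends only on distances, translating all atoms leaves it unchanged, and the forces sum to zero (no net force on the system), which is one easy sanity check for such a head.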
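The proposed joint training on denoising and DFT energy prediction could take the form of a weighted sum of losses. The function below is a sketch only; the loss names, force term, and weights are illustrative assumptions, not the project's planned hyperparameters.

```python
import torch


def joint_loss(pred_coords, true_coords,
               pred_energy, dft_energy,
               pred_forces, dft_forces,
               w_energy: float = 0.1, w_force: float = 0.1):
    """Hypothetical joint objective: coordinate-denoising MSE plus
    regression to DFT reference energies and forces (e.g. as provided
    by the SPICE dataset). Weights are placeholders."""
    l_denoise = torch.mean((pred_coords - true_coords) ** 2)
    l_energy = torch.mean((pred_energy - dft_energy) ** 2)
    l_force = torch.mean((pred_forces - dft_forces) ** 2)
    return l_denoise + w_energy * l_energy + w_force * l_force
```

In practice the denoising term would be the existing Boltz diffusion loss and the energy/force terms would only be active on batches drawn from the quantum-chemistry datasets, but the weighted-sum structure is the same.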