Property transfer to protein sequences with self-supervised learning
Title: Property transfer to protein sequences with self-supervised learning
DNr: Berzelius-2024-205
Project Type: LiU Berzelius
Principal Investigator: Aleksej Zelezniak
Affiliation: Chalmers tekniska högskola
Duration: 2024-05-30 – 2024-12-01
Classification: 10203


The discovery of highly thermostable enzymes, such as polymerases, has been a cornerstone of modern biotechnology, enabling the development of robust biocatalysts essential for various industrial processes. Thermostability is a critical requirement for enzymes, allowing them to function effectively at elevated temperatures and over extended periods. Nevertheless, the widespread adoption of enzymatic processes in green biotechnology is limited by the inherent instability, low activity, and high production costs of enzymes compared to conventional chemical catalysts, which impedes the advancement of sustainable biotechnological processes. Traditional methods for producing thermostable enzymes, including experimental evolution and ancestral sequence reconstruction, are laborious and time-consuming. They are particularly impractical for optimising rare enzyme sequences, such as plastic-degrading enzymes that require significant stability enhancements.
This proposal introduces a novel generative AI approach, THOR (THermally Optimised Representations). THOR uses self-supervised learning to develop generalised representations of enzyme thermostability without labels. The method resembles style transfer techniques that do not rely on paired label mapping, thus circumventing the data scarcity that is prevalent in biological research. Our preliminary lab results show that THOR increased the melting temperatures of several enzymes to as high as 90 °C (32 of 40 tested variants from 3 protein families were successful), including zero-shot style transfer on enzyme families that the neural network had never encountered. Building on these promising results, the project aims to scale the training of the THOR network to the entire UniProt database, encompassing over 200 million protein sequences.
Large protein language foundation models are currently trained only by big tech companies (Meta, DeepMind, Salesforce), which unfortunately limits academic biological ML research to practically untested ideas. The computational capabilities of the Berzelius resource enable such scaling for the first time, allowing the development of a generalised model of thermostability transfer.
Our laboratory has a proven track record in developing and experimentally validating advanced generative sequence methods for protein and DNA design. Notably, we have pioneered the use of attention-based deep convolutional networks trained with adversarial loss (Repecka et al. 2021, Nature Machine Intelligence), generative adversarial networks incorporating expectation maximisation to control gene expression (Zrimec et al. 2022; Zrimec et al. 2020, Nature Communications), and transformer-based techniques for navigating protein sequence spaces (Buric et al. 2023, preprint). Additionally, we have recently benchmarked multiple state-of-the-art large protein language and structural models for functional protein design against experimental data (Johnson et al. 2024, Nature Biotechnology). We likewise envision disseminating the results of this work in top-tier scientific journals such as Nature Biotechnology.
In conclusion, the THOR approach represents a transformative leap in enzyme engineering, offering a scalable, data-efficient method to enhance enzyme thermostability. By leveraging cutting-edge AI techniques and large protein databases, this project aims to overcome the current limitations of enzyme-based biocatalysts, paving the way for more sustainable and economically viable biotechnological processes. We seek support to scale our THOR framework and ensure its broad applicability and impact across diverse protein families and industrial applications.