Deep learning for protein research
Title: Deep learning for protein research
DNr: Berzelius-2024-474
Project Type: LiU Berzelius
Principal Investigator: Ingemar André <ingemar.andre@biochemistry.lu.se>
Affiliation: Lunds universitet
Duration: 2025-01-01 – 2025-07-01
Classification: 10203
Homepage: http://andrelab.lu.se
Keywords:
Abstract
The proposal describes three projects. In the first project we develop methods to design the three-dimensional structure of proteins. Natural proteins depend on their ability to change their structure (conformation) in response to a stimulus, such as binding to a molecule. To do this, they must have alternative conformations with similar stability. Currently, there are no good methods to rationally design proteins with this property. Using generative AI protein design methods, we are developing a method to identify sequences that can fold into two distinct conformations and switch between them. This is done by hallucinating the two conformations and converging their sequences during optimization. Tests on a local computer and initial tests on Berzelius have demonstrated that the method can identify such sequences. However, to identify more sequences that can be tested experimentally in the lab, we need to run a much larger number of simulations on a computer cluster. To gain better control over the protein structure during hallucination, we are developing a deep learning method that generates secondary structure topologies with a language model and then uses the topology as a constraint during structure generation.
The second project deals with training a Large Language Model (LLM) on DNA for protein-coding genes. Our goal is to develop an LLM that can explain how the choice of DNA (codon) sequence encodes information for efficient folding of proteins in the cell. The LLM learns how codon choices are informed by the codon choices of surrounding residues, and this can then be correlated with information about the three-dimensional structure of the protein the gene codes for. The model will then also be trained using information about the three-dimensional structure of the proteins, enabling structure-guided codon design for optimal protein expression.
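As an illustration of what learning codon choices from surrounding codons could look like at the input level, the following is a minimal sketch and not the project's actual pipeline. It splits a coding sequence into in-frame codons, maps them to integer tokens, and masks one position so that a model would have to predict it from its neighbours. The vocabulary ordering, the `MASK_ID` token, and the function names are assumptions made for this example.

```python
from itertools import product

# All 64 codons in a fixed order; this vocabulary layout is an assumption for illustration.
BASES = "ACGT"
CODONS = ["".join(c) for c in product(BASES, repeat=3)]
CODON_TO_ID = {codon: i for i, codon in enumerate(CODONS)}
MASK_ID = len(CODONS)  # hypothetical extra token id marking a masked codon


def tokenize_cds(cds: str) -> list[int]:
    """Split a coding sequence into in-frame codons and map them to integer ids."""
    cds = cds.upper().replace("U", "T")  # accept RNA input as well
    if len(cds) % 3 != 0:
        raise ValueError("coding sequence length must be a multiple of 3")
    return [CODON_TO_ID[cds[i:i + 3]] for i in range(0, len(cds), 3)]


def mask_position(tokens: list[int], index: int) -> tuple[list[int], int]:
    """Return a copy with one codon masked, plus the id a model should predict."""
    masked = list(tokens)
    target = masked[index]
    masked[index] = MASK_ID
    return masked, target


tokens = tokenize_cds("ATGGCCTAA")           # ATG, GCC, TAA
masked, target = mask_position(tokens, 1)    # hide the middle codon
```

Tokenizing at the codon level rather than the nucleotide level keeps synonymous codons as distinct tokens, which is the information a codon-usage model needs to see.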
The third project involves using AlphaFold2 to validate designed protein sequences. We have developed a computational method to design large symmetrical protein assemblies. To determine which sequences to characterize experimentally, we fold sequences with AlphaFold2 at a large scale, predicting subunits as well as dimers, trimers and pentamers from the assembly.
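The large-scale folding step needs one input per design and oligomeric state. The sketch below shows one way such inputs could be prepared, assuming a single-component (homo-oligomeric) assembly and assuming the predictor accepts a multi-record FASTA with one record per chain copy. The function names, record naming scheme, and example sequence are hypothetical and not taken from the project.

```python
# Oligomeric states named in the text: subunit, dimer, trimer, pentamer.
STATES = {"monomer": 1, "dimer": 2, "trimer": 3, "pentamer": 5}


def oligomer_fasta(name: str, subunit_seq: str, copies: int) -> str:
    """Build a multi-record FASTA string with one record per chain copy."""
    if copies < 1:
        raise ValueError("copies must be at least 1")
    records = [f">{name}_chain{i + 1}\n{subunit_seq}" for i in range(copies)]
    return "\n".join(records) + "\n"


def build_inputs(name: str, subunit_seq: str) -> dict[str, str]:
    """Return FASTA text for every oligomeric state of one designed subunit."""
    return {
        state: oligomer_fasta(f"{name}_{state}", subunit_seq, n)
        for state, n in STATES.items()
    }


# Hypothetical design name and a toy sequence, for illustration only.
inputs = build_inputs("design1", "MKVLE")
```

Each returned string can be written to its own file and submitted as a separate prediction job, which makes the workload straightforward to spread across a cluster.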