Deep learning for protein research

Title:	Deep learning for protein research
DNr:	Berzelius-2023-361
Project Type:	LiU Berzelius
Principal Investigator:	Ingemar André <ingemar.andre@biochemistry.lu.se>
Affiliation:	Lunds universitet
Duration:	2023-12-14 – 2024-07-01
Classification:	10203
Homepage:	http://andrelab.lu.se
Keywords:

Abstract

The proposal describes two projects. In the first project we develop methods to design the three-dimensional structure of proteins. Natural proteins depend on their ability to change their structure (conformation) in response to a stimulus, such as binding to a molecule. To be able to do this, they must have alternative conformations with similar stability. Currently, there are no good methods to rationally design proteins with this property. Using generative AI protein design methods, we are developing a method to identify sequences that can fold into two distinct conformations and switch between them. This is done by hallucinating two conformations and converging their sequences during optimization. Tests on a local computer have demonstrated the ability of the method to identify these sequences. However, to identify sequences that can be tested experimentally in the lab, we need to run a much larger number of simulations on a computer cluster. The second project deals with training a Large Language Model (LLM) on DNA for protein-coding genes. Our goal is to develop an LLM that can explain how the choice of DNA (codon) sequence encodes efficient folding of proteins in the cell. The LLM learns how codon choices are informed by the codon choices of surrounding residues, and this can then be correlated with information about the three-dimensional structure of the protein the gene codes for. This model will then also be trained using information about the three-dimensional structure of the proteins, enabling structure-guided codon design for optimal protein expression.
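
To make the idea of learning a codon from its surrounding codons concrete, the following is a minimal sketch of how a coding sequence might be split into codon tokens and paired with neighbouring codons as context. It is an illustration only and not the project's pipeline: the function names `codon_tokens` and `context_pairs`, the `window` parameter, and the example sequence are all hypothetical.

```python
def codon_tokens(cds: str) -> list[str]:
    """Split a coding DNA sequence into in-frame codons (triplets)."""
    seq = "".join(cds.upper().split())  # drop whitespace and line breaks
    if len(seq) % 3 != 0:
        raise ValueError("coding sequence length must be a multiple of 3")
    return [seq[i:i + 3] for i in range(0, len(seq), 3)]


def context_pairs(tokens: list[str], window: int = 1) -> list[tuple[tuple[str, ...], str]]:
    """Pair each codon with up to `window` codons on either side.

    Hypothetical helper: a real language model would learn from much
    longer contexts, but the (context, target) structure is the same idea.
    """
    pairs = []
    for i, target in enumerate(tokens):
        left = tokens[max(0, i - window):i]
        right = tokens[i + 1:i + 1 + window]
        pairs.append((tuple(left + right), target))
    return pairs


# Made-up five-codon sequence: Met-Ala-Ala-Lys-stop.
tokens = codon_tokens("ATGGCTGCAAAATAA")
pairs = context_pairs(tokens, window=1)
```

Here `GCT` and `GCA` both encode alanine, so the two tokens differ only in codon choice, which is the kind of signal the abstract describes the model learning from.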

National Supercomputer Centre at Linköping University
