Deep learning for protein - DNA binding predictions
Title: Deep learning for protein - DNA binding predictions
DNr: Berzelius-2025-120
Project Type: LiU Berzelius
Principal Investigator: Emil Marklund <emil.marklund@scilifelab.se>
Affiliation: Stockholms universitet
Duration: 2025-03-24 – 2025-10-01
Classification: 10307
Homepage: https://marklundlab.com/
Keywords:
Abstract
Specific recognition and binding of nucleic acids by regulatory proteins control cellular processes in all life forms. While structure prediction with deep learning methods like AlphaFold has revolutionized our understanding of sequence-dependent molecular structure, we have historically had very limited data on how the genetic code specifies binding rates and affinities for interacting molecules. This has prevented us from quantitatively predicting these parameters at arbitrary locations in the genome, and thus limits our ability to understand the sequence- and structure-dependent gene-regulatory effect of a given biomolecule. In general, we are currently unable to quantitatively predict molecular recognition and function from a nucleotide or protein sequence. Gaining this quantitative understanding is of general interest in all of molecular biology, and of special interest for transcription factor (TF)-DNA binding. This binding controls gene expression, and its disruption can cause a myriad of disease states. For example, more than 50% of disease- and trait-associated variants from genome-wide association studies are found in cis-regulatory regions of the genome, such as promoters and enhancers, and transcription factors are mutated in most human cancers. Gaining a more quantitative understanding of TF-DNA binding will thus likely be necessary before we can cure these diseases.
To understand the sequence dependence of macromolecular binding, we need to train quantitative models on high-quality biophysical binding data for many sequence mutants. In my lab, we generate this type of data on massively parallel arrays, through high-throughput sequencing-fluorescent ligand interaction profiling (HiTS-FLIP). In these assays, we use a second-generation sequencing instrument to make affinity and/or kinetic measurements on hundreds of thousands of DNA sequence mutants at the same time. We have recently generated a dataset of binding rates (ka), unbinding rates (kd), and equilibrium dissociation constants (KD) for the human transcription factor KLF1 binding to a diverse library of 22,349 DNA sequence variants.
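The three measured quantities are related to each other. The sketch below assumes a simple 1:1, two-state binding model, in which KD = kd / ka and the standard binding free energy is ΔG = RT ln(KD / 1 M). The function names, the units, and the example rate values are illustrative assumptions and are not taken from the KLF1 dataset.

```python
import math

# Gas constant in kcal/(mol*K); temperature default is an assumption (25 °C).
R_KCAL = 1.987204e-3


def dissociation_constant(ka, kd):
    """Return KD in M from ka in 1/(M*s) and kd in 1/s.

    Assumes simple two-state 1:1 binding, where KD = kd / ka.
    """
    if ka <= 0 or kd <= 0:
        raise ValueError("rate constants must be positive")
    return kd / ka


def binding_free_energy(kd_molar, temperature_k=298.15):
    """Return the standard binding free energy in kcal/mol.

    Uses dG = RT * ln(KD / 1 M); more negative means tighter binding.
    """
    if kd_molar <= 0:
        raise ValueError("KD must be positive")
    return R_KCAL * temperature_k * math.log(kd_molar)


def relative_free_energy(kd_variant, kd_reference, temperature_k=298.15):
    """Return ddG in kcal/mol of a sequence variant relative to a reference."""
    return R_KCAL * temperature_k * math.log(kd_variant / kd_reference)


if __name__ == "__main__":
    # Hypothetical example values, chosen only for illustration.
    kd_value = dissociation_constant(ka=1.0e6, kd=1.0e-2)
    print(f"KD  = {kd_value:.3e} M")
    print(f"dG  = {binding_free_energy(kd_value):.2f} kcal/mol")
    print(f"ddG = {relative_free_energy(10 * kd_value, kd_value):.2f} kcal/mol")
```

Expressing each variant as a free energy difference relative to a reference sequence puts affinities on a logarithmic scale, which is a common choice of regression target for models of this kind.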
In the current project, we will train a structure-based deep learning model to understand and predict binding affinities for zinc finger TF-DNA interactions, using the HiTS-FLIP binding data for KLF1.
Compared to sequence-based models, structure-based models will allow us to understand which structural features matter for the protein-DNA interaction and, importantly, have great potential to generalize to more protein and DNA sequences and structures. We will use an open-source variant of AlphaFold3 to predict the structures of the protein-DNA complexes, which will serve as input to our model. At a later stage, we will also generate different plausible binding poses using diffusion models; using this ensemble of structural conformations as input could improve affinity predictions.
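The project text does not specify how an ensemble of poses would enter the model. As one possible illustration only, the sketch below assumes a hypothetical setup in which a model assigns a binding free energy to each pose separately, and combines them with the standard statistical-mechanics relation ΔG_ens = -RT ln Σ_i exp(-ΔG_i / RT). The function name and the per-pose values are assumptions made for this example.

```python
import math

# Gas constant in kcal/(mol*K); temperature default is an assumption (25 °C).
R_KCAL = 1.987204e-3


def ensemble_free_energy(pose_free_energies, temperature_k=298.15):
    """Combine per-pose binding free energies (kcal/mol) into one value.

    Implements dG_ens = -RT * ln(sum_i exp(-dG_i / RT)), treating each pose
    as a distinct bound state measured against the same unbound reference.
    The max-shift (log-sum-exp) keeps the exponentials numerically stable.
    """
    if not pose_free_energies:
        raise ValueError("at least one pose is required")
    rt = R_KCAL * temperature_k
    scaled = [-dg / rt for dg in pose_free_energies]
    shift = max(scaled)
    log_sum = shift + math.log(sum(math.exp(s - shift) for s in scaled))
    return -rt * log_sum


if __name__ == "__main__":
    # Hypothetical per-pose predictions, for illustration only.
    poses = [-10.2, -9.1, -7.5]
    print(f"ensemble dG = {ensemble_free_energy(poses):.2f} kcal/mol")
```

Under this relation the most favorable pose dominates, and additional poses can only make the combined free energy more negative, so a single-structure prediction is recovered when the ensemble contains one pose.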