End-to-end learning for protein docking
Title: End-to-end learning for protein docking
DNr: Berzelius-2022-230
Project Type: LiU Berzelius
Principal Investigator: Arne Elofsson <arne@bioinfo.se>
Affiliation: Stockholms universitet
Duration: 2022-12-01 – 2023-06-01
Classification: 10203
Homepage: https://bioinfo.se/


We are continuing our work on improving the prediction of protein-protein interactions. So far we have mainly focused on developing AlphaFold-based methods. Progress in this field is rapid and exciting, and our contribution over the last year has been significant, largely thanks to computational resources provided by SNIC/KAW. In short, we first developed the Fold and Dock pipeline (Bryant et al. 2022), then applied it to predict the structure of a large part of the human proteome (Burke et al.), and finally developed the MPC method to enable the prediction of large protein complexes (Bryant et al.). This work has resulted in several papers in high-impact journals; see the activity report.

Our plans for the next period are to modify (retrain) AlphaFold with the following focuses:

(1) To improve predictions of large complexes, we have several ideas. First, the reinforcement learning algorithm used in MPC does not currently handle small errors in the predicted dimers, which limits its success rate. We will therefore investigate a more flexible MCMC method based on normal-mode analysis. Secondly, a general problem in this field is that in most cases we do not know the stoichiometry of a complex. We will therefore investigate both methods that exhaustively search all stoichiometries and methods that predict the stoichiometry from additional data. Finally, we will address the problem that we cannot accurately distinguish the interactions between exact paralogs, by developing novel atomistic scoring functions.

(2) Together with Bassot (WABI) we plan to retrain AlphaFold to include constraints. This would be extremely useful for end users, as they could restrict the models to those satisfying known constraints from, for instance, mass-spectrometry or cryo-EM data. Retraining AlphaFold is hard (some reports claim it can be done with 512 GPUs in three days); we will therefore aim for a transfer-learning strategy in which the Evoformer part of AlphaFold is left untouched and only the smaller structure module is retrained.

(3) We will also make it possible to model molecules other than proteins, in particular RNA, using a system similar to AlphaFold. This is a longer-term project, which is partly limited by the available data.

In addition to these method-development projects, we continue our collaborative projects with a more biological focus. For most of these, we expect the collaborators to apply for their own computing resources, but for some we will help them run AlphaFold (or other programs). I will mention just one project here: the predictions of toxin-antitoxin systems, made in collaboration with Gemma Atkinson, have already provided novel biological insights into these systems, which are currently being tested in the lab.
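
As an illustration of the exhaustive stoichiometry search mentioned under focus (1), the following is a minimal sketch and not the project's actual implementation. The function name `enumerate_stoichiometries` and the limits `max_copies` and `max_total` are hypothetical, introduced here only to show how the candidate space could be bounded. It assumes every chain occurs at least once in the complex.

```python
from itertools import product


def enumerate_stoichiometries(chains, max_copies, max_total):
    """Yield every candidate stoichiometry as a dict {chain: copy number}.

    Assumptions (illustrative only): each chain is present at least once,
    no chain occurs more than `max_copies` times, and the complex holds
    at most `max_total` chains in total.
    """
    for counts in product(range(1, max_copies + 1), repeat=len(chains)):
        if sum(counts) <= max_total:
            yield dict(zip(chains, counts))


# Example: two unique chains, at most 3 copies each, at most 4 chains in total.
candidates = list(enumerate_stoichiometries(["A", "B"], max_copies=3, max_total=4))
for stoichiometry in candidates:
    print(stoichiometry)
```

The number of candidates grows exponentially with the number of unique chains, since each candidate would require its own structure prediction. This is one reason why predicting the stoichiometry from additional data may be preferable to exhaustive search for larger complexes.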