End-to-end learning for protein docking
Title: End-to-end learning for protein docking
SNIC Project: Berzelius-2021-29
Project Type: LiU Berzelius
Principal Investigator: Arne Elofsson <arne@bioinfo.se>
Affiliation: Stockholms universitet
Duration: 2021-08-25 – 2021-12-01
Classification: 10203
Homepage: https://bioinfo.se/


Central to cell and molecular biology is understanding the function of biological macromolecules, in particular proteins. For this understanding, and for developing drugs that target proteins, it is essential to know the structures of proteins. However, proteins do not act alone; they act together with other proteins. In December 2020, the results of CASP14 were presented, showing that the AlphaFold2 method developed by DeepMind could predict the structure of most proteins with an accuracy close to that of experimental structures. Recently, the method was released [1], and a database containing the predicted structures of almost all proteins has been made available [1,2]. Therefore, we are convinced that the next major challenge is to use these types of methods for predicting protein-protein interactions.
We have recently developed a “fold-and-dock” protocol, Pconsdock, based on an earlier structure prediction methodology. Here, a “merged” multiple sequence alignment (MSA) is created using a heuristic approach to detect interacting orthologs within a species. We found that this methodology is on par with traditional docking methods, but complementary to them and showing much promise. With this method, we identified the crucial factor determining the quality of the merged alignment. None of the heuristic methods we examined worked optimally for all cases.
AlphaFold2, as well as other methods, does not use precomputed features from the MSA; instead, it uses an embedding representation of the MSA with an attention mechanism, similar to what has been developed in NLP models for proteins. In preliminary studies, we have found that this methodology also provides some advantages for protein-protein interactions (manuscript in preparation). Of particular promise are the MSA transformer and related models. The fundamental data underlying the rapid progress in protein structure prediction over the last decade is the coevolution between positions that interact in a protein.
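To make the notion of coevolution between positions concrete, the following is a minimal sketch that scores two alignment columns by their mutual information. It is an illustration only: the function name `column_mutual_information` and the toy alignment are hypothetical, and raw mutual information is a much simpler stand-in for the learned attention-based representations discussed in this proposal.

```python
import math
from collections import Counter


def column_mutual_information(msa, i, j):
    """Mutual information (in bits) between alignment columns i and j.

    `msa` is a list of equal-length aligned sequences (hypothetical toy input).
    """
    n = len(msa)
    counts_i = Counter(seq[i] for seq in msa)
    counts_j = Counter(seq[j] for seq in msa)
    counts_ij = Counter((seq[i], seq[j]) for seq in msa)

    mi = 0.0
    for (a, b), count in counts_ij.items():
        p_ab = count / n
        # p_ab / (p_a * p_b), written with raw counts to limit rounding error
        ratio = p_ab * n * n / (counts_i[a] * counts_j[b])
        mi += p_ab * math.log2(ratio)
    return mi


# Toy alignment (assumption, not real data): columns 0 and 1 always change
# together, while columns 0 and 3 vary independently of each other.
toy_msa = ["AKLE", "AKLD", "GRLE", "GRLD"]

coupled = column_mutual_information(toy_msa, 0, 1)
uncoupled = column_mutual_information(toy_msa, 0, 3)
print(f"columns 0,1: {coupled:.2f} bits; columns 0,3: {uncoupled:.2f} bits")
```

In this toy case the covarying column pair carries one bit of shared information and the independent pair carries none, which is the kind of signal that contact prediction from an MSA relies on.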
However, when predicting interactions between two proteins, two MSAs need to be merged (i.e. the interacting protein pairs need to be identified). We propose to build upon the MSA transformer, in which attention is applied to both the rows and the columns of a multiple sequence alignment. We will start from a “complete” multiple sequence alignment containing all possible interacting pairs (this is about 100x larger than the compressed one, as we only consider pairs from within the same species). Then we will optimize the transformer with a target function on inter-chain protein interactions. This should be fully differentiable and therefore provide a much faster optimization. In the current implementation of the MSA transformer, the row-wise attention heads provide a very good proxy for interacting residues, so it is quite likely that these can be used as a quick way to optimize the performance.
Secondly, we will start developing end-to-end differentiable protein-protein docking, using an iterative end-to-end approach with 3D equivariance based on relative positions and orientations of local patterns in Fourier space.
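The difference between a compressed merged alignment and the “complete” one containing all possible within-species pairs can be sketched as follows. This is a minimal illustration under stated assumptions, not the Pconsdock implementation: the function names, the `(species, sequence)` record layout, and the "first hit per species" rule used for the compressed case are all hypothetical.

```python
from collections import defaultdict
from itertools import product


def group_by_species(msa):
    """Group (species, aligned_sequence) records by species label."""
    groups = defaultdict(list)
    for species, seq in msa:
        groups[species].append(seq)
    return groups


def pair_msas(msa_a, msa_b, complete=True):
    """Concatenate rows of two MSAs whose sequences come from the same species.

    complete=True  -> every A x B combination within each shared species
    complete=False -> only the first sequence of each chain per species
                      (a simple stand-in for a heuristic pairing rule)
    """
    by_a, by_b = group_by_species(msa_a), group_by_species(msa_b)
    paired = []
    for species, seqs_a in by_a.items():
        if species not in by_b:
            continue  # no partner candidate in this species
        seqs_b = by_b[species]
        if not complete:
            seqs_a, seqs_b = seqs_a[:1], seqs_b[:1]
        paired.extend((species, a + b) for a, b in product(seqs_a, seqs_b))
    return paired


# Toy input (assumption, not real data): chain A has two human paralogs,
# chain B has three, so the complete alignment holds 2 x 3 human rows.
msa_a = [("human", "AKL"), ("human", "AKV"), ("mouse", "GKL"), ("yeast", "AKI")]
msa_b = [("human", "DE"), ("human", "DD"), ("human", "EE"), ("mouse", "DE")]

complete_rows = pair_msas(msa_a, msa_b, complete=True)
compressed_rows = pair_msas(msa_a, msa_b, complete=False)
print(len(complete_rows), len(compressed_rows))
```

The complete alignment grows with the product of paralog counts per species, which is why restricting pairs to the same species keeps its size manageable compared with pairing across all sequences.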