Predicting Protein Conformational Changes Using Machine Learning
SNIC Project: Berzelius-2021-4
Project Type: LiU Berzelius
Principal Investigator: Lynn Kamerlin
Affiliation: Uppsala universitet
Duration: 2021-08-18 – 2022-03-01
Classification: 10203


The application of machine learning (ML) methods to challenges in biochemistry has increased notably in recent years. This is particularly true in computational protein/enzyme engineering, where ML has been applied (among other things) to predict protein structure, improve the quality of force field parameters, and characterize complex allosteric networks. In this proposal we aim to apply ML to the large amount of data available from molecular dynamics (MD) simulations of enzyme systems that we wish to engineer, in order to identify candidate mutations. Whilst ML has previously been applied to characterize protein/enzyme allostery, a key difference between our approach and other known approaches is the nature of the input features we use. Our initial input feature set will be all the non-covalent interactions present in the entire enzyme scaffold over the course of our MD simulations (hundreds to low thousands of features in total). With this, we can develop a supervised ML pipeline that directly identifies the key hydrogen bonds, salt bridges, van der Waals (vdW) contacts, etc. throughout the enzyme scaffold that modulate the conformational dynamics of the catalytic residues that ultimately control catalysis. This approach will therefore identify specific interactions distributed throughout the enzyme that we can target with specific mutations in order to re-engineer the enzyme scaffold as desired. Furthermore, we can use phylogenetic analysis to filter the candidate mutations proposed by our ML model, avoiding mutations that are rare or absent in homologous enzymes, as these are likely to be destabilizing.
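To make the phylogenetic filtering step concrete, the following is a minimal sketch of one way it could work, not the project's actual protocol. It assumes the homologous sequences are already aligned, that candidates are given as (alignment column, wild-type residue, mutant residue) tuples, and that a simple frequency cutoff decides what counts as "rare". The function names, the toy alignment, and the `min_frequency` threshold are all hypothetical choices made for illustration.

```python
from collections import Counter


def residue_frequencies(alignment, position):
    """Fraction of homologs carrying each residue at a 0-based alignment column.

    Gap characters ("-") are ignored when computing the fractions.
    """
    column = [seq[position] for seq in alignment if seq[position] != "-"]
    counts = Counter(column)
    total = sum(counts.values())
    return {aa: n / total for aa, n in counts.items()} if total else {}


def filter_mutations(candidates, alignment, min_frequency=0.05):
    """Keep candidates whose mutant residue is seen often enough in homologs.

    Each candidate is a (position, wild_type, mutant) tuple. The cutoff
    min_frequency is an assumed, illustrative threshold.
    """
    kept = []
    for position, wild_type, mutant in candidates:
        freqs = residue_frequencies(alignment, position)
        if freqs.get(mutant, 0.0) >= min_frequency:
            kept.append((position, wild_type, mutant))
    return kept


# Made-up alignment of five homologous sequences (hypothetical data).
alignment = [
    "MKDLA",
    "MKELA",
    "MRDLA",
    "MKDIA",
    "MKE-A",
]

# Made-up candidate mutations, as if proposed by an ML model.
candidates = [
    (2, "D", "E"),  # E occurs in 2 of 5 homologs at column 2
    (2, "D", "W"),  # W never occurs at column 2
    (3, "L", "I"),  # I occurs in 1 of 4 non-gap homologs at column 3
    (1, "K", "P"),  # P never occurs at column 1
]

kept = filter_mutations(candidates, alignment, min_frequency=0.2)
print(kept)
```

In this toy case only the two substitutions that are actually observed in the homologs survive the filter. A real workflow would likely read the alignment from a file and might weight sequences by similarity, which this sketch does not attempt.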
The methodologies and protocols we develop here can be readily generalized beyond enzyme engineering to many other protein engineering challenges (such as drug discovery), and will be made freely available to the global academic community. At present, we have run the necessary MD simulations (for systems with extensive experimental data that we can ultimately benchmark against) and built ML models that show good promise using relatively computationally cheap ML methods (such as XGBoost). However, we feel that our approach would benefit from more resource-intensive deep learning approaches. This is because our main goal (beyond developing a model that is, of course, predictive) is to use feature importance for engineering, and we would therefore ideally not remove or filter the features used to build our models, so as to obtain a more complete picture of how the entire scaffold modulates the key conformational components of the enzyme. We would also like to note that there are tools designed to provide model-independent descriptions of feature importance, such as LIME and SHAP, meaning we can readily apply deep learning to learn about our system.
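The idea of a model-independent feature importance can be illustrated with a short sketch. Note that this uses permutation importance, a simpler model-agnostic technique swapped in for illustration, and not LIME or SHAP themselves. It assumes each MD frame is encoded as a row of binary interaction features (1 if the contact is present) and that the label is a conformational state. The feature names, the synthetic data, and the `predict` function (a stand-in for any trained model, such as a gradient-boosted tree or a neural network) are all hypothetical.

```python
import random


def accuracy(predict, X, y):
    """Fraction of rows for which the model's prediction matches the label."""
    return sum(predict(row) == label for row, label in zip(X, y)) / len(y)


def permutation_importance(predict, X, y, n_repeats=10, seed=0):
    """Mean drop in accuracy when each feature column is shuffled in turn.

    Only calls predict(row), so it works for any model type.
    """
    rng = random.Random(seed)
    baseline = accuracy(predict, X, y)
    importances = []
    for j in range(len(X[0])):
        drops = []
        for _ in range(n_repeats):
            column = [row[j] for row in X]
            rng.shuffle(column)  # break the link between feature j and y
            X_perm = [row[:j] + [v] + row[j + 1:] for row, v in zip(X, column)]
            drops.append(baseline - accuracy(predict, X_perm, y))
        importances.append(sum(drops) / n_repeats)
    return importances


# Synthetic data: 200 "frames", three made-up binary interaction features.
data_rng = random.Random(42)
feature_names = ["hbond_A", "salt_bridge_B", "vdw_C"]
X = [[data_rng.randint(0, 1) for _ in feature_names] for _ in range(200)]
# By construction, the state depends only on the second feature.
y = [row[1] for row in X]


def predict(row):
    """Hypothetical stand-in for a trained model's prediction function."""
    return row[1]


importances = permutation_importance(predict, X, y)
for name, value in zip(feature_names, importances):
    print(f"{name}: {value:.3f}")
```

Because the importance is computed only from the model's predictions, no features need to be removed beforehand and the same procedure applies whether the underlying model is a tree ensemble or a deep network. Here the two irrelevant features score zero, while the feature that determines the state shows a large drop in accuracy when shuffled.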