Stacked Protein Prediction
Title: Stacked Protein Prediction
SNIC Project: Berzelius-2021-86
Project Type: LiU Berzelius
Principal Investigator: Ross King <rossk@chalmers.se>
Affiliation: Chalmers tekniska högskola
Duration: 2021-11-24 – 2022-06-01
Classification: 10601
Keywords:

Abstract

Activity title: Stacked Public Domain Protein Structure Prediction Method Better than AlphaFold
Proposal lead: Ross D. King

Summary and goals of the activity

1. Proposal

In August 2021 two extremely impressive machine learning (ML) based protein structure prediction (PSP) methods were published and placed in the public domain (code and models): the Baker method (4) and AlphaFold (5,6). We propose to:

1) Use the Baker and AlphaFold PSP methods to predict the known, experimentally determined protein structures (in PDB). This will test the reproducibility of both methods, which are extremely complex. Note that the PDB predictions are not currently in the public domain, and many questions remain about their quality, as obvious errors exist. Output: a unique new public domain ML dataset for training PSP methods.

2) Apply the ML methods of stacking (7,8) and meta-ML (9) to the generated data (the predictions and the experimentally determined structures) to learn how to combine the AlphaFold PSP method with that of Baker, etc., to form a prediction method that is better than AlphaFold. Output: a stacked public domain PSP server more accurate than AlphaFold.

2. Background

Solving PSP has been a 'holy grail' of science since the Nobel Prize-winning work of Anfinsen. An important step towards this goal is AlphaFold, probably the highest-profile application of AI to science. This success is fundamentally driven by exponentially cheaper compute and DNA sequencing. We argue that the best PSP method should be in the public domain and freely available to all scientists.

An almost guaranteed way to improve on AlphaFold is to use stacking and meta-ML to combine AlphaFold with other PSP methods (especially the Baker method). Stacking and meta-ML are forms of ensemble ML in which multiple baseline models are first learnt, and a meta-model is then learnt from the outputs of the baseline models. In the case of PSP the baseline models are AlphaFold, the Baker method, SWISSMODEL, Phyre2, etc. All these PSP methods outperform AlphaFold on certain predictions. Stacking and meta-ML are normally applied to classification or regression methods, whereas PSP methods output a 'structure'. How best to use stacking and meta-ML in this context is an open ML problem.

References

1. King, R.D. (1987). In: Progress in Machine Learning. Sigma Press.
2. King, R.D. & Sternberg, M.J.E. (1990). J. Mol. Biol. 216, 441–457.
3. Xie, Z., … (2017). IEEE Transactions on Pattern Analysis and Machine Intelligence 40, 1903–1917.
4. Baek, M. et al. (2021). Science 373, 871–876.
5. Jumper, J. et al. (2021). Nature 596, 583–589.
6. Tunyasuvunakool, K. et al. (2021). Nature 596, 590–596.
7. Wolpert (1992). Stacked Generalization. Neural Networks 5 (2), 241–259.
8. Olier, .. King, R.D. (2021). Transformational Machine Learning. Proc. Nat. Acad. Sci. U.S.A. (in press).
9. Olier, .. King, R.D. (2017). Machine Learning Journal 107, 285–311.

DDLS-WASP Priorities

The proposal fits the DDLS and WASP priorities and builds on Sweden's strength in AI and biomedicine. The proposal's strategic vision is to develop the world's best PSP service. It would be good if this was done in Sweden. If I can't do this in Sweden, I will approach Argonne National Lab, where I have contacts and a huge computer. The application is timely: the Baker and AlphaFold prediction methods were published this month. The proposal is interdisciplinary, combining ML expertise, structural biology, and mathematics. The proposal will help train the next generation of AI scientists and structural biologists.
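
Illustrative sketch of stacking

The two-level scheme described in the Background (baseline models first, then a meta-model learnt from their outputs) can be illustrated with a toy example. This is only a minimal sketch under strong assumptions, not the method proposed here: the baseline "predictors" are simulated as the true value plus independent noise, the target is a single scalar per example (for instance, one inter-residue distance) rather than a full structure, and the meta-model is a plain least-squares linear combination. All function names (`solve_linear_system`, `fit_meta_model`, `stacked_predict`, `run_demo`) are hypothetical and chosen for illustration only.

```python
import random


def solve_linear_system(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        tail = sum(m[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (m[i][n] - tail) / m[i][i]
    return x


def fit_meta_model(base_predictions, targets):
    """Learn least-squares weights for combining baseline outputs.

    base_predictions: one row per example, one column per baseline model.
    Solves the normal equations (P^T P) w = P^T y.
    """
    k = len(base_predictions[0])
    ptp = [[sum(row[i] * row[j] for row in base_predictions) for j in range(k)]
           for i in range(k)]
    pty = [sum(row[i] * y for row, y in zip(base_predictions, targets))
           for i in range(k)]
    return solve_linear_system(ptp, pty)


def stacked_predict(weights, base_outputs):
    """Meta-model prediction: weighted sum of the baseline outputs."""
    return sum(w * p for w, p in zip(weights, base_outputs))


def mean_squared_error(predictions, targets):
    return sum((p - y) ** 2 for p, y in zip(predictions, targets)) / len(targets)


def run_demo(seed=0, n_train=2000, n_test=2000, n_base=2):
    """Simulate noisy baseline predictors and stack them (toy data only)."""
    rng = random.Random(seed)

    def sample(n):
        # Assumption: each baseline output is the true value plus unit Gaussian noise.
        targets = [rng.uniform(0.0, 20.0) for _ in range(n)]
        base = [[y + rng.gauss(0.0, 1.0) for _ in range(n_base)] for y in targets]
        return base, targets

    train_base, train_y = sample(n_train)
    test_base, test_y = sample(n_test)

    weights = fit_meta_model(train_base, train_y)  # meta-model fitted on one split
    stacked = [stacked_predict(weights, row) for row in test_base]
    base_mse = [mean_squared_error([row[j] for row in test_base], test_y)
                for j in range(n_base)]
    return {
        "weights": weights,
        "base_mse": base_mse,
        "stacked_mse": mean_squared_error(stacked, test_y),
    }


if __name__ == "__main__":
    result = run_demo()
    print("meta-model weights:", result["weights"])
    print("baseline MSEs:", result["base_mse"])
    print("stacked MSE:", result["stacked_mse"])
```

In this toy setting the errors of the simulated baselines are independent, so the learnt combination averages them out and gives a lower held-out error than either baseline alone. With real predictors, the baseline outputs used to fit the meta-model should come from examples the baselines were not trained on, as in Wolpert's stacked generalization (7). How to define the combination step for whole structures is the open problem noted in the Background.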