Protein Structure Prediction using contact predictions and other tools
Title: Protein Structure Prediction using contact predictions and other tools
DNr: SNIC 2015/10-12
Project Type: SNIC Large Compute
Principal Investigator: Arne Elofsson <arne@bioinfo.se>
Affiliation: Stockholms universitet
Duration: 2015-07-01 – 2016-07-01
Classification: 10203 10601 30199
Homepage: http://bioinfo.se/
Keywords:

Abstract

Here, we apply for resources to continue our development and use of methods for protein structure prediction. Over the last two years we have developed methods that significantly outperform earlier contact prediction methods. In particular, the second method, PconsC2, is in our view quite innovative. These methods have now reached a state where we are starting to apply them on a massive scale, with the ultimate goal of predicting the structure of all proteins in a cell. For preliminary results, see http://ae.scilifelab.se/pfam.mirco.dTA/ In PconsC2 (and PconsC3), we use a deep learning approach to significantly improve contact predictions: PPV values are 50% better than with our earlier methods and almost 80% better than with any other method. We have also developed a method to fold proteins, PconsFold, and are currently improving it.

Although the basic methods work, development is still needed. In addition to producing better models, there are a number of "special cases" (repeat proteins, protein interactions, disordered proteins, etc.) for which we have to improve the methods. In short, many of these problems boil down to (i) inaccurate multiple sequence alignments, (ii) inaccurate identification of orthologous proteins, and (iii) the difficulty of folding large proteins. We are just starting to examine ways to address these problems.

My group consists of 10-15 people, and many of these students rely on heavy computing for at least part of their projects. As can be seen from our activity logs, in many months we have exhausted our allocated resources (at least on one of the systems) within the first few days of the month. This has created severe bottlenecks in our development. Fortunately, we have still had access to Ferlin and cloud resources (at PDC and the European cloud) and have used these as emergency systems, and we have also used allocations from our collaborators. This is a natural part of running a research group focused on methods development, but having to run on five different systems causes extra work that it would be very convenient to avoid.

Given that on average perhaps five students are developing methods, and that for efficient turnaround each needs access to about 10 nodes around the clock, we need approximately 900k core hours. As can be seen, on many days we have used about 30k core hours per day on both Tintin and Triolith; with half of the days at that level, we would use about 900k core hours (a rough breakdown of this estimate is sketched below). One addition compared to last year is that we plan to start using GPU-based deep learning strategies for two problems: contact prediction as in PconsC, and quality prediction as in ProQ. We therefore also ask for time to examine these possibilities.
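The following is a minimal back-of-the-envelope sketch of the arithmetic behind the requested allocation, not part of the application itself. It assumes 16 cores per node and a 30-day allocation month (neither is stated above); the extrapolation from observed peak days reproduces the ~900k figure, and the continuous-use estimate for five students lands in the same range.

```python
# Assumptions (not stated in the application): 16 cores per node on both
# Tintin and Triolith, and a 30-day allocation month.
CORES_PER_NODE = 16      # assumption
NODES_PER_STUDENT = 10   # from the application
STUDENTS = 5             # from the application
HOURS_PER_DAY = 24
DAYS_PER_MONTH = 30      # assumption

# Estimate 1: five students each keeping 10 nodes busy around the clock.
continuous_use = (STUDENTS * NODES_PER_STUDENT * CORES_PER_NODE
                  * HOURS_PER_DAY * DAYS_PER_MONTH)
print(f"Continuous use:         {continuous_use:,} core hours/month")   # 576,000

# Estimate 2: observed peak usage of ~30k core hours/day on each of the
# two systems, sustained on roughly half of the days in a month.
peak_extrapolation = 30_000 * 2 * (DAYS_PER_MONTH // 2)
print(f"Peak-day extrapolation: {peak_extrapolation:,} core hours/month")  # 900,000
```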