Deep learning for high-throughput data
Title: Deep learning for high-throughput data
DNr: NAISS 2024/6-246
Project Type: NAISS Medium Storage
Principal Investigator: Mika Gustafsson <mika.gustafsson@liu.se>
Affiliation: Linköpings universitet
Duration: 2024-08-30 – 2025-09-01
Classification: 10203
Homepage: https://gitlab.com/Gustafsson-lab
Keywords:

Abstract

Our objective is to develop a robust multi-omic data integration tool that captures disease-specific structures across various biological data levels, facilitating the identification of processes related to disease outcome severity and treatment responses. This proposal outlines our planned advancements and the specific steps we will undertake to achieve these goals. In complex diseases, common drugs are often ineffective for certain subgroups of patients because of the interplay of numerous small-effect genetic and epigenetic factors. New biotechnological methods, particularly omics, have made it possible to measure the molecular imprints of entire cells, paving the way for more individualized therapies. Deep auto-encoders (DAEs), a type of artificial neural network, have recently emerged as effective tools for summarizing high-dimensional genomics data through flexible non-linear dimension reduction. We will update our existing RNA-seq auto-encoders by incorporating over 300,000 samples. This enhancement will leverage the latest methodologies to improve the accuracy and robustness of our disease outcome predictions (Dwivedi et al., manuscript). Additionally, we will create functional embeddings for DNA methylation (Martinez et al., Briefings in Bioinformatics). These embeddings will provide a comprehensive view of the epigenetic landscape, enabling the identification of key biomarkers. In collaboration with Prof. Jörnsten from Chalmers, we will enhance our data integration tool by incorporating protein interaction networks more efficiently. Using a new optimization procedure, we aim to increase the functional relevance of our embeddings (Dwivedi et al., draft manuscript). Working with Prof. Wallner and Assoc. Prof. Mirabello, we will integrate AlphaFold predictions into our embeddings. This integration is expected to improve the coverage and accuracy of protein interactions, reducing study bias.
We have already mapped approximately 25,000 interactions and aim to increase this number to more than 1 million (Sandås, ongoing work). To go from the AlphaFold predictions to interactions, we will train a convolutional neural network that assigns a confidence score to each interaction. The DNA methylation embeddings will be used within a KAW-funded prototype project (Gustafsson, WALP) to serve as rich features for predictive health. This application aims to build a trait and disease risk classifier platform that interprets stress marks in DNA. Our proposed work leverages cutting-edge biotechnology and machine learning techniques to develop a comprehensive multi-omic data integration tool. By enhancing our current methodologies and incorporating new data types and interactions, we aim to significantly improve the accuracy and robustness of disease outcome predictions and treatment response analyses, ultimately contributing to the development of more individualized therapies.