A Variational Auto Encoder for Classification of Cancer Samples
DNr: Berzelius-2023-232
Project Type: LiU Berzelius
Principal Investigator: Lukas Käll <lukas.kall@scilifelab.se>
Affiliation: Kungliga Tekniska högskolan
Duration: 2023-09-08 – 2024-04-01
Classification: 10203
Homepage: https://kaell.org/


Variational Autoencoders (VAEs) have emerged as powerful tools for learning low-dimensional representations of complex, high-dimensional data. They are well suited to capturing the underlying manifold on which the data points lie, offering an unsupervised approach to feature extraction and data characterization. This study employs VAEs to analyze and interpret blood plasma samples obtained from cancer patients participating in the Uniting Cancer and Artificial Intelligence Now (UCAN) project, which incorporates extensive OLINK Proteomics measurements of plasma proteins.

Objective
The aim of this project is threefold: 1) to characterize the molecular heterogeneity among different types of cancer, 2) to identify potential biomarkers that could assist in early diagnosis and targeted treatment, and 3) to examine the prognostic capabilities of the learned low-dimensional features for different types of cancer.

Methodology
To achieve these goals, a VAE model is trained on a comprehensive dataset comprising blood plasma samples from patients with multiple cancer types, such as breast, lung, colorectal, and prostate cancer. The high-dimensional OLINK data, which comprise 1,500 plasma protein measurements, are condensed into a lower-dimensional latent space through the VAE framework. The architecture of the VAE is chosen to preserve the critical features of the original dataset while enabling efficient data compression. Several latent dimensionalities are tested to find the setting that best captures the biological variance.

Preliminary Results
Initial analyses indicate that the lower-dimensional representations successfully differentiate between the various types of cancer. Furthermore, unsupervised clustering techniques applied to the VAE-generated latent space reveal distinct clusters corresponding to individual cancer types.
When aligned with clinical data, these clusters also appear to carry prognostic significance, correlating with factors such as disease stage, overall survival, and treatment response. Moreover, feature importance techniques are applied to the VAE model to identify which plasma proteins most strongly influence the learned representations. These proteins are then evaluated for their potential as biomarkers, and several promising candidates have been identified for further validation through targeted proteomic assays.

Significance and Future Work
This study demonstrates the efficacy of applying VAEs for feature extraction and characterization in the highly complex and heterogeneous landscape of cancer biology. By leveraging the UCAN dataset, which is one of the most diverse and comprehensive collections of cancer-related blood plasma measurements to date, our findings offer new avenues for personalized medicine. The model and methodologies introduced in this project lay the groundwork for future research, including integration with other omics data types such as genomics and transcriptomics.

In summary, the project establishes a robust framework for using VAEs in cancer research, providing a path towards more effective and personalized cancer diagnosis, treatment, and prognosis. The potential impact includes accelerating the identification of novel biomarkers, elucidating the molecular heterogeneity of cancers, and ultimately helping clinicians make more informed decisions for patient care.
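
To make the Methodology concrete, the following is a minimal sketch of the standard VAE objective applied to a 1,500-dimensional protein vector. It is not the project's implementation: the linear encoder and decoder, the latent size of 16, the unit-variance Gaussian reconstruction term, and the synthetic input are all assumptions chosen for illustration, and no training loop is shown.

```python
import numpy as np

rng = np.random.default_rng(0)

N_PROTEINS = 1500  # input dimension, taken from the abstract
LATENT_DIM = 16    # hypothetical; the abstract does not state the value used


def init_params(n_in, n_latent, rng, scale=0.01):
    """Small random weights for a linear encoder and a linear decoder."""
    return {
        "W_mu": rng.normal(0.0, scale, (n_in, n_latent)),
        "b_mu": np.zeros(n_latent),
        "W_logvar": rng.normal(0.0, scale, (n_in, n_latent)),
        "b_logvar": np.zeros(n_latent),
        "W_dec": rng.normal(0.0, scale, (n_latent, n_in)),
        "b_dec": np.zeros(n_in),
    }


def encode(x, p):
    """Map samples to the mean and log-variance of q(z | x)."""
    mu = x @ p["W_mu"] + p["b_mu"]
    logvar = x @ p["W_logvar"] + p["b_logvar"]
    return mu, logvar


def reparameterize(mu, logvar, rng):
    """Draw z = mu + sigma * eps with eps ~ N(0, I)."""
    eps = rng.standard_normal(mu.shape)
    return mu + np.exp(0.5 * logvar) * eps


def decode(z, p):
    """Map latent codes back to protein space."""
    return z @ p["W_dec"] + p["b_dec"]


def kl_to_standard_normal(mu, logvar):
    """KL(N(mu, diag(exp(logvar))) || N(0, I)), one value per sample."""
    return 0.5 * np.sum(np.exp(logvar) + mu**2 - 1.0 - logvar, axis=1)


def vae_loss(x, p, rng):
    """Negative ELBO (up to an additive constant), averaged over the batch."""
    mu, logvar = encode(x, p)
    z = reparameterize(mu, logvar, rng)
    x_hat = decode(z, p)
    # Unit-variance Gaussian likelihood: 0.5 * squared error per sample.
    recon = 0.5 * np.sum((x - x_hat) ** 2, axis=1)
    kl = kl_to_standard_normal(mu, logvar)
    return float(np.mean(recon + kl)), mu


# Synthetic stand-in for standardized protein measurements (not UCAN data).
x = rng.standard_normal((8, N_PROTEINS))
params = init_params(N_PROTEINS, LATENT_DIM, rng)
loss, latent_means = vae_loss(x, params, rng)
```

Here `latent_means` plays the role of the low-dimensional representation that would be passed to clustering or prognostic analysis. Repeating the fit for several values of `LATENT_DIM` corresponds to the search over latent dimensionalities described in the Methodology.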