A foundation model for mass spectra of peptides
Title: A foundation model for mass spectra of peptides
DNr: Berzelius-2024-32
Project Type: LiU Berzelius
Principal Investigator: Lukas Käll <lukas.kall@scilifelab.se>
Affiliation: Kungliga Tekniska högskolan
Duration: 2024-01-30 – 2024-08-01
Classification: 10203


Objective: The primary objective of this project is to develop a transformer-based foundation model optimized for analyzing the mass spectra of peptides. We intend to use the attention mechanisms of transformer architectures to capture relationships within spectral data, without requiring explicit knowledge of the peptide sequence.

Background: In recent years, transformer-based models have emerged as powerful tools for tasks across many domains. Although their ability to learn patterns and representations from data is well established, their application to mass spectra, especially those of peptides, remains underexplored. By building such a model, we aim to enable several downstream applications that could significantly benefit the field of mass spectrometry.

Collaboration: This work is being undertaken in collaboration with Mathias Wilhelm's lab at the Technical University of Munich (TUM), drawing on their expertise in the domain to ensure the efficacy and relevance of our foundation model.

Potential Applications:
* De Novo Sequencing of Spectra - Determining the amino acid sequences of peptides directly from their spectra, without prior knowledge of the sequence.
* Spectra Clustering - Efficient grouping of mass spectra based on inherent patterns and similarities.
* Improved Peptide Spectrum Match (PSM) Formation - Enhancing the accuracy and reliability of matching observed spectra to known sequences.

Data Collection: We plan to collate a comprehensive dataset of MS2 spectra from the PRIDE and ProteomicsDB MS data repositories, covering diverse instrumentation, fragmentation mechanisms, and originating peptides.

Model Training: Our foundation model will be trained on this dataset. The primary focus will be to ensure that the model can interpret the varied complexities of spectra while disentangling confounding factors.
We will adopt a masked prediction approach for the training task, in which portions of each spectrum are hidden from the model. We expect this approach to encourage the model to learn a holistic and accurate representation of spectra.

Expected Outcomes: Upon successful completion, we anticipate that our foundation model will establish a new benchmark in mass spectrometry analysis. We believe that this model can play an instrumental role in advancing peptide research and mass spectrometry techniques.
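
Illustrative sketch: The masked prediction task described under Model Training can be illustrated with a minimal Python example. It is a sketch under stated assumptions, not the project's implementation: the proposal does not specify what is masked or how the loss is defined, so the choice to hide individual peak intensities, the 15% default fraction, the placeholder value, and the names mask_peaks and masked_mse are all hypothetical.

```python
import random

# Assumption: hidden peak intensities are overwritten with this placeholder.
MASK_VALUE = 0.0


def mask_peaks(intensities, mask_fraction=0.15, seed=None):
    """Hide a random subset of peak intensities in one spectrum.

    Returns the masked intensity list and the sorted indices of the
    hidden peaks. At least one peak is always hidden.
    """
    if not intensities:
        raise ValueError("spectrum has no peaks")
    if not 0.0 < mask_fraction < 1.0:
        raise ValueError("mask_fraction must lie strictly between 0 and 1")

    rng = random.Random(seed)
    n_peaks = len(intensities)
    n_hidden = max(1, round(mask_fraction * n_peaks))
    hidden = sorted(rng.sample(range(n_peaks), n_hidden))
    hidden_set = set(hidden)

    masked = [
        MASK_VALUE if i in hidden_set else value
        for i, value in enumerate(intensities)
    ]
    return masked, hidden


def masked_mse(predicted, target, hidden):
    """Mean squared error evaluated only on the hidden peaks."""
    return sum((predicted[i] - target[i]) ** 2 for i in hidden) / len(hidden)


if __name__ == "__main__":
    # Toy spectrum: ten normalized peak intensities (made-up values).
    spectrum = [0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 1.0]
    masked, hidden = mask_peaks(spectrum, mask_fraction=0.3, seed=0)
    # A model would take `masked` as input and be scored on `hidden` only.
    print("hidden peak indices:", hidden)
    print("loss if the model just echoes its input:",
          masked_mse(masked, spectrum, hidden))
```

Scoring the loss only on the hidden peaks means the model gains nothing by copying its visible input and has to infer the missing signal from the rest of the spectrum.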