SDRF-Driven, Reproducible DIA-MS Reanalysis of Human Proteomes
| Title: | SDRF-Driven, Reproducible DIA-MS Reanalysis of Human Proteomes |
| DNr: | NAISS 2025/3-66 |
| Project Type: | NAISS Large |
| Principal Investigator: | Fredrik Edfors <edfors@kth.se> |
| Affiliation: | Kungliga Tekniska högskolan |
| Duration: | 2026-01-01 – 2026-07-01 |
| Classification: | 30113 40303 10401 |
| Keywords: | |
Abstract
We will re-analyze large public data-independent acquisition mass spectrometry (DIA-MS) datasets from human tissues, cell lines and biofluids to generate harmonized protein quantifications, QC summaries and reusable metadata for integration with the Human Protein Atlas (HPA). The pipeline centers on QUANTMS (Nextflow), using DIA-NN for DIA analysis with mzML conversion where needed, and is executed reproducibly via Apptainer/Singularity.
Major bottlenecks in large-scale reanalysis of LC-MS/MS data include file conversion, metadata gaps, and workflow restarts. We address these in three steps: (i) systematic SDRF-Proteomics curation for every dataset; (ii) version-pinned containers and manifests; and (iii) checkpointed, array-based job orchestration on NAISS. Target outputs include per-study peptide/protein matrices, standardized QC, and SDRF files enabling downstream biological interpretation and benchmarking. No sensitive personal data is processed: all inputs are public ProteomeXchange datasets, and only pseudonymous file identifiers and experimental factors are handled. NAISS Large resources will provide the throughput and storage headroom required for approximately 50 TB of RAW inputs and ~200 TB of intermediates at multi-million-file scale, with rolling cleanup after each study. Results will be disseminated via HPA-linked resources and open repositories together with methods and SDRF metadata.
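As an illustration of how steps (i) and (iii) could fit together, the following minimal Python sketch derives a per-study file manifest from an SDRF table, splits it into array-task chunks, and skips chunks that already have a completion marker. It is not the project's actual orchestration code: the function names, the chunk size, and the `task_<index>.done` marker convention are all hypothetical. The only thing taken from the SDRF-Proteomics format is the `comment[data file]` column.

```python
import csv
import io
from pathlib import Path


def data_files_from_sdrf(sdrf_text: str) -> list[str]:
    """Return the unique data file names from an SDRF table, in first-seen order."""
    rows = list(csv.reader(io.StringIO(sdrf_text), delimiter="\t"))
    if not rows:
        return []
    header = [h.strip().lower() for h in rows[0]]
    # Raises ValueError if the SDRF has no data file column (a metadata gap).
    col = header.index("comment[data file]")
    names = [r[col].strip() for r in rows[1:] if len(r) > col and r[col].strip()]
    # dict.fromkeys removes duplicates while preserving order.
    return list(dict.fromkeys(names))


def chunk(items: list[str], size: int) -> list[list[str]]:
    """Split the manifest into fixed-size chunks, one per array task."""
    if size < 1:
        raise ValueError("size must be at least 1")
    return [items[i:i + size] for i in range(0, len(items), size)]


def marker(index: int, checkpoint_dir: Path) -> Path:
    """Path of the (hypothetical) completion marker for one array task."""
    return checkpoint_dir / f"task_{index}.done"


def pending_tasks(chunks: list[list[str]], checkpoint_dir: Path) -> list[int]:
    """Indices of array tasks that have no completion marker yet."""
    return [i for i in range(len(chunks)) if not marker(i, checkpoint_dir).exists()]


def mark_done(index: int, checkpoint_dir: Path) -> None:
    """Record that an array task finished, so a restart can skip it."""
    checkpoint_dir.mkdir(parents=True, exist_ok=True)
    marker(index, checkpoint_dir).touch()
```

On a Slurm-based system, each array task would typically read its chunk index from the `SLURM_ARRAY_TASK_ID` environment variable and call `mark_done` only after its outputs are written. A restarted run then resubmits only the indices returned by `pending_tasks`.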