Subcellular Protein Localization in the Human Protein Atlas
Abstract
While spatial proteomics by fluorescence imaging has quickly become an essential discovery tool for researchers, fast and scalable methods to classify and embed single-cell protein distributions in such images are lacking. Subcellular protein localization is key to understanding the function of different cellular systems and is essential in disease characterization and drug discovery. As the volume of data generated by digital microscopy grows rapidly, computational models capable of encoding the protein distribution within cells are needed to understand how that distribution contributes to cell function and state.
The Subcellular Section of the Human Protein Atlas (HPA v23) has generated the first subcellular proteome map of human cells: a publicly available dataset of more than 100,000 images showing the expression of around 13,000 proteins (the majority of intracellular proteins) across more than 20 cell lines. In total, the images contain roughly 1.5 million cells and amount to up to 25 TB of data.
Recently, deep-learning approaches have shown great promise in the field and have become the state of the art in protein localization. We aim to use Berzelius’s resources to develop a single-cell foundation model from the images in the HPA dataset. This model will serve as the basis for subcellular localization of proteins in cell images and provide substantial value to HPA users. Because of the size of the dataset, we require significant computing and storage resources from Berzelius to undertake this task.
UPDATE FOR CONTINUATION: Our preliminary data suggest that this model improves on the current state of the art in multiple tasks. We therefore want to train a larger model and expand the number of benchmark datasets on which we evaluate its performance.