Training large language models with single-cell genomics to learn regulatory grammar
Title: Training large language models with single-cell genomics to learn regulatory grammar
DNr: Berzelius-2024-327
Project Type: LiU Berzelius
Principal Investigator: Rickard Sandberg <rickard.sandberg@ki.se>
Affiliation: Karolinska Institutet
Duration: 2024-09-01 – 2025-03-01
Classification: 30401
Homepage: https://sandberglab.se/
Keywords:

Abstract

Recently, foundation models for DNA and RNA sequences have been developed by researchers and industry. Trained on genomic sequence data from humans or across mammals, they have learned positional patterns among short DNA sequences that are important for functional regulation in cells (e.g. at the transcriptional and post-transcriptional levels). The Sandberg lab has unique single-cell data with splicing information, and we would like to re-train (or fine-tune) these foundation models for the task of learning cell-type-specific splicing regulation. In addition, access to GPUs at Berzelius would enable us to interactively explore large-context models that require more memory than laptop or desktop GPUs provide.