Training large language models with single-cell genomics to learn regulatory grammar
	  
	  
| Title: | Training large language models with single-cell genomics to learn regulatory grammar | 
| DNr: | Berzelius-2024-327 | 
| Project Type: | LiU Berzelius | 
| Principal Investigator: | Rickard Sandberg <rickard.sandberg@ki.se> | 
| Affiliation: | Karolinska Institutet | 
| Duration: | 2024-09-01 – 2025-03-01 | 
| Classification: | 30401 | 
| Homepage: | https://sandberglab.se/ | 
| Keywords: |  | 
  Abstract
  Recently, foundation models for DNA and RNA sequences have been developed by researchers and industry. Having been trained on genomic sequence data from humans or across mammals, they have learned positional patterns between short DNA sequences that are important for functional regulation in cells (e.g. at the transcriptional and post-transcriptional levels). The Sandberg lab has unique single-cell data with splicing information, and we would like to re-train (or fine-tune) these foundation models for the task of learning cell-type-specific splicing regulation. In addition, access to GPUs at Berzelius would enable us to interactively explore large-context models that require more memory than laptop or desktop GPUs provide.