Accelerating immunoglobulin gene annotations
Benjamin Murrell <firstname.lastname@example.org>
2023-11-04 – 2024-05-01
One strategy for profiling the immune landscape of an individual is "immune repertoire sequencing", which deeply sequences the genes that encode B-cell and T-cell receptors, generating millions of sequences from a single sample.
The process of immune receptor formation is complex: it begins with the recombination of a number of germline "Variable", "Diversity" and "Joining" genes, each of which can originate from a large number of distinct alleles. Any downstream analysis therefore begins with a formidable annotation task, which assigns each sampled read to its putative originating V, D, and J alleles and annotates the recombination breakpoints.
This is usually approached using alignment-based strategies (see e.g. https://www.imgt.org/IMGTindex/V-QUEST.php or https://www.ncbi.nlm.nih.gov/igblast/ for commonly used tools), but the motivation behind this proposed project is to explore strategies that replace alignment algorithms with deep learning. This has the potential to capture effects that cannot be observed with alignment-based approaches and, importantly, can also massively accelerate the rate at which data can be processed, since the proposed models can exploit GPU acceleration.
Regarding data privacy, we will use Berzelius only for training, which is performed on simulated data (where the ground truth is known) or on publicly available sequence data that raises no privacy concerns. With models trained on Berzelius, we can then run inference locally on datasets that do have privacy concerns.
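To illustrate what "simulated data where the ground truth is known" can look like, here is a minimal toy sketch. It is not the project's simulator: the germline sequences, allele names, and the `simulate_read` function are hypothetical, somatic hypermutation and sequencing error are ignored, and Python is used only for illustration (the project's own models are developed in Julia). Each simulated read is returned together with its originating V, D, and J labels and the recombination breakpoints, which is the kind of target an annotation model could be trained against.

```python
import random

# Hypothetical toy germline segments; real alleles are much longer
# and far more numerous.
GERMLINE = {
    "V": {"V1*01": "CAGGTGCAGCTGGTG", "V2*01": "GAGGTCCAGCTGGTA"},
    "D": {"D1*01": "GGTATAGC", "D2*01": "AGCAGCTG"},
    "J": {"J1*01": "ACTGGTTCGACCCC", "J2*01": "ACTACTTTGACTAC"},
}


def simulate_read(rng, max_trim=3, max_insert=4):
    """Return (sequence, labels) for one toy V(D)J rearrangement."""
    # Pick one allele per segment type.
    labels = {seg: rng.choice(sorted(GERMLINE[seg])) for seg in "VDJ"}
    v = GERMLINE["V"][labels["V"]]
    d = GERMLINE["D"][labels["D"]]
    j = GERMLINE["J"][labels["J"]]

    # Trim the recombining ends of each segment.
    v = v[: len(v) - rng.randint(0, max_trim)]
    d = d[rng.randint(0, 2): len(d) - rng.randint(0, 2)]
    j = j[rng.randint(0, max_trim):]

    # Random non-templated insertions at the two junctions.
    n1 = "".join(rng.choice("ACGT") for _ in range(rng.randint(0, max_insert)))
    n2 = "".join(rng.choice("ACGT") for _ in range(rng.randint(0, max_insert)))

    # Record breakpoints as 0-based, end-exclusive positions in the read.
    labels["v_end"] = len(v)
    labels["d_start"] = labels["v_end"] + len(n1)
    labels["d_end"] = labels["d_start"] + len(d)
    labels["j_start"] = labels["d_end"] + len(n2)
    return v + n1 + d + n2 + j, labels


if __name__ == "__main__":
    read, truth = simulate_read(random.Random(0))
    print(read)
    print(truth)
```

Because every label is produced by the simulator itself, no private sample data is involved at any point in generating such training examples.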
We believe the standard resource allocation to be sufficient for our current project. Our models are developed in Julia, and we have already tested that the frameworks we use are compatible with the Berzelius infrastructure through project Berzelius-2023-277.