Helicase HMM modelling of DNA Nanopore data
Reductions in the time and cost of DNA sequencing are about to enable clinical DNA sequencing for personalized medicine, mainly due to the third generation of DNA sequencing technologies, which rely on ionic current measurements through biological nanopores. The next step in sequencing technology will likely be based on solid-state nanopores, further reducing sequencing costs and accelerating the clinical use of DNA sequencing. Nanopore sequencing relies on the modulation of an ionic current by the individual nucleobases of a single-stranded DNA molecule passing through the nanopore. Still, interpreting the modulated current relies on advanced machine learning techniques. The main difficulties of interpreting the modulated current, the so-called base-calling step, stem from: 1) the stochastic translocation speed of the DNA through the nanopore; 2) the simultaneous influence of several neighboring nucleobases on the current through the pore; 3) the similar sizes and current signatures of the purine bases (A and G) and of the pyrimidine bases (C and T); and 4) the fact that long sequences of the same nucleotide, i.e., homopolymers, lead to only minor variations in the measured current, forcing reliance on accurate statistical models of translocation.
The current state-of-the-art base-calling for Oxford Nanopore Technologies (ONT) data relies on deep learning, often using recurrent neural networks (RNNs) coupled with a connectionist temporal classification (CTC) block to account for the stochastic DNA translocation speed. Explicit-duration hidden Markov models (ED-HMMs) are another viable approach. We have previously explored a hidden neural network approach with the help of computational resources from the Berzelius cluster. This approach uses a hybrid model in which a neural (match) network with 15 million weights replaces the HMM observation probabilities, and we trained the full model, including the HMM, end-to-end using the sequence likelihood computed via graphical models as the (negative) loss. To ensure efficient computations, we used custom-written CUDA kernels for all HMM operations (typically sparse forward-backward algorithms), integrated into TensorFlow for end-to-end training. The paper describing these efforts is available at https://bmcbioinformatics.biomedcentral.com/articles/10.1186/s12859-023-05580-x
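To make the training objective concrete, the following is a minimal sketch of a sparse forward recursion in log space, whose result is the sequence log-likelihood used (negated) as the loss. It is an illustration only, not the CUDA kernels or the model from the paper: the function names, the toy two-state HMM, and all probabilities are assumptions, and the per-sample observation scores that the neural network would supply are replaced here by fixed numbers.

```python
import math


def logsumexp(values):
    """Numerically stable log(sum(exp(v))) for a non-empty list of log-values."""
    m = max(values)
    if m == -math.inf:
        return -math.inf
    return m + math.log(sum(math.exp(v - m) for v in values))


def forward_log_likelihood(log_init, incoming, log_obs):
    """Forward algorithm over a sparse transition structure.

    log_init : list of log initial-state probabilities, one per state
    incoming : dict mapping destination state -> list of (source, log_prob);
               transitions that are not listed are treated as impossible
    log_obs  : log_obs[t][s] is the log observation score of state s at time t
               (in a hybrid model these scores would come from the network)
    """
    n_states = len(log_init)
    alpha = [log_init[s] + log_obs[0][s] for s in range(n_states)]
    for t in range(1, len(log_obs)):
        new_alpha = []
        for dst in range(n_states):
            terms = [alpha[src] + lp for src, lp in incoming.get(dst, [])]
            score = logsumexp(terms) if terms else -math.inf
            new_alpha.append(score + log_obs[t][dst])
        alpha = new_alpha
    return logsumexp(alpha)


# Hypothetical toy model: state 0 may stay or move to state 1; state 1 only stays.
log_init = [math.log(0.6), math.log(0.4)]
incoming = {
    0: [(0, math.log(0.7))],
    1: [(0, math.log(0.3)), (1, math.log(1.0))],
}
obs = [[0.5, 0.1], [0.4, 0.3], [0.2, 0.9]]  # made-up observation probabilities
log_obs = [[math.log(p) for p in row] for row in obs]

log_likelihood = forward_log_likelihood(log_init, incoming, log_obs)
loss = -log_likelihood  # negative log-likelihood, minimized during training
```

Only the listed transitions are visited, so the cost per time step scales with the number of allowed transitions rather than with the square of the number of states, which is what makes sparse recursions attractive for very large state spaces.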
Now we wish to explore an even more advanced idea in which we explicitly model the ratcheting helicase molecule with a state-space machine within the HMM. The helicase is responsible for splitting the double helix and feeding a single-stranded DNA molecule through the nanopore, and modelling this molecule allows us to use its statistical properties in the basecalling process. We believe this will improve our ability to basecall homopolymers, which is arguably the most challenging problem in basecalling nanopore data and is also of high clinical and scientific importance. With the latest chemistries (R10), we aim for a so-called 10-mer model with 5 internal helicase states, resulting in an HMM with over 5 million states. Learning this model requires substantial computational resources and large amounts of sequencing data, which is why access to the Berzelius cluster is essential.
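The state count follows from simple arithmetic if each of the 4^10 = 1,048,576 possible 10-mers is paired with each of the 5 helicase states, giving 4^10 × 5 = 5,242,880 states. The sketch below checks this and shows one possible way to index such a product state space. The product structure is inferred from the numbers above, and the index layout and function names are hypothetical, not the layout used in the actual model.

```python
BASES = "ACGT"
K = 10                 # k-mer length of the pore model
N_HELICASE_STATES = 5  # internal helicase states per k-mer


def num_states(k=K, n_helicase_states=N_HELICASE_STATES):
    """Total number of HMM states, assuming one state per (k-mer, helicase state) pair."""
    return len(BASES) ** k * n_helicase_states


def encode_state(kmer, helicase_state, n_helicase_states=N_HELICASE_STATES):
    """Map a (k-mer, helicase state) pair to a flat integer index (hypothetical layout)."""
    kmer_index = 0
    for base in kmer:
        kmer_index = kmer_index * len(BASES) + BASES.index(base)
    return kmer_index * n_helicase_states + helicase_state


def decode_state(index, k=K, n_helicase_states=N_HELICASE_STATES):
    """Inverse of encode_state: recover the k-mer string and the helicase state."""
    kmer_index, helicase_state = divmod(index, n_helicase_states)
    bases = []
    for _ in range(k):
        kmer_index, b = divmod(kmer_index, len(BASES))
        bases.append(BASES[b])
    return "".join(reversed(bases)), helicase_state


total_states = num_states()
```

A flat integer index of this kind is convenient for sparse forward-backward computations, since transitions can then be stored as lists of source and destination indices.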