Hidden neural networks for nanopore DNA sequencing
Reductions in the time and cost of DNA sequencing are about to enable clinical DNA sequencing for personalized medicine, particularly due to the third generation of sequencing technologies that rely on ionic-current measurements through biological nanopores. The next step in sequencing technology is likely to be based on solid-state nanopores, which would reduce sequencing cost even further and accelerate the clinical use of DNA sequencing. Nanopore sequencing relies on the modulation of an ionic current by the individual nucleobases of a single-stranded DNA molecule passing through the nanopore. The interpretation of the modulated current, however, relies on advanced machine learning. The main difficulties of this interpretation, the so-called base-calling step, stem from: 1) the stochastic translocation speed of the DNA through the nanopore; 2) the simultaneous influence of several neighboring nucleobases on the current through the pore; and 3) the similar sizes and current signatures of the purine bases (A and G) and of the pyrimidine bases (C and T).
We are currently working within the Swedish Research Council (VR) funded interdisciplinary research environment grant QuantumSense (grant no. 2018-06169) to realize solid-state nanopore sequencing technology. Within this project, which also involves research in nanotechnology, microfluidics, surface chemistry, and genomics, my group is responsible for developing the signal processing and machine learning methods needed to interpret the modulated currents. Since the solid-state nanopores are still under development, we currently work with commercialized biological nanopores from Oxford Nanopore Technologies (ONT), for which data is available at scale. The aim is to use ONT data as a model system to reduce the development time of the envisioned solid-state nanopore technology.
The current state of the art in base-calling for ONT data relies on deep learning, often using recurrent neural networks (RNNs) coupled with a connectionist temporal classification (CTC) block to account for the stochastic DNA translocation speed. We believe that a structure known as an explicit-duration hidden Markov model (ED-HMM), coupled with a deep neural network implementation of a so-called match network replacing the classical observation probabilities, is another viable approach. This combination is also known as a hidden neural network in natural language processing (NLP). A particularly salient feature of this approach is that we can explicitly model the stochastic process controlling DNA translocation, accounting for time-varying mean translocation speeds throughout a sequencing experiment. It is also possible to train the neural networks end-to-end using the data likelihood computed by the HMM as the (negative) loss function.
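The core of this training scheme can be illustrated with the standard forward algorithm in the log domain: the HMM's data log-likelihood is computed by the forward recursion, with the observation scores supplied by a network, and its negation serves as the training loss. The sketch below is a minimal, framework-free illustration of that idea; `match_network` here is a hypothetical stand-in (a fixed Gaussian-like scoring function) for the actual deep match network, and the small dense HMM is a toy, not the large sparse ED-HMM described in the text.

```python
import math

def logsumexp(xs):
    """Numerically stable log(sum(exp(xs)))."""
    m = max(xs)
    return m + math.log(sum(math.exp(x - m) for x in xs))

def match_network(state, obs):
    # Hypothetical stand-in for the neural match network: in the real
    # model, a deep network maps a state and a raw current sample to a
    # log observation score. Here: a fixed Gaussian-like log score.
    return -0.5 * (obs - state) ** 2

def neg_log_likelihood(observations, n_states, log_trans, log_init):
    # Forward algorithm in the log domain. Its output is the data
    # log-likelihood, whose negation is used as the training loss;
    # in an autodiff framework, gradients flow through match_network.
    alpha = [log_init[s] + match_network(s, observations[0])
             for s in range(n_states)]
    for obs in observations[1:]:
        alpha = [logsumexp([alpha[sp] + log_trans[sp][s]
                            for sp in range(n_states)])
                 + match_network(s, obs)
                 for s in range(n_states)]
    return -logsumexp(alpha)
```

In a TensorFlow implementation, the same recursion would be expressed with differentiable tensor operations so that minimizing the returned value trains the match network end-to-end.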
We are currently building and training such base-calling inference structures using TensorFlow models with custom CUDA tensor operations for our very large and sparse hidden Markov state-space models. We are very close to state-of-the-art performance on ONT data. Access to the Berzelius cluster would allow us to scale up the neural network models and the size of the training data set to realize the full potential of our novel approach.
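The sparsity that motivates the custom tensor operations can be sketched as follows. Assuming (as in typical nanopore HMMs, though the exact state-space design is not specified above) that each state reaches only a handful of successors, one forward step costs O(edges) rather than O(n_states²). The adjacency-list representation below is an illustrative sketch, not the actual CUDA kernel.

```python
import math

def logsumexp(xs):
    """Numerically stable log(sum(exp(xs)))."""
    m = max(xs)
    return m + math.log(sum(math.exp(x - m) for x in xs))

def sparse_forward_step(alpha, successors, log_probs, log_obs):
    # One forward-recursion step over a sparse state space: each state
    # lists only its reachable successors (e.g. a k-mer state typically
    # has at most four successors, one per next base), so the update
    # touches only existing transitions instead of a dense matrix.
    incoming = [[] for _ in alpha]
    for s, (succs, lps) in enumerate(zip(successors, log_probs)):
        for sp, lp in zip(succs, lps):
            incoming[sp].append(alpha[s] + lp)
    return [logsumexp(inc) + log_obs[sp] if inc else float('-inf')
            for sp, inc in enumerate(incoming)]
```

A GPU kernel performs the same gather-and-reduce over the transition edge list in parallel, which is what the custom CUDA operations amount to for very large state spaces.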