Probabilistic speech synthesis using Neural HMMs
Author: Shivam Mehta <firstname.lastname@example.org>
Institution: Kungliga Tekniska högskolan
Duration: 2022-09-14 – 2023-04-01
Speech synthesis is a one-to-many problem that requires probabilistic sampling to synthesise natural human speech. We have developed a neural-HMM framework that fuses powerful neural networks with probabilistic methods such as hidden Markov models (HMMs). Neural sequence-to-sequence TTS has achieved significantly better output quality than statistical speech synthesis using HMMs, but it is generally not probabilistic and relies on non-monotonic attention; attention failures increase training time and can make synthesis babble incoherently. This project fuses the old and new paradigms to obtain the advantages of both, replacing attention in neural TTS with an autoregressive, left-right, no-skip hidden Markov model defined by a neural network. Based on this proposal, we have modified Tacotron 2 into an HMM-based neural TTS model with monotonic alignment, trained to maximise the full sequence likelihood without approximation. However, the forward algorithm sums over all hidden states at every timestep to compute this likelihood, which lengthens the computation graph (the autograd tape) in frameworks like PyTorch, so training these networks requires powerful computational resources.
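To illustrate the per-timestep summation mentioned above, here is a minimal sketch of the forward algorithm for a left-right no-skip HMM, written in plain Python log-space arithmetic rather than PyTorch. The function name, the representation of emissions as a T×N table of log-probabilities, and the per-state stay/move transition vectors are all illustrative assumptions, not the project's actual implementation; in the real model these quantities would be produced by the neural network and the recursion would run on differentiable tensors, which is precisely what grows the autograd tape.

```python
import math

def logaddexp(a, b):
    # Numerically stable log(exp(a) + exp(b)).
    if a == -math.inf:
        return b
    if b == -math.inf:
        return a
    m = max(a, b)
    return m + math.log(math.exp(a - m) + math.exp(b - m))

def forward_log_likelihood(log_emit, log_stay, log_move):
    """Exact log-likelihood of an observation sequence under a
    left-right no-skip HMM, via the forward algorithm.

    log_emit: T x N table, log p(o_t | state j)  (illustrative input)
    log_stay: N log self-transition probabilities
    log_move: N log next-state transition probabilities
    """
    T, N = len(log_emit), len(log_emit[0])
    # A left-right no-skip model must start in the first state.
    alpha = [-math.inf] * N
    alpha[0] = log_emit[0][0]
    for t in range(1, T):
        new = [-math.inf] * N
        for j in range(N):
            # Each state is reachable only from itself or its left
            # neighbour: the sum over hidden states at every timestep.
            stay = alpha[j] + log_stay[j]
            move = alpha[j - 1] + log_move[j - 1] if j > 0 else -math.inf
            new[j] = log_emit[t][j] + logaddexp(stay, move)
        alpha = new
    # ...and must end in the final state.
    return alpha[N - 1]
```

Because `alpha` at time t depends on `alpha` at time t-1, the T steps cannot be parallelised over time, so sequence length directly determines both the depth of the recursion and, in an autodiff framework, the length of the recorded computation graph.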