Thinking Fast and Slow: Real-time Speech Generation for Conversational AI
Abstract
At a high level, the project aims to make communication with Conversational AI agents more natural by decreasing response times and improving turn-taking. This is to be achieved by several components working in conjunction: an incremental speech synthesis algorithm, a prefix large language model, and a turn-taking module.
Most state-of-the-art (SOTA) conversational interfaces today interact through written chat. Many companies are now trying to build spoken conversational agents based on Large Language Models (LLMs) by connecting an Automatic Speech Recognition (ASR) system for input and a Text-to-Speech (TTS) system for output, and they quickly find that the resulting interaction is far from the kind of interaction humans have with each other. Response times are long, and conversational turn-taking is inflexible. Speech production needs to be planned ahead of time for the prosody to come out naturally, for pauses to fall in the right places, and for the right turn-taking signals to tell the listener that more is coming. Replicating the flexible, real-time generation of speech found in human interaction requires rethinking the speech generation process from both an architectural and a modeling perspective, which is one of the key objectives of this doctoral project.
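To make the latency problem concrete, the sketch below simulates a conventional cascaded pipeline in which each stage blocks until the previous one has fully finished. The function names and stage durations are illustrative assumptions, not measurements from any real system; the point is only that the delays accumulate before the user hears any audio.

```python
import time

# Illustrative cascade: ASR -> LLM -> TTS, each stage waiting for the previous
# one to complete. Durations are made-up placeholders for the sake of the example.

def asr(audio: bytes) -> str:
    time.sleep(0.5)               # wait for end of utterance + recognition
    return "transcribed user turn"

def llm(prompt: str) -> str:
    time.sleep(2.0)               # generate the complete text response
    return "complete assistant response"

def tts(text: str) -> bytes:
    time.sleep(1.0)               # synthesize the whole response before playback starts
    return b"waveform"

start = time.time()
audio_out = tts(llm(asr(b"user audio")))
# Several seconds to the first audible sample, whereas gaps between turns in
# human conversation are typically only a few hundred milliseconds.
print(f"time to first audio: {time.time() - start:.1f} s")
```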
In summary, current SOTA methods are not designed to sustain human-human-like dialogue, which is what we will try to achieve in this project. The system will include the following components, combined in the sketch after the list:
- Prefix LLM - a small LLM that generates fast responses to the interlocutor while the main LLM is still processing the full response.
- Incremental TTS - a model that takes the stream of text coming from the LLM and converts it into speech samples that can be played immediately, keeping the communication between humans and machines smooth.
- Turn-taking model - the module responsible for deciding who will speak next and when, making the spoken interaction more natural.
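The following sketch shows one way the three components could be wired together. All function names, timings, and the word-level granularity are assumptions made for illustration only; they are not the project's actual implementation or APIs.

```python
import queue
import threading
import time
from typing import Iterator

def prefix_llm(user_turn: str) -> str:
    """Small, fast model: produces an immediate short reaction
    while the main LLM is still working (hypothetical)."""
    return "Let me think about that."

def main_llm(user_turn: str, out: "queue.Queue[str | None]") -> None:
    """Large model: streams the full response word by word into a queue."""
    for word in "Here is the detailed answer to your question .".split():
        time.sleep(0.3)          # simulated per-token generation time
        out.put(word)
    out.put(None)                # end-of-response marker

def incremental_tts(words: Iterator[str]) -> Iterator[bytes]:
    """Incremental TTS: converts a text stream into audio chunks that can be
    played as soon as they are produced, instead of waiting for the full text."""
    for word in words:
        yield f"<audio:{word}>".encode()

def turn_taking_ready(silence_ms: float) -> bool:
    """Turn-taking module: decides whether the system should take the turn.
    Here it is just a silence threshold; a real model would also use prosody,
    syntax, and dialogue context."""
    return silence_ms > 200

def respond(user_turn: str) -> None:
    token_queue: "queue.Queue[str | None]" = queue.Queue()
    threading.Thread(target=main_llm, args=(user_turn, token_queue), daemon=True).start()

    if turn_taking_ready(silence_ms=250):
        # Speak the fast prefix response immediately ...
        for chunk in incremental_tts(iter(prefix_llm(user_turn).split())):
            print("play", chunk)
        # ... then stream the main response as it arrives.
        def stream() -> Iterator[str]:
            while (word := token_queue.get()) is not None:
                yield word
        for chunk in incremental_tts(stream()):
            print("play", chunk)

respond("user question")
```

The design choice illustrated here is that the prefix LLM and the main LLM run concurrently: the user hears speech almost immediately, and the incremental TTS keeps producing audio as the main response streams in, rather than waiting for the full text.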