Spoken conversational AI
| Title: | Spoken conversational AI |
| DNr: | Berzelius-2026-17 |
| Project Type: | LiU Berzelius |
| Principal Investigator: | Gabriel Skantze <skantze@kth.se> |
| Affiliation: | Kungliga Tekniska högskolan |
| Duration: | 2026-01-30 – 2026-08-01 |
| Classification: | 10208 |
| Keywords: | |
Abstract
This is an umbrella project for three individual PhD student projects, funded by WASP:
1. Multimodal turn-taking prediction in multi-party conversations:
Turn-taking is the process by which speakers and listeners in a conversation alternate roles. In human communication, individuals predict who will speak next using visual cues such as gaze and facial movements, as well as audio cues such as subtle pauses and speech context. The goal of this research is to design a multimodal deep learning model that predicts turn-taking in multi-speaker settings from these cues. This work extends previous successful voice activity projection models, which considered audio cues only. By combining visual and audio inputs, the model aims to improve turn-taking prediction, contributing to more natural and fluid conversations, especially in human-robot interaction scenarios (a minimal model sketch follows this list).
2. Representation learning for conversational AI:
We investigate various methods for self-supervised learning of representations for conversational AI and apply them to downstream tasks such as turn-taking and backchannel prediction (see the second sketch below).
3. Streaming speech synthesis:
We train a model that takes a stream of text from an LLM and converts it incrementally into speech samples that can be played back immediately, making spoken communication between humans and machines smooth (see the third sketch below).
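The first sketch illustrates the kind of multimodal fusion described in project 1: a minimal PyTorch model that combines per-frame audio features with per-speaker visual features to predict the next speaker. The feature dimensions, GRU encoders, fusion scheme, and next-speaker head are all assumptions made for illustration; the actual extension of the voice activity projection models may differ.

```python
# Minimal sketch of a multimodal turn-taking predictor (project 1), assuming
# precomputed audio frames and per-speaker visual features. Architecture and
# dimensions are illustrative, not the project's actual design.
import torch
import torch.nn as nn

class MultimodalTurnTaking(nn.Module):
    def __init__(self, audio_dim=80, visual_dim=64, hidden=256, n_speakers=3):
        super().__init__()
        self.audio_enc = nn.GRU(audio_dim, hidden, batch_first=True)
        self.visual_enc = nn.GRU(visual_dim * n_speakers, hidden, batch_first=True)
        self.fusion = nn.Linear(2 * hidden, hidden)
        # Predict, for each frame, which of the n_speakers speaks next.
        self.head = nn.Linear(hidden, n_speakers)

    def forward(self, audio, visual):
        # audio:  (batch, time, audio_dim), e.g. log-mel frames
        # visual: (batch, time, n_speakers * visual_dim), e.g. gaze/face features
        a, _ = self.audio_enc(audio)
        v, _ = self.visual_enc(visual)
        h = torch.tanh(self.fusion(torch.cat([a, v], dim=-1)))
        return self.head(h)  # (batch, time, n_speakers) next-speaker logits

model = MultimodalTurnTaking()
audio = torch.randn(2, 100, 80)
visual = torch.randn(2, 100, 3 * 64)
print(model(audio, visual).shape)  # torch.Size([2, 100, 3])
```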
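The second sketch shows one self-supervised objective of the family investigated in project 2: CPC-style contrastive prediction of future frames. The encoder sizes, the prediction horizon `k`, and the use of in-batch negatives are assumptions chosen for illustration, not the project's actual method.

```python
# Minimal sketch of contrastive self-supervised pretraining (project 2).
import torch
import torch.nn as nn
import torch.nn.functional as F

class CPCEncoder(nn.Module):
    def __init__(self, feat_dim=80, hidden=256):
        super().__init__()
        self.encoder = nn.Linear(feat_dim, hidden)       # per-frame encoder
        self.context = nn.GRU(hidden, hidden, batch_first=True)
        self.predictor = nn.Linear(hidden, hidden)       # predicts future latents

    def forward(self, frames):
        z = self.encoder(frames)   # (B, T, H) local latents
        c, _ = self.context(z)     # (B, T, H) causal context vectors
        return z, c

def infonce_loss(model, frames, k=5):
    """Predict the latent k frames ahead; the other sequences in the
    batch serve as negatives."""
    z, c = model(frames)
    pred = model.predictor(c[:, :-k])  # (B, T-k, H)
    target = z[:, k:]                  # (B, T-k, H)
    # logits[t, b, n]: similarity of prediction b to candidate n at step t.
    logits = torch.einsum('bth,nth->tbn', pred, target)
    labels = torch.arange(z.size(0)).expand(logits.size(0), -1)
    return F.cross_entropy(logits.reshape(-1, z.size(0)), labels.reshape(-1))

model = CPCEncoder()
loss = infonce_loss(model, torch.randn(4, 200, 80))
loss.backward()
```

The pretrained context vectors would then be fine-tuned with a small classification head for downstream tasks such as turn-taking or backchannel prediction.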
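The third sketch outlines the streaming pipeline of project 3: tokens arrive from an LLM, are buffered until a phrase boundary, and each phrase is synthesized and played before the full response is complete. The `synthesize` function is a hypothetical stand-in for the incremental TTS model, and the punctuation-based boundary heuristic is an illustrative simplification.

```python
# Minimal sketch of streaming LLM-to-speech (project 3). synthesize() is a
# hypothetical placeholder for the incremental TTS model.
from typing import Iterator

BOUNDARIES = {'.', ',', '?', '!', ';'}

def synthesize(text: str) -> bytes:
    # Placeholder: a real model would return audio samples for one phrase,
    # conditioning on earlier phrases to keep prosody consistent.
    return text.encode()  # dummy "audio"

def stream_tts(tokens: Iterator[str]) -> Iterator[bytes]:
    """Convert an LLM token stream into audio chunks that can be played as
    soon as each phrase is complete, instead of waiting for the full text."""
    buffer = []
    for tok in tokens:
        buffer.append(tok)
        # Flush at phrase boundaries so playback can start early.
        if tok and tok[-1] in BOUNDARIES:
            yield synthesize(''.join(buffer))
            buffer.clear()
    if buffer:  # flush whatever remains at end of stream
        yield synthesize(''.join(buffer))

# Example: tokens as they might arrive from an LLM API.
llm_tokens = iter(["Hello", ",", " how", " can", " I", " help", " you", "?"])
for audio_chunk in stream_tts(llm_tokens):
    print(audio_chunk)  # would be sent to the audio device in a real system
```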