Title: Spoken conversational AI
DNr: Berzelius-2026-17
Project Type: LiU Berzelius
Principal Investigator: Gabriel Skantze <skantze@kth.se>
Affiliation: Kungliga Tekniska högskolan
Duration: 2026-01-30 – 2026-08-01
Classification: 10208
Keywords:

Abstract

This is an umbrella project for three individual PhD student projects, funded by WASP:

1. Multimodal turn-taking prediction in multi-party conversations: Turn-taking is the process by which speakers and listeners in a conversation alternate roles. In human communication, individuals predict the next speaker using visual cues, such as gaze and facial movements, as well as audio cues, such as subtle pauses and speech context. The goal of this research is to design a multimodal deep learning model that predicts turn-taking in multi-speaker settings based on these cues. The work extends previous successful voice activity projection models, which considered only audio cues. By combining visual and audio inputs, the model seeks to improve turn-taking prediction, contributing to more natural and fluid conversations, especially in human-robot interaction scenarios.

2. Representation learning for conversational AI: We investigate various methods for self-supervised learning of representations for conversational AI and apply them to downstream tasks, such as turn-taking and backchannel prediction.

3. Streaming speech synthesis: We train a model that takes a stream of text coming from an LLM and converts it into speech samples that can be played immediately, making communication between humans and machines smooth.
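The streaming-synthesis idea in item 3 can be sketched as follows. This is a minimal illustration, not the project's actual model: `synthesize` is a hypothetical stub standing in for a real speech model, and chunking at sentence boundaries is one simple assumption about where audio can be emitted early.

```python
def synthesize(text):
    """Hypothetical stub: a real model would return audio samples for `text`."""
    return f"<audio:{text.strip()}>"

def stream_speech(token_stream, boundaries=".!?"):
    """Buffer LLM tokens and yield an audio chunk as soon as a
    sentence boundary arrives, instead of waiting for the full response."""
    buffer = ""
    for token in token_stream:
        buffer += token
        if buffer and buffer[-1] in boundaries:
            yield synthesize(buffer)
            buffer = ""
    if buffer.strip():  # flush any trailing partial sentence
        yield synthesize(buffer)

# Tokens arriving incrementally from an LLM:
tokens = ["Hello", " there", ".", " How", " are", " you", "?"]
chunks = list(stream_speech(tokens))
# Each chunk is synthesized, and could be played, before later tokens arrive.
```

The point of the sketch is the latency property: the first chunk is available after the first sentence boundary, so playback can start while the LLM is still generating.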