Multimodal Conversation Model for Turn-Taking Event Prediction
Title: Multimodal Conversation Model for Turn-Taking Event Prediction
DNr: Berzelius-2024-411
Project Type: LiU Berzelius
Principal Investigator: Haotian Qi <haotianq@kth.se>
Affiliation: Kungliga Tekniska högskolan
Duration: 2024-10-24 – 2025-05-01
Classification: 10204
Keywords:

Abstract

Turn-taking is the process in conversation by which speakers and listeners alternate roles. In human communication, individuals predict who will speak next using visual cues, such as gaze and facial movements, as well as audio cues, such as subtle pauses and speech context. The goal of this research is to design a multimodal deep learning model that predicts turn-taking in multi-speaker settings based on these cues. This work extends previous, successful Voice Activity Projection (VAP) models, which consider only audio cues. By combining visual and audio inputs, the model seeks to improve turn-taking prediction, contributing to more natural and fluid conversations, especially in human-robot interaction scenarios.
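
The abstract does not prescribe an architecture, so the following is only a minimal sketch of how such a multimodal extension of a VAP-style model might look. It assumes pre-extracted per-frame audio embeddings (e.g., from a speech encoder) and visual embeddings (e.g., gaze and facial-movement features), fused by concatenation and fed to a recurrent layer that predicts voice activity for each speaker over a set of future time bins. All names and dimensions (MultimodalTurnTakingModel, horizon_bins, etc.) are illustrative assumptions, not the project's actual design.

```python
import torch
import torch.nn as nn

class MultimodalTurnTakingModel(nn.Module):
    """Hypothetical sketch: fuse per-frame audio and visual features and
    predict near-future voice activity per speaker (VAP-style targets)."""

    def __init__(self, audio_dim=256, visual_dim=128, hidden_dim=256,
                 num_speakers=2, horizon_bins=4):
        super().__init__()
        # Project each modality into a shared hidden space.
        self.audio_proj = nn.Linear(audio_dim, hidden_dim)
        self.visual_proj = nn.Linear(visual_dim, hidden_dim)
        # Temporal model over the concatenated (late-fused) sequence.
        self.temporal = nn.GRU(2 * hidden_dim, hidden_dim, batch_first=True)
        # For each speaker, predict voice activity in each future time bin.
        self.head = nn.Linear(hidden_dim, num_speakers * horizon_bins)
        self.num_speakers = num_speakers
        self.horizon_bins = horizon_bins

    def forward(self, audio_feats, visual_feats):
        # audio_feats: (batch, time, audio_dim)
        # visual_feats: (batch, time, visual_dim), frame-aligned with audio
        fused = torch.cat([self.audio_proj(audio_feats),
                           self.visual_proj(visual_feats)], dim=-1)
        out, _ = self.temporal(fused)
        logits = self.head(out)  # (batch, time, num_speakers * horizon_bins)
        return logits.view(*logits.shape[:2],
                           self.num_speakers, self.horizon_bins)

# Toy usage with random tensors standing in for real extracted embeddings.
model = MultimodalTurnTakingModel()
audio = torch.randn(2, 100, 256)   # e.g. speech-encoder frames
visual = torch.randn(2, 100, 128)  # e.g. gaze / facial-movement features
probs = torch.sigmoid(model(audio, visual))
print(probs.shape)  # torch.Size([2, 100, 2, 4])
```

Under this kind of setup, the per-speaker, per-bin probabilities can be thresholded or compared across speakers at each frame to estimate who is likely to hold or take the turn next.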