Multimodal Natural Conversation Perception and Generation
| Title: | Multimodal Natural Conversation Perception and Generation |
| DNr: | Berzelius-2026-153 |
| Project Type: | LiU Berzelius |
| Principal Investigator: | André Tiago Abelho Pereira <atap@kth.se> |
| Affiliation: | Kungliga Tekniska högskolan |
| Duration: | 2026-06-01 – 2026-12-01 |
| Classification: | 10208 |
| Keywords: | |
Abstract
This project is part of the WASP Media & Language Research Arena project BELLA: Building Expressive Language for Likeable Agents. The goal is to develop multimodal foundation-model approaches for natural embodied conversation, focusing on how social robots and virtual agents can perceive, represent, and generate socially meaningful behaviour.
Current language and multimodal models are increasingly powerful, but they still struggle with the fine-grained coordination of speech, gaze, facial expressions, gestures, turn-taking, task context, and listener feedback that characterizes natural human interaction. This project will investigate how such multimodal social cues can be represented as embeddings, tokens, or semantic tags and integrated with dialogue models for perception and generation.
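As a rough illustration of the "semantic tags" option, the sketch below interleaves time-stamped non-verbal cues with a word-level transcript so that a text-only dialogue model could read them in temporal order. It is a minimal sketch under stated assumptions, not the project's method: the `Word` and `Cue` classes, the `interleave` function, the `<modality:label>` tag format, and the example labels are all hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Word:
    start: float  # onset time in seconds
    text: str


@dataclass
class Cue:
    start: float   # onset time in seconds
    modality: str  # hypothetical modality name, e.g. "gaze" or "gesture"
    label: str     # hypothetical semantic label, e.g. "at_partner"


def interleave(words, cues):
    """Merge words and cue tags into one time-ordered string.

    When a cue and a word share an onset, the cue tag is placed first.
    """
    events = [(w.start, 1, w.text) for w in words]
    events += [(c.start, 0, f"<{c.modality}:{c.label}>") for c in cues]
    events.sort(key=lambda e: (e[0], e[1]))
    return " ".join(text for _, _, text in events)


# Hypothetical utterance from a tabletop game, with two non-verbal cues.
words = [Word(0.0, "your"), Word(0.3, "turn"), Word(0.9, "okay")]
cues = [Cue(0.0, "gaze", "at_partner"), Cue(0.6, "gesture", "point_board")]

prompt = interleave(words, cues)
print(prompt)
```

Tags of this kind keep the cues human-readable and require no change to the model's input layer, at the cost of discarding the continuous detail that learned embeddings or discrete tokens could retain.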
The main application domain is embodied interaction in social robotics and cooperative tabletop games, where agents must combine dialogue, task reasoning, gaze, and non-verbal behaviour. A related methodological strand will connect this work to fMRI-based modelling of natural conversation, where extracted multimodal behavioural embeddings are linked to brain-derived representations of conversational perception and engagement.
The requested resources will be used for large-scale multimodal feature extraction, LLM/VLM inference, parameter-efficient fine-tuning, temporal multimodal modelling, and ablation studies comparing unimodal, multimodal, and brain-aligned representations.