Evaluating Audio Foundation Models through Paralinguistic, Extralinguistic and Non-speech Vocalization prompts
Title: |
Evaluating Audio Foundation Models through Paralinguistic, Extralinguistic and Non-speech Vocalization prompts |
DNr: |
Berzelius-2025-104 |
Project Type: |
LiU Berzelius |
Principal Investigator: |
Shree Harsha Bokkahalli Satish <shbs@kth.se> |
Affiliation: |
Kungliga Tekniska högskolan |
Duration: |
2025-03-12 – 2025-10-01 |
Classification: |
10208 |
Keywords: |
|
Abstract
Audio Foundation Models (AFMs) claim to overcome the limitations of the conventional ASR-LLM-TTS pipeline for interacting with LLMs through voice. They can supposedly understand, respond to, and utilize prosodic cues: paralinguistic information (emotion, feelings, attitudes, etc.) and extralinguistic information (speaker identity, demographics) when generating a response to the input prompt. In the conventional pipeline, this information is lost or muddled during the conversion steps between components.
In this work, I aim to demonstrate the extent to which selected AFMs can process paralinguistic and extralinguistic information, and to expose possible biases in their responses, using three modalities of input speech prompts: paralinguistic, extralinguistic, and non-speech vocalizations. The responses are expected to underline the importance of continuing to examine fair and inclusive methodologies for training audio foundation models and for conversational AI more broadly.