Embodied perception for interacting with the 3D world
Title: Embodied perception for interacting with the 3D world
DNr: Berzelius-2024-331
Project Type: LiU Berzelius
Principal Investigator: Hedvig Kjellström <hedvig@kth.se>
Affiliation: Kungliga Tekniska högskolan
Duration: 2024-09-28 – 2025-04-01
Classification: 10207
Homepage: https://www.kth.se/profile/hedvig
Keywords:

Abstract

Embodied artificial intelligence is important for developing assistive technologies that support our daily lives, and one of its key requirements is the ability to understand and interact with the 3D world. This project aims to develop embodied perception techniques that endow machines and virtual agents with the ability to model the 3D bodies of subjects, i.e., humans and animals, from 2D videos, and then to understand and interact with them in downstream applications. The project is organized into five stages.

First, we Model real-world subjects, i.e., humans and animals, from the machine's perspective. In particular, we develop improved 3D models that encode fine details and fast motions of human bodies across growth stages, e.g., newborns, toddlers, teenagers, and adults. For animals, we focus on 3D models of quadrupeds, starting with specific species such as horses and dogs and possibly extending to cattle and other species.

Second, we Estimate the subjects' 3D model representations from 2D videos while preserving, as comprehensively as possible, the social signals the subjects convey, such as emotion. On the one hand, the estimated 3D models will capture expressive social properties such as pose, shape, facial expressions, and body behaviors. On the other hand, they will also encode spontaneous cues revealed unconsciously by the subjects, such as head pose, gaze, facial micro-expressions, and micro-gestures.

Third, we Infer the semantic meaning of the 3D models in application scenarios, e.g., the activity and emotion expressed by a motion. The project will develop methods for understanding drivers' affective state and frustration level, analyzing the facial expressions and body gestures of neonates for automated assessment of their neurodevelopmental status, recognizing the underlying messages in the communication between conductors and orchestras, and analyzing animal behaviors to give humans insight into the minds of the animals in their care.

Fourth, we Synthesize 3D pose and shape data to generate new training material for downstream models. Large-scale data is a key success factor in training deep learning models, but collecting 3D data is extremely costly. The project will develop efficient 3D synthesis techniques that transfer existing poses and activities to new subjects or species, or generate previously unseen poses and activities for existing subjects and species using multimodal large language models.

Lastly, we Evaluate generative models and the samples they produce. As generative models are revolutionizing many industries and professions, it is particularly important to understand the quality of both the models and their outputs. However, because this quality is not intuitive for non-expert users to judge, measuring it usually requires experts. In this part, the project will design and evaluate reliable, human-aligned automated methods and metrics for assessing generative models and generated samples.
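
To illustrate the kind of parametric body representation the Model stage builds on (in the spirit of SMPL/SMAL-style models), the sketch below shows a simplified shape-and-pose forward pass using linear blend skinning. The array names and dimensions are illustrative assumptions, not the project's actual model.

import numpy as np

def lbs_forward(template, shape_dirs, betas, joint_transforms, skin_weights):
    """Simplified SMPL/SMAL-style forward pass (illustrative sketch).

    template:         (V, 3)    rest-pose mesh vertices
    shape_dirs:       (V, 3, B) shape blend-shape basis
    betas:            (B,)      shape coefficients of a subject
    joint_transforms: (J, 4, 4) homogeneous world transforms of posed joints
    skin_weights:     (V, J)    per-vertex skinning weights (rows sum to 1)
    """
    # 1) Shape: add a linear combination of blend shapes to the template.
    v_shaped = template + shape_dirs @ betas                       # (V, 3)

    # 2) Pose: blend the joint transforms per vertex (linear blend skinning).
    T = np.einsum("vj,jab->vab", skin_weights, joint_transforms)   # (V, 4, 4)

    # 3) Apply the blended transforms in homogeneous coordinates.
    ones = np.ones((v_shaped.shape[0], 1))
    v_hom = np.concatenate([v_shaped, ones], axis=1)               # (V, 4)
    v_posed = np.einsum("vab,vb->va", T, v_hom)[:, :3]             # (V, 3)
    return v_posed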
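
For the Estimate stage, recovering 3D parameters from 2D video is commonly posed as minimizing a reprojection error against detected 2D keypoints. The toy PyTorch sketch below optimizes 3D joint positions under a simple pinhole camera; the function name, camera model, and optimizer settings are assumptions for illustration only.

import torch

def fit_joints_to_keypoints(init_joints3d, keypoints2d, confidences, focal, n_iters=200):
    """Toy optimization-based fitting of 3D joints to 2D detections.

    init_joints3d: (J, 3) initial 3D joints in camera coordinates (z > 0)
    keypoints2d:   (J, 2) detected 2D keypoints in image coordinates
    confidences:   (J,)   detector confidences in [0, 1]
    focal:         pinhole focal length (image origin at the principal point)
    """
    joints3d = init_joints3d.clone().requires_grad_(True)
    optimizer = torch.optim.Adam([joints3d], lr=1e-2)

    for _ in range(n_iters):
        optimizer.zero_grad()
        # Pinhole projection: (x, y, z) -> focal * (x / z, y / z).
        proj = focal * joints3d[:, :2] / joints3d[:, 2:3].clamp(min=1e-6)
        # Confidence-weighted squared reprojection error.
        loss = (confidences[:, None] * (proj - keypoints2d) ** 2).mean()
        loss.backward()
        optimizer.step()
    return joints3d.detach()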
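
The Infer stage maps estimated 3D representations to semantic labels such as activity or affect. A minimal placeholder for such a downstream model is a sequence classifier over pose parameters, e.g. the small GRU below; the dimensions and class count are arbitrary placeholders, not the project's architecture.

import torch
import torch.nn as nn

class PoseSequenceClassifier(nn.Module):
    """Minimal GRU classifier over sequences of pose parameters (placeholder)."""

    def __init__(self, pose_dim=72, hidden_dim=128, n_classes=8):
        super().__init__()
        self.encoder = nn.GRU(pose_dim, hidden_dim, batch_first=True)
        self.head = nn.Linear(hidden_dim, n_classes)

    def forward(self, pose_seq):               # pose_seq: (B, T, pose_dim)
        _, h_last = self.encoder(pose_seq)     # h_last:   (1, B, hidden_dim)
        return self.head(h_last.squeeze(0))    # logits:   (B, n_classes)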
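
One simple way the Synthesize stage can create new training material in a disentangled parametric model is to recombine observed pose parameters with the shape parameters of other subjects or species, then pass the pairs through the body model (e.g. the lbs_forward sketch above). The function below illustrates only the recombination step; the array layout is hypothetical.

import numpy as np

def recombine_pose_and_shape(pose_bank, shape_bank, n_samples, seed=0):
    """Pair observed pose parameters with shapes of other subjects (sketch).

    pose_bank:  (P, D_pose)  pose parameters collected from real sequences
    shape_bank: (S, D_shape) shape parameters of known subjects/species
    Returns n_samples (pose, shape) pairs usable as synthetic training data.
    """
    rng = np.random.default_rng(seed)
    pose_idx = rng.integers(0, len(pose_bank), size=n_samples)
    shape_idx = rng.integers(0, len(shape_bank), size=n_samples)
    return pose_bank[pose_idx], shape_bank[shape_idx]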
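
For the Evaluate stage, one widely used automated quality measure for generative models is the Fréchet distance between feature distributions of real and generated samples, the quantity behind FID-style metrics. The sketch below assumes features have already been extracted by some pretrained encoder; it illustrates the metric family, not the evaluation protocol the project will ultimately design.

import numpy as np
from scipy import linalg

def frechet_distance(real_feats, gen_feats):
    """Fréchet distance between Gaussians fitted to two feature sets.

    real_feats, gen_feats: (N, D) feature arrays from a pretrained encoder.
    """
    mu_r, mu_g = real_feats.mean(axis=0), gen_feats.mean(axis=0)
    cov_r = np.cov(real_feats, rowvar=False)
    cov_g = np.cov(gen_feats, rowvar=False)

    # Matrix square root of the covariance product; discard numerical
    # imaginary noise if it appears.
    covmean, _ = linalg.sqrtm(cov_r @ cov_g, disp=False)
    if np.iscomplexobj(covmean):
        covmean = covmean.real

    diff = mu_r - mu_g
    return float(diff @ diff + np.trace(cov_r + cov_g - 2.0 * covmean))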