Advanced 3D Perception for Environment Understanding and Representation
Title: Advanced 3D Perception for Environment Understanding and Representation
DNr: Berzelius-2024-238
Project Type: LiU Berzelius
Principal Investigator: Patric Jensfelt <patric@kth.se>
Affiliation: Kungliga Tekniska högskolan
Duration: 2024-07-01 – 2025-01-01
Classification: 10207
Homepage: https://www.kth.se/profile/patric
Keywords:

Abstract

Within this research project, six PhD students, together with research engineers, work on perception, 3D environment understanding, and representation. 3D perception is a multidisciplinary research field dedicated to extracting spatial information from two-dimensional images, enabling the creation of three-dimensional representations of the visual world. By combining computer vision with geometric reasoning, it facilitates applications ranging from object recognition and scene reconstruction to autonomous navigation systems.

In autonomous driving, multi-modal sensor data, including lidar and camera, is often available. Fusing this multi-modal information into a reliable scene representation is crucial in traffic scenarios involving multiple agents, where collision risk is significant. Researchers in the project are developing better data representations using neural networks to improve modeling and prediction in such multi-agent scenarios. While this task is manageable with ground-truth (GT) labels, it becomes considerably harder in their absence. Furthermore, due to sensor differences between self-driving datasets, models trained on one dataset cannot easily be deployed on another dataset or vehicle. Self-supervised representation learning can help models extract meaningful features without GT labels, and, combined with contrastive and adversarial learning approaches, it can contribute to developing models that are invariant to such dataset differences.

Similarly, it is important for an autonomous vehicle to localize itself accurately in its environment. Localization and camera pose estimation are often extremely challenging, especially in ambiguous environments. Recently, various vision transformer architectures have been proposed for estimating the relative camera pose between different viewpoints of a common scene, and epipolar geometry can be used to further refine these predictions. Nevertheless, the use and combination of these two approaches are not yet well established in the field.
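To make the self-supervised direction above more concrete, the following is a minimal illustrative sketch (not the project's actual code) of a contrastive InfoNCE-style objective, assuming an encoder has embedded two augmented views of the same lidar/camera scene into tensors z1 and z2 of shape [batch, dim].

    import torch
    import torch.nn.functional as F

    def info_nce_loss(z1, z2, temperature=0.1):
        # Normalise embeddings so the dot product is the cosine similarity.
        z1 = F.normalize(z1, dim=1)
        z2 = F.normalize(z2, dim=1)
        # Pairwise similarities between all views in the batch: [batch, batch].
        logits = z1 @ z2.t() / temperature
        # The matching view (same scene, different augmentation) is the positive;
        # all other scenes in the batch act as negatives.
        targets = torch.arange(z1.size(0), device=z1.device)
        return F.cross_entropy(logits, targets)

Training with such an objective pulls embeddings of the same scene together and pushes different scenes apart, without requiring GT labels.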
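Likewise, as a hedged illustration of how epipolar geometry could refine a network's relative-pose prediction, the sketch below scores a candidate rotation R and translation t by the algebraic epipolar error over matched, calibration-normalised homogeneous image points x1 and x2 (shape [N, 3]); the variable names and scoring choice are assumptions for illustration only.

    import numpy as np

    def skew(t):
        # Cross-product matrix [t]_x of a 3-vector t.
        return np.array([[0, -t[2], t[1]],
                         [t[2], 0, -t[0]],
                         [-t[1], t[0], 0]])

    def epipolar_residual(R, t, x1, x2):
        # Essential matrix E = [t]_x R for the candidate relative pose.
        E = skew(t) @ R
        # Algebraic epipolar error |x2^T E x1| per correspondence;
        # lower values mean the pose is more consistent with the matches.
        return np.abs(np.einsum('ni,ij,nj->n', x2, E, x1))

Residuals of this kind can be used, for example, to re-rank or locally adjust pose hypotheses produced by a vision transformer.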