Deep learning for simultaneous localization and mapping
Abstract
This continuation of the initial, successful project will extend the approach to several datasets and verify our current state-of-the-art methods in a broader context. Our aim is to further explore the multi-modal problem, mainly audiovisual localization, tracking, navigation, and mapping. The focus, however, remains on sound source localization, where multi-source scenarios are a particularly interesting problem. We will also consider optimizing hardware usage, e.g., better GPU utilization and parallelization.
In this project we use machine learning for audio localization problems. Given sound recordings from several microphones together with ground truth source positions, we train a model to estimate the position of the sound source. This requires training large transformer models on large datasets. Methods not based on deep learning also remain relevant, since studying such "physics-informed" approaches may eventually inspire new learning methods.
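As a concrete illustration of this supervised setup, the following minimal sketch trains a small transformer encoder to regress a 3-D source position from multi-microphone spectrogram features. It assumes PyTorch; the microphone count, feature dimensions, model size, and synthetic tensors are placeholders for illustration, not the project's actual pipeline.

# Minimal sketch of the supervised localization setup, assuming PyTorch.
# Feature layout, model size, and data are illustrative placeholders.
import torch
import torch.nn as nn

class AudioLocalizer(nn.Module):
    """Transformer encoder mapping per-frame multi-microphone features to a 3-D position."""
    def __init__(self, n_mics=12, n_freq=64, d_model=128):
        super().__init__()
        self.proj = nn.Linear(n_mics * n_freq, d_model)   # flatten mic/freq axes per time frame
        layer = nn.TransformerEncoderLayer(d_model, nhead=4, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=2)
        self.head = nn.Linear(d_model, 3)                 # regress (x, y, z)

    def forward(self, x):                                 # x: (batch, time, n_mics * n_freq)
        h = self.encoder(self.proj(x))
        return self.head(h.mean(dim=1))                   # pool over time, predict one position

model = AudioLocalizer()
opt = torch.optim.Adam(model.parameters(), lr=1e-3)
feats = torch.randn(32, 100, 12 * 64)    # stand-in for spectrogram features (batch, frames, mics x bins)
gt_pos = torch.randn(32, 3)              # stand-in for motion-capture ground-truth positions
for step in range(5):
    loss = nn.functional.mse_loss(model(feats), gt_pos)  # supervised regression against ground truth
    opt.zero_grad()
    loss.backward()
    opt.step()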
We will continue to consider the LuViRA dataset (https://arxiv.org/abs/2302.05309), which consists of ground truth robot trajectories recorded in a motion capture lab. The task is to localize a robot given audio recordings from microphones in the room. In the long term, the goal is to incorporate the other data modalities of the dataset, namely MIMO radio signals and RGB + depth camera images. Our initial investigation on a small part of the dataset has shown that good localization performance can be obtained using only a small fraction of the audio data. Using the entire dataset, we expect to be able to train a state-of-the-art acoustic localization model, but challenges remain in terms of efficient machine learning.
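For context on the classical, physics-based side of acoustic localization, a standard building block is time-difference-of-arrival estimation between microphone pairs. The sketch below implements the well-known GCC-PHAT estimator with NumPy on synthetic signals; it is purely an illustrative baseline and is not part of the LuViRA pipeline or the project's method.

# GCC-PHAT time-delay estimation between two microphone signals (illustrative only).
import numpy as np

def gcc_phat(sig, ref, fs, max_tau=None):
    """Estimate the delay of `sig` relative to `ref` in seconds using GCC-PHAT."""
    n = sig.shape[0] + ref.shape[0]
    SIG = np.fft.rfft(sig, n=n)
    REF = np.fft.rfft(ref, n=n)
    R = SIG * np.conj(REF)
    cc = np.fft.irfft(R / (np.abs(R) + 1e-15), n=n)   # phase-transform whitening
    max_shift = n // 2
    if max_tau is not None:
        max_shift = min(int(fs * max_tau), max_shift)
    cc = np.concatenate((cc[-max_shift:], cc[:max_shift + 1]))
    shift = np.argmax(np.abs(cc)) - max_shift
    return shift / float(fs)

# Synthetic check: a copy of white noise delayed by 32 samples at 16 kHz
# should yield an estimated delay of about 32 / 16000 = 0.002 s.
fs = 16000
x = np.random.randn(fs)
y = np.concatenate((np.zeros(32), x))[:fs]
print(gcc_phat(y, x, fs))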