Deep Learning for the Physical World
Title: |
Deep Learning for the Physical World |
DNr: |
Berzelius-2025-131 |
Project Type: |
LiU Berzelius |
Principal Investigator: |
Mårten Björkman <celle@kth.se> |
Affiliation: |
Kungliga Tekniska högskolan |
Duration: |
2025-04-01 – 2025-10-01 |
Classification: |
10210 |
Keywords: |
|
Abstract
This proposal comprises three separate research tracks that all explore deep learning for the representation of complex real-world image data. In the first track, we study the observed generalization abilities of deep networks and attempt to explain the mechanisms behind them. The second track starts at the other end, with the data rather than the network: the goal is to find the most suitable network structures for representing complex sequential human movement data, more specifically dance movements and locomotion data. Finally, the third track addresses neural representation and rendering, extending such methods from static to dynamic and deformable scenes.
Track 1: Understanding Learning and Memorization in Deep Networks on Natural Image Data
Deep networks achieve state-of-the-art performance on several tasks in the natural world, showing a remarkable generalization ability despite training sets that appear too small relative to the capacity of the networks. This track investigates the mechanisms underpinning this ability and the roles played by the network architecture and by the optimization procedure used to train the network. Of particular interest is the robustness that emerges from large-scale training, viewed through the lens of neural scaling laws: a series of phenomenological laws relating training set size and the number of model parameters to final model performance. Extending prior work from the research group, the project will study scaling laws for the robustness of neural representations in models trained in the self-supervised learning paradigm.
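The scaling laws mentioned above are typically fitted as power laws. As a minimal illustration, not the project's actual methodology, the sketch below fits a hypothetical law err(n) = a * n^(-alpha) relating training set size n to test error, using synthetic measurements and a least-squares fit in log-log space; all numbers (a = 5, alpha = 0.3) are invented for the example.

```python
import numpy as np

# Hypothetical illustration of fitting a neural scaling law: test error
# is assumed to follow a power law err(n) = a * n**(-alpha) in the
# training set size n, which becomes linear in log-log space.
rng = np.random.default_rng(0)
n = np.logspace(3, 7, 10)            # training set sizes (synthetic)
true_a, true_alpha = 5.0, 0.3        # invented ground-truth parameters
err = true_a * n ** (-true_alpha) * np.exp(0.01 * rng.standard_normal(10))

# Least-squares fit of log err = log a - alpha * log n.
slope, intercept = np.polyfit(np.log(n), np.log(err), 1)
alpha_hat, a_hat = -slope, np.exp(intercept)
```

Plotting measured errors against n on log-log axes and checking for a straight line is the usual sanity check before trusting such a fit.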
Track 2: Modelling Sequential Data and the Time Dimension
In recent years, generative models have been developed to fit and sample from complex distributions, but so far they have rarely been applied to sequential data such as human movement. This project will explore and develop diffusion models, a recent family of generative models, for generating human dance movement given music. Dance requires the skillful composition of complex movements that follow the rhythmic, tonal and timbral features of the music. Formally, generating dance conditioned on a piece of music can be expressed as modelling a high-dimensional continuous motion signal conditioned on an audio signal. The project proposes a probabilistic autoregressive architecture that models the distribution over future poses, using a multimodal transformer encoder. Initial experiments will also be conducted on less complex locomotion data, in order to gain experience before the more challenging experiments on dance movements.
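The probabilistic autoregressive setup described above can be sketched as follows. This is a hypothetical toy with invented dimensions and a random linear layer standing in for the trained multimodal encoder; it only illustrates the interface: at each step, the recent pose history plus the aligned audio features are mapped to the parameters of a distribution over the next pose, which is sampled and fed back in.

```python
import numpy as np

# Toy sketch of probabilistic autoregressive pose generation conditioned
# on audio. POSE_DIM, AUDIO_DIM and CONTEXT are invented for illustration.
POSE_DIM, AUDIO_DIM, CONTEXT = 6, 4, 3
rng = np.random.default_rng(42)

# Stand-in for a trained multimodal encoder: one random linear layer
# producing the mean and log-variance of a Gaussian over the next pose.
W = rng.standard_normal((CONTEXT * POSE_DIM + AUDIO_DIM, 2 * POSE_DIM)) * 0.1

def next_pose_distribution(history, audio_frame):
    x = np.concatenate([history.ravel(), audio_frame])
    out = x @ W
    mean, log_var = out[:POSE_DIM], out[POSE_DIM:]
    return mean, np.exp(0.5 * log_var)

def generate(audio, n_steps):
    history = np.zeros((CONTEXT, POSE_DIM))  # seed poses
    poses = []
    for t in range(n_steps):
        mean, std = next_pose_distribution(history, audio[t])
        pose = mean + std * rng.standard_normal(POSE_DIM)  # sample next pose
        poses.append(pose)
        history = np.vstack([history[1:], pose])  # slide the context window
    return np.stack(poses)

audio = rng.standard_normal((8, AUDIO_DIM))   # synthetic audio features
motion = generate(audio, 8)                   # (8, POSE_DIM) motion sequence
```

Sampling from the predicted distribution, rather than taking the mean, is what lets such a model produce varied dances for the same piece of music.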
Track 3: Neural representation and rendering
While previous methods in the area of neural representation and rendering were limited to static scenes, current research starts to focus on dynamic and deformable scenes as in traffic scenarios or casually captured videos from smartphones. However, those methods are limited for large deformations, cannot account for emerging scene content and take several days on multiple GPUs of optimization time for a single scene. This project intends to overcome these limitations by exploring generalizable methods, which, instead of being optimized on a single scene, are trained across thousands of different scenes and are then able to quickly account for images of a previous unseen scene without additional training time.