Deep Learning for the Physical World
Mårten Björkman <email@example.com>
Kungliga Tekniska högskolan
2023-10-01 – 2024-04-01
This proposal comprises two separate research tracks that both explore deep learning for the representation of complex real-world image data. In the first track, we study the observed generalization ability of deep networks and attempt to explain the mechanisms behind it. The second track starts from the other end, with the data rather than the network. The goal is to find the most suitable network structures for representing complex sequential human movement data, specifically dance movements and locomotion data.
Track 1: Understanding Learning and Memorization in Deep Networks on Natural Image Data
Deep networks achieve state-of-the-art performance on several tasks involving natural-world data, showing a remarkable generalization ability despite training sets that appear too small in relation to the capacity of the networks. This track will investigate the mechanisms behind this ability and the roles played by the nature of the training data, the network structure, and the optimization procedure used for training. It has been observed that networks that generalize tend to express smooth functions of the training data, and that input-space smoothness can be predictive of generalization. What is still unknown, however, is which factors control this smoothness. Using standardized datasets for image classification, a large number of experiments will be conducted on networks of various sizes to empirically find measures that correlate with smoothness and generalization. Of particular interest is explaining the ‘double descent’ behaviour observed when training large networks: as networks increase in size, the functions they express first become more complex, but beyond some point they become smoother while still fitting the training data perfectly.
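As an illustration of what an input-space smoothness measure could look like, the sketch below estimates the average norm of the input gradient by central finite differences. This is only one possible proxy, not necessarily the measure the project will use. The function names and the two toy "models" are hypothetical stand-ins for trained networks with scalar outputs.

```python
import numpy as np

def input_gradient_norm(f, x, eps=1e-4):
    """Central finite-difference estimate of ||df/dx|| at one input x (1-D array)."""
    grad = np.zeros_like(x, dtype=float)
    for i in range(x.size):
        step = np.zeros_like(x, dtype=float)
        step[i] = eps
        grad[i] = (f(x + step) - f(x - step)) / (2.0 * eps)
    return float(np.linalg.norm(grad))

def mean_roughness(f, inputs, eps=1e-4):
    """Average input-gradient norm over a set of inputs; lower means smoother."""
    return float(np.mean([input_gradient_norm(f, x, eps) for x in inputs]))

# Toy stand-ins for two trained networks (hypothetical, for illustration only).
def smooth_model(x):
    return float(np.tanh(x.sum()))

def rough_model(x):
    # Same trend as smooth_model plus a high-frequency ripple.
    return float(np.tanh(x.sum()) + 0.1 * np.sin(50.0 * x.sum()))

rng = np.random.default_rng(0)
inputs = rng.normal(size=(32, 8))  # 32 made-up inputs with 8 features each
smooth_score = mean_roughness(smooth_model, inputs)
rough_score = mean_roughness(rough_model, inputs)
```

Both toy models have similar outputs on average, yet the rippled one receives a much larger roughness score. A measure of this kind could be tracked across network sizes and compared against test error.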
Track 2: Modelling Sequential Data and the Time Dimension
In recent years, so-called generative models have been developed to model and sample from complex distributions, but so far these models have rarely been used for sequential data such as human movement. This project will explore and develop diffusion models, a novel family of generative models, for generating human dance movement conditioned on music. Dance requires the skilful composition of complex movements that follow the music's rhythmic, tonal and timbral features. Formally, generating dance for a piece of music can be expressed as the problem of modelling a high-dimensional continuous motion signal conditioned on an audio signal. We will create a novel probabilistic autoregressive architecture that models the distribution over future poses with a diffusion model conditioned on previous poses and music context, using a multimodal transformer encoder. Initial experiments will be conducted on less complex locomotion data, in order to gain experience and prepare for the more complex experiments on dance movements.
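The training signal for such a conditional diffusion model can be sketched as follows. This is a minimal sketch and not the proposed architecture. It assumes a standard noise-prediction formulation with a linear noise schedule. The denoiser is a hypothetical placeholder where the multimodal transformer would go, and the pose and music feature dimensions are made up.

```python
import numpy as np

def make_alpha_bar(num_steps=1000, beta_start=1e-4, beta_end=0.02):
    """Cumulative signal fraction for a linear noise schedule (assumed)."""
    betas = np.linspace(beta_start, beta_end, num_steps)
    return np.cumprod(1.0 - betas)

def noise_pose(next_pose, t, alpha_bar, rng):
    """Sample a noisy pose x_t from the clean next pose x_0 in closed form."""
    eps = rng.standard_normal(next_pose.shape)
    noisy = np.sqrt(alpha_bar[t]) * next_pose + np.sqrt(1.0 - alpha_bar[t]) * eps
    return noisy, eps

def placeholder_denoiser(noisy_pose, t, prev_poses, music_context):
    """Hypothetical stand-in for the learned network; it always predicts zero noise."""
    return np.zeros_like(noisy_pose)

def denoising_loss(denoiser, next_pose, prev_poses, music_context, alpha_bar, rng):
    """Mean squared error between true and predicted noise at a random step."""
    t = int(rng.integers(0, len(alpha_bar)))
    noisy, eps = noise_pose(next_pose, t, alpha_bar, rng)
    eps_hat = denoiser(noisy, t, prev_poses, music_context)
    return float(np.mean((eps - eps_hat) ** 2))

rng = np.random.default_rng(0)
alpha_bar = make_alpha_bar()
prev_poses = rng.standard_normal((10, 63))     # 10 past frames, 63 pose dims (made up)
music_context = rng.standard_normal((10, 32))  # made-up audio features per frame
next_pose = rng.standard_normal(63)            # the pose to be predicted
loss = denoising_loss(placeholder_denoiser, next_pose, prev_poses,
                      music_context, alpha_bar, rng)
```

The autoregressive structure enters only through the conditioning arguments: at generation time, each sampled pose would be appended to `prev_poses` before the next pose is sampled.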