High-dimensional entropy estimation with applications to deep learning
Viktor Nilsson <email@example.com>
Kungliga Tekniska högskolan
2024-02-01 – 2024-08-01
The project's aim is to investigate whether the notion of renormalized mutual information (RMI) is tractable in high dimension. This problem is closely connected to entropy estimation, which is notoriously difficult in high dimension. Initial results indicate that, for large datasets such as MNIST, features of moderate dimension admit entropy estimation using the differentiable "KNIFE" estimator.
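As a point of reference for this kind of estimator, the following minimal sketch checks a kernel-based entropy estimate against the known entropy of a low-dimensional Gaussian. All parameters are illustrative: KNIFE itself fits a Gaussian-mixture model by gradient descent, whereas here the bandwidth of a fixed-kernel Parzen-Rosenblatt estimate is simply selected by held-out likelihood as a crude stand-in.

```python
import numpy as np

rng = np.random.default_rng(0)

def kde_log_density(x_eval, x_train, h):
    """Log-density of a Gaussian Parzen-Rosenblatt estimate with bandwidth h."""
    d = x_train.shape[1]
    # pairwise squared distances, shape (n_eval, n_train)
    sq = ((x_eval[:, None, :] - x_train[None, :, :]) ** 2).sum(-1)
    log_k = -sq / (2 * h ** 2) - 0.5 * d * np.log(2 * np.pi * h ** 2)
    return np.logaddexp.reduce(log_k, axis=1) - np.log(len(x_train))

def entropy_estimate(x, h):
    """H ≈ -E[log p_h]: fit on one half of the data, evaluate on the other."""
    half = len(x) // 2
    return -kde_log_density(x[half:], x[:half], h).mean()

# Sanity check on a 2-D standard Gaussian, whose true entropy is known.
d, n = 2, 2000
x = rng.standard_normal((n, d))
true_entropy = 0.5 * d * np.log(2 * np.pi * np.e)

# Stand-in for KNIFE's gradient-based fit: pick the bandwidth that
# maximizes held-out likelihood (i.e. minimizes the entropy estimate).
best_h = min([0.1, 0.2, 0.4, 0.8], key=lambda h: entropy_estimate(x, h))
est = entropy_estimate(x, best_h)
```

In higher dimensions the same construction suffers from the usual curse of dimensionality, which is precisely the regime the project targets.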
We aim for a theoretical and empirical study of whether RMI is a useful measure of input-information compression. More specifically, we seek to understand whether a high RMI implies good transfer-learning performance. This would be established by testing the performance of models trained for downstream tasks on features with a wide range of RMI scores. Ideally, we will also develop ways to optimize (maximize) RMI in high dimension, but this may prove too difficult.
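The downstream-task evaluation could resemble the following sketch, in which a linear probe is trained on two synthetic feature maps standing in for high- and low-information features. The data, feature maps, and probe here are placeholders for illustration, not the actual experimental setup:

```python
import numpy as np

rng = np.random.default_rng(1)

def probe_accuracy(feats, labels, steps=500, lr=0.1):
    """Fit a logistic linear probe by gradient descent; return held-out accuracy."""
    n = len(feats)
    tr, te = slice(0, n // 2), slice(n // 2, n)
    w = np.zeros(feats.shape[1])
    b = 0.0
    for _ in range(steps):
        p = 1 / (1 + np.exp(-(feats[tr] @ w + b)))   # probe predictions
        g = p - labels[tr]                            # logistic-loss residual
        w -= lr * feats[tr].T @ g / (n // 2)
        b -= lr * g.mean()
    pred = (feats[te] @ w + b) > 0
    return (pred == labels[te]).mean()

# Synthetic stand-in: the label depends only on the first coordinate of a
# latent x; one feature map keeps that coordinate, the other discards it.
n, d = 2000, 5
x = rng.standard_normal((n, d))
y = (x[:, 0] > 0).astype(float)
informative = x            # retains the label-relevant direction
uninformative = x[:, 1:]   # drops it, so the probe can only guess

acc_hi = probe_accuracy(informative, y)
acc_lo = probe_accuracy(uninformative, y)
```

In the actual study, the two feature maps would be replaced by learned features with measured RMI scores, and the probe by the downstream model of interest.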
Meeting this goal would establish RMI as a useful measure of information compression in feature learning. That would warrant further research into estimating it efficiently, as well as into using it as an optimization objective for unsupervised feature learning or as a regularizer. It would also yield a new way to analyze the learning process of common training algorithms such as SGD and Adam.
Theoretically, it would be interesting to further develop the relation between mutual information and renormalized mutual information. The impact of dimensionality must also be investigated: for instance, it is not known whether, or how, RMI values can be compared across different dimensions.
The empirical part of the project hinges on the successful use of Parzen-Rosenblatt estimators, or generalizations thereof. Rudimentary results on the convergence of Parzen-Rosenblatt estimators for entropy estimation exist, but we would like results that can explain the faster convergence of methods like KNIFE. Further, since KNIFE is trained with stochastic gradient methods, its dynamics should be modeled and analyzed using SDE formulations and the perspective of interacting particle systems.
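A minimal illustration of the SDE viewpoint, on a toy quadratic loss rather than the actual KNIFE objective: SGD with step size eta coincides with the Euler-Maruyama discretization (dt = eta) of an Ornstein-Uhlenbeck SDE, and its iterates reproduce the SDE's stationary variance. All parameters below are illustrative.

```python
import numpy as np

rng = np.random.default_rng(2)

# SGD on L(t) = t^2 / 2 with additive gradient noise of variance sigma^2.
# For small eta, the iterates behave like the Ornstein-Uhlenbeck SDE
#   d theta = -theta dt + sigma * sqrt(eta) dW,
# whose stationary variance is eta * sigma^2 / 2.
eta, sigma, steps = 0.01, 1.0, 200_000

theta = 0.0
tail = []
for k in range(steps):
    noisy_grad = theta + sigma * rng.standard_normal()  # stochastic gradient
    theta -= eta * noisy_grad
    if k > steps // 2:  # discard burn-in, keep the stationary tail
        tail.append(theta)

empirical_var = float(np.var(tail))
predicted_var = eta * sigma ** 2 / 2  # stationary variance of the OU SDE
```

The interacting-particle perspective would enter when many such coupled parameters (e.g. the mixture components of KNIFE) are evolved jointly under a shared empirical loss.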