Graph-based, spatial, temporal and generative machine learning
Title: Graph-based, spatial, temporal and generative machine learning
DNr: Berzelius-2023-259
Project Type: LiU Berzelius
Principal Investigator: Fredrik Lindsten <fredrik.lindsten@liu.se>
Affiliation: Linköpings universitet
Duration: 2023-10-01 – 2024-04-01
Classification: 10201
Keywords:

Abstract

This is a joint proposal for 5 separate projects with the same PI. The projects are outlined below.

Discriminator Guidance for Autoregressive Diffusion Models

Diffusion models [1, 2] are a type of generative AI model that has recently attracted rapidly growing interest, in particular for image generation but also for other data modalities, and many extensions and improvements have been proposed. A recent work introduced “Discriminator Guidance” [3] as a way of improving diffusion models for continuous data: a model, called the discriminator, is trained to distinguish between real data and data generated by a pre-trained diffusion model. The discriminator is subsequently used in the generative process together with the pre-trained model, which is shown empirically to improve generative performance. However, diffusion models for discrete data are formulated very differently from the continuous case, so it is not straightforward to use discriminator guidance in the discrete case. In this project, we therefore set out to formulate discriminator guidance for discrete diffusion models, in particular Autoregressive Diffusion Models (ARDMs) [4].

Although this project is focused on method development and as such is rather general, the type of data we have in mind is graphs, and in particular molecular graphs. Generating new molecules is highly interesting for important applications such as drug discovery and materials science. We plan on publishing this work at a top-tier machine learning conference.
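As an illustration of the continuous-data form of discriminator guidance [3] that this project starts from, the following is a minimal sketch under strong simplifying assumptions: a single noise level, one-dimensional Gaussian densities standing in for the data and model distributions, and the optimal discriminator written in closed form instead of being trained. All function names are hypothetical. It shows that adding the gradient of log(d / (1 - d)) to the model score recovers the data score in this toy case; it says nothing about the discrete or autoregressive setting, which is the open question of the project.

```python
import math


def log_normal_pdf(x, mean, std=1.0):
    """Log-density of a 1D Gaussian N(mean, std^2) at x."""
    return -0.5 * ((x - mean) / std) ** 2 - math.log(std * math.sqrt(2.0 * math.pi))


def optimal_discriminator(x, data_mean=1.0, model_mean=0.0):
    """d*(x) = p_data(x) / (p_data(x) + p_model(x)), computed via its logit.

    Assumption: both densities are known unit-variance Gaussians; in practice
    the discriminator would be a trained network.
    """
    logit = log_normal_pdf(x, data_mean) - log_normal_pdf(x, model_mean)
    return 1.0 / (1.0 + math.exp(-logit))


def model_score(x, model_mean=0.0):
    """Score of the (imperfect) model: d/dx log N(x; model_mean, 1)."""
    return -(x - model_mean)


def guidance_term(x, eps=1e-5):
    """Central finite difference of log(d(x) / (1 - d(x)))."""

    def log_ratio(z):
        d = optimal_discriminator(z)
        return math.log(d) - math.log(1.0 - d)

    return (log_ratio(x + eps) - log_ratio(x - eps)) / (2.0 * eps)


def guided_score(x):
    """Model score corrected by the discriminator guidance term."""
    return model_score(x) + guidance_term(x)


def data_score(x, data_mean=1.0):
    """Score of the toy data distribution, used only for comparison."""
    return -(x - data_mean)
```

In this toy setting `guided_score(x)` agrees with `data_score(x)` up to finite-difference error, because the optimal discriminator satisfies d / (1 - d) = p_data / p_model.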
[1] Denoising Diffusion Probabilistic Models, Ho et al., NeurIPS 2020.
[2] Score-Based Generative Modeling through Stochastic Differential Equations, Song et al., ICLR 2021.
[3] Refining Generative Process with Discriminator Guidance in Score-based Diffusion Models, Kim et al., ICML 2023.
[4] Autoregressive Diffusion Models, Hoogeboom et al., ICLR 2022.

Graph-based Deep Weather Forecasting

In this project we investigate the use of deep learning for Numerical Weather Prediction (NWP). Recent works [1, 2] have shown that graph-based machine learning models can learn to approximate NWP systems with high accuracy while producing predictions in a fraction of the time used by the original system. When incorporating real observations, the deep learning models can even surpass the accuracy of the original NWP systems.

We have been working in collaboration with SMHI to develop a model capable of forecasting the weather in the Nordic region. A first publication on this work is currently being drafted. We have many ideas for improving this model and have received much valuable input from the meteorology community. One part of the continuation of this project is to further develop the model for the Nordic area and to show how machine learning methods can be made useful for local area weather forecasting. Another part is to further develop the machine learning methodology involved. Such extensions include more refined modelling of the temporal dependencies and improved methods for quantifying the uncertainty in predictions. We aim to publish results from this research in both the machine learning and the NWP literature.

[1] GraphCast: Learning skillful medium-range global weather forecasting, Lam et al., preprint, 2022.
[2] Forecasting Global Weather with Graph Neural Networks, Keisler, R., preprint, 2022.

Non-conjugate Deep GMRFs

In our previous work [1] we showed how the model class of Deep GMRFs [2] can be applied to general graph-structured data, capturing uncertainty through Bayesian inference. The approach is, however, limited to real-valued prediction targets with additive Gaussian noise. In this project we aim to enable the use of Deep GMRF models for new problems and new types of data. In the initial phase of this project we have developed techniques for performing probabilistic inference in these models. The next steps include extending the capabilities of the model and more extensive experimentation.

[1] Scalable Deep Gaussian Markov Random Fields for General Graphs, Oskarsson et al., ICML 2022.
[2] Deep Gaussian Markov Random Fields, Sidén and Lindsten, ICML 2020.

Understanding Epoch-wise Double Descent in Deep Neural Networks

In the classical bias-variance tradeoff, increasing the number of model parameters first reduces and then increases the expected risk, giving rise to a U-shaped risk curve. The traditional belief is that adding more model parameters beyond the point of overfitting leads to poor generalization. In contrast to this belief, however, evidence shows that when entering the overparameterized regime the risk can decrease yet again, giving rise to the so-called double descent pattern, see e.g. [1]. While the typical notion of double descent refers to this parameter-wise double descent, a similar pattern has also been observed as a function of the number of training epochs, denoted epoch-wise double descent. Simply put, when training an overparameterized model using an iterative training algorithm such as gradient descent, continuing training after the model has seemingly overfitted to the training data can eventually give rise to a model with better generalization properties.

While epoch-wise double descent has been studied in previous work, the underlying mechanisms of the phenomenon, and when and why it happens, are still not fully understood. Due to the high complexity of the training dynamics of deep neural networks, the theoretical study of epoch-wise double descent is often limited to standard linear regression models, with conclusions directly extrapolated to deeper models, see e.g. [2, 3, 4]. However, it is not yet established that the epoch-wise double descent pattern observed for linear regression models is the same as that observed for deeper models such as deep neural networks, i.e. whether the behavior and causes coincide. Other factors might affect double descent in deeper models, such as the interaction between different model layers. Indeed, for parameter-wise double descent, previous work indicates that there is a difference between the double descent pattern observed in linear regression and that observed in deeper models [5]. Hence, this is likely also the case for epoch-wise double descent.

In this project, we aim to further investigate epoch-wise double descent in deep neural networks, to better understand why it happens and what causes it. We also aim to understand how epoch-wise double descent in deeper models relates to the same phenomenon observed in standard linear regression, and to identify factors of double descent that are present in deeper models but not in the simpler model class. While a better understanding of double descent is interesting in itself, it can also be helpful in tasks such as model selection. For example, it can be used to determine when it is appropriate to apply traditional regularization techniques such as early stopping and when it is not. The project will include both theoretical and empirical analyses. We plan on publishing this work at a top-tier machine learning conference.
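The kind of measurement involved can be sketched for the standard linear regression setting mentioned above. The following is an illustrative sketch only, assuming a synthetic overparameterized problem (more features than training points), full-batch gradient descent, and hypothetical names and hyperparameters chosen for this example. It records the training and test risk after every epoch, which is the curve one would inspect for an epoch-wise double descent pattern; it makes no claim that the pattern appears for these particular settings.

```python
import random


def predict(w, x):
    """Linear prediction w^T x."""
    return sum(wi * xi for wi, xi in zip(w, x))


def make_data(n, w_true, noise_std, rng):
    """Synthetic regression data: Gaussian features, noisy linear targets."""
    X = [[rng.gauss(0.0, 1.0) for _ in w_true] for _ in range(n)]
    y = [predict(w_true, x) + rng.gauss(0.0, noise_std) for x in X]
    return X, y


def risk(w, X, y):
    """Mean squared error of w on the data set (X, y)."""
    return sum((predict(w, x) - yi) ** 2 for x, yi in zip(X, y)) / len(y)


def gd_risk_curves(X_tr, y_tr, X_te, y_te, lr, epochs):
    """Run full-batch gradient descent from w = 0 and record both risks per epoch."""
    n, d = len(y_tr), len(X_tr[0])
    w = [0.0] * d
    train_curve, test_curve = [], []
    for _ in range(epochs):
        residuals = [predict(w, x) - yi for x, yi in zip(X_tr, y_tr)]
        # Gradient of the mean squared error: (2 / n) * X^T (X w - y)
        grad = [2.0 / n * sum(r * x[j] for r, x in zip(residuals, X_tr)) for j in range(d)]
        w = [wi - lr * g for wi, g in zip(w, grad)]
        train_curve.append(risk(w, X_tr, y_tr))
        test_curve.append(risk(w, X_te, y_te))
    return train_curve, test_curve


# Hypothetical settings: 30 features, 20 training points (overparameterized).
rng = random.Random(0)
D, N_TRAIN, N_TEST, EPOCHS = 30, 20, 200, 200
w_true = [rng.gauss(0.0, 1.0) for _ in range(D)]
X_tr, y_tr = make_data(N_TRAIN, w_true, 0.5, rng)
X_te, y_te = make_data(N_TEST, w_true, 0.5, rng)

train_curve, test_curve = gd_risk_curves(X_tr, y_tr, X_te, y_te, lr=0.01, epochs=EPOCHS)

# Epoch at which early stopping on the test curve would halt training.
best_epoch = min(range(EPOCHS), key=test_curve.__getitem__)
```

Comparing `best_epoch` with the shape of `test_curve` after that point is what distinguishes a single U-shaped curve, where early stopping is appropriate, from a curve with a second descent.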
[1] Belkin, M., Hsu, D., Ma, S., & Mandal, S. Reconciling modern machine-learning practice and the classical bias–variance trade-off. Proceedings of the National Academy of Sciences, 116(32), 15849-15854, 2019.
[2] Nakkiran, P., Kaplun, G., Bansal, Y., Yang, T., Barak, B., & Sutskever, I. Deep double descent: Where bigger models and more data hurt. Journal of Statistical Mechanics: Theory and Experiment, 2021(12), 124003, 2021.
[3] Heckel, R., & Yilmaz, F. F. Early stopping in deep networks: Double descent and how to eliminate it. In: International Conference on Learning Representations, 2021.
[4] Stephenson, C., & Lee, T. When and how epochwise double descent happens. arXiv preprint arXiv:2108.12006, 2021.
[5] Pezeshki, M., Mitra, A., Bengio, Y., & Lajoie, G. Multi-scale Feature Learning Dynamics: Insights for Double Descent. In: International Conference on Machine Learning, 2022.

Towards Better Understanding and Training of Generative Flow Networks

Generative Flow Networks (GFlowNets) [1] were proposed to generate a diverse set of composite objects through a sequence of constructive actions, with a probability approximately proportional to a given positive reward function. By exchanging the complexity of sampling through long MCMC chains for the complexity of training a generative policy, GFlowNets can take advantage of generalizable structure to make reasonable guesses about yet-unvisited modes. While reward maximization with appropriate entropy regularization can sample proportionally to the target reward function, it works only when there is a single path leading to each state. GFlowNets address this by taking advantage of flow networks to convert the Markov decision process into a directed acyclic graph containing all possible trajectories for constructing objects from scratch. This makes it possible to handle more general cases, where there are multiple paths leading to the same state.

The crucial theorem in [1] states that if a flow function can be trained to satisfy flow-matching (FM) constraints, the learned policy of the GFlowNet samples objects with probability proportional to the target reward function. An equivalent objective based on detailed-balance (DB) constraints [2] was proposed to avoid summing over many preceding states and to allow generalization to continuous spaces. Furthermore, a global objective based on trajectory-balance (TB) constraints [3] was proposed, which is computed over a complete trajectory and thus provides direct credit assignment to all the states visited in that trajectory. More recently, a convex combination of all the sub-trajectory balance (SubTB) constraints [4] was proposed to learn from incomplete trajectories of variable length.

GFlowNets trained with the TB objective can yield efficient credit assignment, but they have been shown to suffer from high variance. This can be alleviated by introducing a state flow function, giving rise to both the DB and SubTB objectives, which learn to satisfy local balance constraints within complete trajectories. However, it remains unclear why the state flow function yields lower-variance gradient estimates. We hypothesize that it provides a mechanism for deriving local and asynchronous gradient estimates, thus reducing the reliance on a full model evaluation. In this project, we plan to characterize the role of the state flow function. With such an understanding, we can determine how and under which conditions it can be applied to obtain both lower-variance and local estimates of the model gradients, thus providing opportunities to develop a more general and efficient credit assignment mechanism in GFlowNets. We plan to publish a paper at a top-tier machine learning conference.

[1] Flow Network based Generative Models for Non-Iterative Diverse Candidate Generation. Emmanuel Bengio et al. NeurIPS, 2021.
[2] GFlowNet Foundations. Yoshua Bengio et al. JMLR, 2023.
[3] Trajectory Balance: Improved Credit Assignment in GFlowNets. Nikolay Malkin et al. NeurIPS, 2022.
[4] Learning GFlowNets from Partial Episodes for Improved Convergence and Stability. Kanika Madan et al. ICML, 2023.
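
To make the trajectory-balance (TB) constraint of the GFlowNets project concrete, here is a minimal sketch on a hand-built toy problem. The DAG, rewards, and forward and backward policies below are hypothetical values chosen for illustration, not learned quantities. The TB residual for a complete trajectory is log Z plus the sum of forward log-probabilities, minus the log-reward of the final object and the sum of backward log-probabilities. When it is zero on every trajectory, the forward policy samples each object with probability proportional to its reward, even though object x is reachable along two different paths.

```python
import math

# Toy DAG (hypothetical): s0 -> a -> x, s0 -> b -> x, s0 -> b -> y.
REWARD = {"x": 2.0, "y": 2.0}
Z = 4.0  # total flow, equal to the sum of rewards

# Forward policy P_F(child | parent), keyed by (parent, child).
P_F = {
    ("s0", "a"): 0.25,
    ("s0", "b"): 0.75,
    ("a", "x"): 1.0,
    ("b", "x"): 1.0 / 3.0,
    ("b", "y"): 2.0 / 3.0,
}

# Backward policy P_B(parent | child), keyed by (child, parent).
P_B = {
    ("a", "s0"): 1.0,
    ("b", "s0"): 1.0,
    ("x", "a"): 0.5,
    ("x", "b"): 0.5,
    ("y", "b"): 1.0,
}

TRAJECTORIES = [("s0", "a", "x"), ("s0", "b", "x"), ("s0", "b", "y")]


def trajectory_balance_loss(traj, log_z):
    """Squared TB residual for one complete trajectory."""
    steps = list(zip(traj, traj[1:]))
    log_pf = sum(math.log(P_F[(parent, child)]) for parent, child in steps)
    log_pb = sum(math.log(P_B[(child, parent)]) for parent, child in steps)
    residual = log_z + log_pf - math.log(REWARD[traj[-1]]) - log_pb
    return residual ** 2


def terminal_probability(obj):
    """Probability that the forward policy ends in obj, summed over all paths."""
    return sum(
        math.prod(P_F[step] for step in zip(traj, traj[1:]))
        for traj in TRAJECTORIES
        if traj[-1] == obj
    )
```

With these values `trajectory_balance_loss(traj, math.log(Z))` is zero, up to rounding, for all three trajectories, and `terminal_probability` returns R(x) / Z and R(y) / Z. Passing a mismatched `log_z` gives a strictly positive loss, which is the signal a TB-trained model would use to adjust its parameters.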