Generative deep learning for data-centric machine learning
Title: Generative deep learning for data-centric machine learning
DNr: Berzelius-2024-211
Project Type: LiU Berzelius
Principal Investigator: Gabriel Eilertsen
Affiliation: Linköpings universitet
Duration: 2024-05-16 – 2024-12-01
Classification: 10201


Machine learning, especially deep learning, has made substantial progress over the last decade. However, deep learning is data-hungry, so the full potential of a model is often limited by a lack of data. This problem is especially pronounced in medical imaging, where data is expensive to capture, requires medical expertise for annotation, and is sensitive and protected. Synthetically generated images can improve image-based deep learning applications, both by increasing the amount of training data and by ensuring that different types of image content are included. Traditionally, computer graphics has been used for this purpose, but this requires modeling of the image content. While this can in many cases be accomplished for natural images, it is difficult to model the complex biological content depicted in medical images. An alternative solution is to use deep learning for automatic generation of new image content, by means of generative adversarial networks (GANs) or generative diffusion models (GDMs). Over the last few years, research on generative deep learning has progressed to the point that photo-realistic images can be generated. However, GANs and GDMs have mostly been used to generate 2D images of limited resolution. In medical imaging, data modalities can be more challenging, such as 3D volumes in radiology or gigapixel whole slide images (WSIs) in digital pathology. Furthermore, it is difficult to precisely control the content produced by generative models. This project aims to combine computer graphics and generative deep learning in order to produce high-quality synthetic image datasets with detailed control over the image content. The overarching goal is a data-centric perspective on deep learning, where generated content can improve performance and robustness in limited-data scenarios and aid in analyzing model performance under different types of variation.
Additional applications of generated images are also highly relevant to the project, such as out-of-distribution detection and examining the fairness of generated image distributions.