Enabling Document Image Generation with Diffusion Models
Title: Enabling Document Image Generation with Diffusion Models
DNr: Berzelius-2024-135
Project Type: LiU Berzelius
Principal Investigator: Hamam Mokayed <hamam.mokayed@ltu.se>
Affiliation: Luleå tekniska universitet
Duration: 2024-03-28 – 2024-10-01
Classification: 20206


With the advent of digitization, the automated processing and analysis of document images have emerged as critical tasks, especially given the complex, multimodal, and multilingual nature of modern and historical document datasets. These datasets present unique challenges due to inherent variations in background, writing or font style, layout complexity, and the conditions under which the documents have been preserved. Traditional data collection methods are time-consuming and expensive, and they often fail to provide the varied and extensive annotations required to train effective deep learning models for comprehensive document image processing, analysis, and understanding tasks, such as those used in reading systems. This project aims to employ state-of-the-art techniques based on diffusion models for handwriting and historical document image generation and enhancement. This approach will enable the training of deep learning models for a wide range of downstream document analysis tasks, such as handwritten text recognition, keyword spotting, and writer style identification. By leveraging these methods, we can address the complex and unique variability of handwriting and historical documents, enabling controllable text generation systems that create high-quality and diverse samples. Controllability is crucial because it allows the generation of text that meets specific writing style, scale, and layout requirements. This not only improves the practical applicability, functionality, and personalization of the systems but also provides the ground truth for the generated samples, paving the way for large-scale synthetic datasets for Document Image Analysis. Overall, this project seeks to bridge the gap between the capabilities of recent deep generative models and the urgent need for extensive, annotated training datasets.
We intend to train and optimize diffusion models, which are computationally intensive due to their iterative nature and complex architecture. Given the multimodal nature of document images, both Computer Vision and Large Language Models will be used to create appropriate embedding spaces for each modality present in the documents. Furthermore, we aim to leverage pre-training on large datasets of text images, which must be kept at specific scales to avoid distorting the text or degrading its quality. To assess generation quality, we will conduct extensive evaluation experiments on downstream tasks that incorporate the generated samples. Additionally, we plan to introduce and benchmark a large-scale synthetic dataset at the million-sample scale, built with our proposed methods, to demonstrate their effectiveness and applicability to Document Image Analysis tasks and to advance the field. We aim to submit the findings of this project to top-tier conferences and journals.
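To illustrate why the iterative nature of diffusion models drives their computational cost, the following is a minimal sketch of DDPM-style ancestral sampling. It is not the project's model: `make_schedule`, `CountingDenoiser`, `sample`, the linear noise schedule, the image shape, and the step count are all illustrative assumptions, and the denoiser is a placeholder that returns zeros where a trained network would predict noise. The point is only that generating one sample requires one network evaluation per denoising step.

```python
import numpy as np


def make_schedule(num_steps, beta_start=1e-4, beta_end=0.02):
    """Linear noise schedule (an assumed, commonly used default)."""
    betas = np.linspace(beta_start, beta_end, num_steps)
    alphas = 1.0 - betas
    alpha_bars = np.cumprod(alphas)
    return betas, alphas, alpha_bars


class CountingDenoiser:
    """Hypothetical stand-in for a trained noise-prediction network."""

    def __init__(self):
        self.calls = 0

    def __call__(self, x, t):
        self.calls += 1
        # A real model would predict the noise contained in x at step t.
        return np.zeros_like(x)


def sample(denoiser, shape, num_steps, rng):
    """Generate one sample by denoising pure noise step by step."""
    betas, alphas, alpha_bars = make_schedule(num_steps)
    x = rng.standard_normal(shape)  # start from Gaussian noise
    for t in reversed(range(num_steps)):
        eps_hat = denoiser(x, t)  # one network evaluation per step
        mean = (x - betas[t] / np.sqrt(1.0 - alpha_bars[t]) * eps_hat) / np.sqrt(alphas[t])
        # No fresh noise is added at the final step (t == 0).
        noise = rng.standard_normal(shape) if t > 0 else np.zeros(shape)
        x = mean + np.sqrt(betas[t]) * noise
    return x


denoiser = CountingDenoiser()
image = sample(denoiser, shape=(64, 256), num_steps=50, rng=np.random.default_rng(0))
print(image.shape, denoiser.calls)  # (64, 256) 50
```

With a real network in place of the placeholder, the cost of producing a dataset scales roughly as the number of samples times the number of denoising steps times the cost of one forward pass, which is why large-scale generation is demanding even after training is complete.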