Document Analysis
Title: Document Analysis
DNr: Berzelius-2024-82
Project Type: LiU Berzelius
Principal Investigator: Elisa Hope Barney Smith
Affiliation: Luleå tekniska universitet
Duration: 2024-02-27 – 2024-09-01
Classification: 10207


This proposal addresses text recognition across four related projects.

(1) Handwriting recognition: We will explore several ideas in this area. Handwriting recognition is the task of associating text with a written trace representing a letter, a word, a sentence, or a paragraph. It remains challenging because writing styles differ from person to person and because each language brings its own recognition difficulties. Current state-of-the-art approaches use deep learning models that require substantial computing resources for training. This project aims to design an efficient handwriting recognition model by extending current state-of-the-art methods (Convolutional Recurrent Neural Network, Vertical Attention Network …) with new attention mechanisms and multi-task learning methods. In addition to handwriting in multiple scripts, we are interested in handwriting that contains erasures, subtexts, and overwrites. We will start with essays in Brazilian Portuguese and then move on to handwriting in other languages. The expected outcome at the end of the project is improved handwriting recognition performance, especially for text containing erasures.

(2) Open-set text recognition: Open-set text recognition aims to transcribe samples that may contain unknown characters from various languages, and it faces the challenge of diverse writing directions, styles, spacing, and writing systems. Existing methods tend either to train one model for all scripts or to use a dedicated model for each language or trait. Encouraged by the recent success of dynamic routing in the Natural Language Processing (NLP) field, Project watch-and-act aims to systematically implement and compare various routing approaches on text recognition tasks. Note that at stage 0 it does not seek full control from the LLM.

(3) Ensemble learning: Ensemble learning has been highly successful in past text recognition and detection efforts. While large-scale ensemble methods are slow to execute, they serve as good pseudo-label annotators. However, many existing ensemble methods in text detection and recognition vote on the results alone, ignoring the visual cues in the image. To overcome this limitation, we propose to use a multimodal language model (MLM) to assist the ensembling process by selecting which models to use and which results to take based on a thumbnail of the input image.

(4) VQA project: Recent MLMs have shown significant power in understanding text and visual concepts; however, they do not yet have a strong capability for detailed OCR, which limits their ability to perform detailed and precise document VQA (docVQA). On the other hand, SOTA docVQA methods such as HiVT5 have strictly separated OCR and reasoning modules, which makes them potentially prone to error accumulation. Hence, we propose to frame docVQA as a multi-step decision-making process in which text detection, recognition, image feature extraction, and answering are all potential actions of an LLM. This approach allows the model to focus on and read question-related regions while allocating fewer computational resources to other regions (by using coarser patches), which should achieve a better balance between computational cost and performance.
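
For reference, the result-only voting that item (3) describes as the common baseline can be sketched as follows. This is a minimal illustration, not the method of any specific system: the function name `vote_transcription`, the exact-string majority rule, and the tie-breaking rule are all assumptions made for the example. The point it shows is that such a vote sees only the output strings and never the input image.

```python
from collections import Counter


def vote_transcription(candidates):
    """Pick the transcription proposed by the most recognizers.

    `candidates` holds one output string per recognizer (hypothetical
    setup). Ties are broken in favor of the string that appears first.
    Returns the winning string and its share of the votes.
    """
    if not candidates:
        raise ValueError("need at least one candidate transcription")
    # Counter keeps first-insertion order, and max() returns the first
    # maximal item, so the earliest candidate wins a tie.
    counts = Counter(candidates)
    best, votes = max(counts.items(), key=lambda item: item[1])
    return best, votes / len(candidates)


if __name__ == "__main__":
    # Three hypothetical recognizers, two of which agree.
    outputs = ["recognition", "recogmtion", "recognition"]
    text, agreement = vote_transcription(outputs)
    print(text, round(agreement, 2))
```

Because the vote depends only on agreement between outputs, a wrong reading shared by most recognizers always wins, whatever the image looks like. That is the gap the image-conditioned selection proposed in item (3) is meant to address.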