Document Analysis
Title: Document Analysis
DNr: Berzelius-2025-285
Project Type: LiU Berzelius
Principal Investigator: Elisa Hope Barney Smith <elisa.barney@ltu.se>
Affiliation: Luleå tekniska universitet
Duration: 2025-09-26 – 2026-04-01
Classification: 10207
Homepage: https://www.ltu.se/en/research/research-subjects/machine-learning
Keywords:

Abstract

This proposal addresses text recognition across several related projects.

(1) Handwritten text recognition (HTR) remains challenging due to the variety of writing styles and languages. The project aims to (1a) extend the current state-of-the-art HTR method (a Convolutional Recurrent Neural Network) with multi-task learning. We expect this to improve performance on modern and historical text, especially in low-resource scenarios with little training data. We will scale recognition from line level to page level using line segmentation, which will involve multiple techniques. We will improve Hi-SAM for historical documents by replacing its image binarization component, which performs poorly on historical data, and we will adapt and extend the Hi-SAM model for few-shot and zero-shot learning so that it performs robustly with minimal annotated data. (1b) We will also improve HTR for Arabic and other languages that use the Arabic script (e.g. Persian and Urdu). Challenges include the cursive, context-dependent nature of the script, the critical role of diacritic dots in distinguishing letters, and the frequent presence of missing, merged, or degraded dots in handwriting. We will combine systematic error analysis with the development of advanced recognition models.

(2) Open-set text recognition (OSTR) entails two additional capabilities: (i) rejecting unseen characters (those not in the training data) from the data stream, and (ii) recognizing such a character once a template is given. Our earlier models can recognize minor scripts that Gemini (by Google) cannot read, such as the Glagolitic and Yi scripts. We plan to exploit multistep routing, which sends input data to different encoders, decoders, and classification heads. Later we will shift the core network from DAN to a modified Gemma3-4b-it. We aim to merge this project into the VQA project as a downstream task.

(3) Visual Question Answering (VQA) originally meant querying specific information about an image, but it now also covers many vision tasks such as OCR and object detection. We plan to modify Gemma3-4b-it to accept external knowledge through prompt fine-tuning, a data-driven patch sampler through STN, and zero-shot generated token/embedding pairs through the open-set classifier from the OSTR project. The target is page-level OCR tasks, including Norrhandv3 and later some zero-shot and open-set data. The modified Gemma3 is currently partially working and is being tested on a local server. We will move to Berzelius when it is ready for large-scale training.

(4) Authorship Analysis (AAnal) examines the characteristic features of a piece of text to determine its authorship. Rooted in stylometry, AAnal includes author attribution, author verification, author profiling, and authorship detection. We plan to expand author attribution and verification to prominent authors from previous centuries, taking into account authorial style, linguistic features, vocabulary, and other relevant characteristics. The approach should also allow comparison of authors in general historical documents. We will use machine learning (ML), deep learning (DL) models, and large language models (LLMs) to analyze texts and to identify the authors and scribes of unknown manuscripts. This will address open questions in author attribution, specifically for historical documents.
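
As an illustration of the author attribution and author verification tasks named in project (4), the following is a minimal stylometric baseline sketch, not the method proposed here. It assumes character trigram frequency profiles compared by cosine similarity; the function names, the trigram choice, and the 0.5 threshold are hypothetical and would need tuning on real historical data.

```python
import math
from collections import Counter


def char_ngrams(text, n=3):
    """Relative frequencies of character n-grams (lowercased, whitespace collapsed)."""
    text = " ".join(text.lower().split())
    counts = Counter(text[i:i + n] for i in range(len(text) - n + 1))
    total = sum(counts.values())
    return {gram: c / total for gram, c in counts.items()} if total else {}


def cosine(p, q):
    """Cosine similarity between two sparse frequency profiles (0.0 if either is empty)."""
    dot = sum(v * q.get(gram, 0.0) for gram, v in p.items())
    norm_p = math.sqrt(sum(v * v for v in p.values()))
    norm_q = math.sqrt(sum(v * v for v in q.values()))
    return dot / (norm_p * norm_q) if norm_p and norm_q else 0.0


def attribute(unknown, known_texts, n=3):
    """Attribution: pick the known author whose profile is closest to the unknown text.

    known_texts maps author name -> reference text and must not be empty.
    Returns (best_author, {author: similarity}).
    """
    profile = char_ngrams(unknown, n)
    scores = {author: cosine(profile, char_ngrams(text, n))
              for author, text in known_texts.items()}
    return max(scores, key=scores.get), scores


def verify(unknown, candidate_text, threshold=0.5, n=3):
    """Verification: accept the candidate author if similarity reaches the threshold.

    The threshold value is an assumption made for this sketch.
    """
    similarity = cosine(char_ngrams(unknown, n), char_ngrams(candidate_text, n))
    return similarity >= threshold
```

Calling `attribute(manuscript, {"Author A": text_a, "Author B": text_b})` returns the closest author together with all similarity scores, while `verify(manuscript, text_a)` answers the yes/no question for a single candidate. A learned model would replace the trigram profile with richer features, but the attribution and verification interfaces stay the same.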