Training large-scale language models for controllable text generation in Swedish
The ultimate goal of the larger project (Quizmaster, linked above) is to generate multiple-choice reading comprehension questions in Swedish for adult learners of the language. This requires texts of suitable complexity and genre. For copyright reasons, such texts are not readily available, so we opt to generate them with Transformer-based language models.
To achieve this goal, we need to train a large-scale language model capable of controllable text generation in Swedish (and possibly in other languages besides English). The architecture we are aiming for is largely inspired by the CTRL language model (https://arxiv.org/pdf/1909.05858.pdf). We aim to train on the Swedish part of the publicly available mC4 corpus, texts from Project Gutenberg, and SweQUAD-MC (also publicly available).
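
As a rough illustration of what CTRL-style controllability means in practice: CTRL conditions generation on a control code placed at the start of each sequence, so the same convention is used when preparing training text and when building a prompt. The minimal Python sketch below shows only that data-preparation step. The control-code strings, the source keys and the function names are hypothetical and chosen for illustration; the project description does not specify which codes will be used.

```python
from dataclasses import dataclass

# Hypothetical control codes, one per training source named above.
# The actual codes for this project are not specified; these are placeholders.
CONTROL_CODES = {
    "mc4_sv": "<web>",
    "gutenberg": "<books>",
    "swequad_mc": "<quiz>",
}


@dataclass
class Example:
    source: str  # key into CONTROL_CODES
    text: str    # raw Swedish text


def add_control_code(example: Example) -> str:
    """Prepend the source's control code to a training text."""
    code = CONTROL_CODES[example.source]
    return f"{code} {example.text}"


def build_prompt(source: str, prefix: str = "") -> str:
    """Build a generation prompt: control code plus optional text prefix."""
    code = CONTROL_CODES[source]
    return f"{code} {prefix}".rstrip()


if __name__ == "__main__":
    print(add_control_code(Example("gutenberg", "Det var en gång")))
    print(build_prompt("swequad_mc", "Stockholm är"))
```

At generation time, choosing a different control code is what steers the genre or style of the output, assuming the model has learned to associate each code with its source during training.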