SELLMA - The Swedish Large Language Model Arena
Abstract
The Swedish LLM Arena (SELLMA) is a project funded by KAW through WASP and WARA-TRICS that will establish a national capability for fine-tuning, evaluating, and governing large language models for the Swedish language, society, and culture. LLM use in Sweden currently depends primarily on commercial systems from non-European providers. While these are strong general-purpose models, they offer limited transparency, create vendor lock-in, transfer sensitive interaction data outside Europe, and are not optimized for Swedish linguistic nuance, institutional context, political culture, media history, or public-sector usage. SELLMA addresses this gap by building a sustainable Swedish pipeline for adapting open-weight models using both high-quality open Swedish data and legally governed copyrighted or closed Swedish data from publishers and other rightsholders.
The project will focus on continued pre-training, supervised fine-tuning, preference optimization, benchmarking, and validation of Swedish LLMs. We will target both smaller models, such as 8B-parameter systems suitable for rapid experimentation and efficient inference, and larger models of roughly 70-120B parameters, where Swedish adaptation can improve language quality, reasoning, and contextual performance. The goal is not to train a foundation model from scratch, but to use industrial-scale GPU resources to adapt strong existing models with Swedish data and systematic evaluation.
A central scientific and societal contribution of SELLMA is the development of legally safe and reproducible workflows for using Swedish data. Open sources will include Swedish Wikipedia, Riksdag open data, public reports, cultural archives, scientific repositories, and other free-to-use corpora. In parallel, the project will work with publishers, media companies, and other rightsholders to enable controlled use of closed copyrighted Swedish material, including newspapers, books, magazines, broadcast archives, and other high-value sources. These data are essential for capturing modern Swedish public discourse, idiom, cultural references, institutions, and historical context that are poorly represented in generic multilingual training sets.
The requested NAISS/Berzelius resources are needed for the compute-intensive parts of this workflow: large-scale data preparation, tokenizer and corpus validation, LoRA infrastructure tests, continued pre-training on multi-billion-token Swedish corpora, supervised fine-tuning, DPO/alignment experiments, inference benchmarking, ablations over data mixtures, and repeated evaluation of both small and large models.
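To give a rough sense of scale for the continued pre-training step, the sketch below applies the widely used approximation that training a dense transformer costs about 6 x parameters x tokens FLOPs. Every numeric input is an illustrative assumption rather than a project figure: the 10B-token corpus size, the per-GPU peak throughput, and the sustained utilization are all hypothetical placeholders, and the estimate covers full-parameter training only, not data preparation, evaluation, or ablations.

```python
# Back-of-envelope training compute, using the common approximation
# FLOPs ~= 6 * parameters * tokens for dense transformer training.
# All numeric inputs below are illustrative assumptions, not project figures.


def training_flops(n_params: float, n_tokens: float) -> float:
    """Approximate total training FLOPs for a dense transformer."""
    return 6.0 * n_params * n_tokens


def gpu_hours(flops: float, peak_flops_per_gpu: float, utilization: float) -> float:
    """Convert total FLOPs to GPU-hours at a given sustained utilization."""
    if not 0.0 < utilization <= 1.0:
        raise ValueError("utilization must be in (0, 1]")
    return flops / (peak_flops_per_gpu * utilization) / 3600.0


ASSUMED_TOKENS = 10e9        # hypothetical corpus size (10B tokens)
ASSUMED_PEAK = 312e12        # assumed bf16 dense peak FLOP/s per GPU (A100-class)
ASSUMED_UTILIZATION = 0.4    # assumed sustained fraction of peak throughput

# Model sizes taken from the abstract: 8B, and the 70-120B range endpoints.
for n_params in (8e9, 70e9, 120e9):
    flops = training_flops(n_params, ASSUMED_TOKENS)
    hours = gpu_hours(flops, ASSUMED_PEAK, ASSUMED_UTILIZATION)
    print(f"{n_params / 1e9:.0f}B params: {flops:.2e} FLOPs, ~{hours:,.0f} GPU-hours")
```

Because the cost is linear in both parameters and tokens under this approximation, the same functions can be reused to compare candidate data-mixture ablations by varying the token count per run.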
The expected outcomes are trained Swedish LLM checkpoints, documented training and evaluation pipelines, benchmark suites for Swedish language and cultural competence, and operational expertise for future national LLM efforts. SELLMA will also establish governance models for licensing, rightsholder compensation, academic use, commercial exploitation, and long-term maintenance. The project therefore directly supports Swedish AI sovereignty, strengthens national competence in large-scale AI training, and creates reusable infrastructure for academia, industry, media, public agencies, and future European collaborations.