Title: Growing and Looping for Vision Foundation Models
DNr: Berzelius-2026-119
Project Type: LiU Berzelius
Principal Investigator: Karl Henrik Johansson <kallej@kth.se>
Affiliation: Kungliga Tekniska högskolan
Duration: 2026-04-01 – 2026-10-01
Classification: 10201
Homepage: https://www.kth.se/profile/amkm
Keywords:

Abstract

Recent work on large language models has renewed interest in an older idea in deep learning: instead of training a large model in its full form from the beginning, one can start with a smaller model and gradually grow it during training, or loop a smaller set of layers multiple times. These ideas were initially motivated by computational efficiency, but more recently they have also been connected to the more fundamental question of how transformer models use their depth. In language models, several studies have shown that very deep models do not always use all layers efficiently, and that later layers often contribute only limited refinements to the final prediction, a phenomenon sometimes referred to as the Curse of Depth (Sun et al., 2025; Csordás et al., 2025). Related work suggests that similar behavior may also appear in vision transformers, where layers may exhibit repeated or phase-like computations rather than fully distinct processing at every depth (Jacobs et al., 2025; Park et al., 2023; Bhojanapalli et al., 2021; Jiang et al., 2025). In this project, we will study growing and looping (Kapl et al., 2025; Saunshi et al., 2024; Shu et al., 2026) for vision, with a particular focus on modern transformer-based visual models, especially self-supervised ones. Our goal is to understand whether the main observations from language models also hold in vision, and if so, in what form. In particular, we will investigate how growing or looping affects training efficiency, depth utilization, representation learning, and downstream performance across tasks of different granularity, from global classification to dense prediction and multimodal evaluation. We will study representative training objectives including self-distillation, contrastive learning, masked image modeling, and related SSL setups, and analyze multiple growth axes such as depth, width, resolution, scheduling strategy, and modality. 
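The two mechanisms under study can be illustrated with a minimal numerical sketch (all names, shapes, and initializations here are illustrative assumptions, not part of any planned implementation): a standard stacked model applies a distinct block at every depth, a looped model unrolls one weight-tied block several times, and depth growth inserts new identity-initialized blocks so that the function computed by the network is unchanged at the moment of growth.

```python
import numpy as np

rng = np.random.default_rng(0)

def make_block(d, identity=False):
    """A toy residual block x -> x + W2 @ relu(W1 @ x).
    With identity=True, W2 is zero, so the block is the identity map;
    zero-initializing the new branch is a common way to grow depth
    mid-training without perturbing the current function."""
    W1 = rng.normal(0.0, 0.02, (d, d))
    W2 = np.zeros((d, d)) if identity else rng.normal(0.0, 0.02, (d, d))
    return lambda x: x + W2 @ np.maximum(W1 @ x, 0.0)

def stacked(blocks, x):
    # Standard depth: a distinct block at every layer.
    for b in blocks:
        x = b(x)
    return x

def looped(block, x, n_loops):
    # Looping: one shared block unrolled n_loops times (weights tied across depth).
    for _ in range(n_loops):
        x = block(x)
    return x

d, depth = 16, 12
x = rng.normal(size=d)

# Same effective depth; the looped model has 1/depth the block parameters.
y_stacked = stacked([make_block(d) for _ in range(depth)], x)
y_looped = looped(make_block(d), x, depth)

# Growing: appending an identity-initialized block leaves the output unchanged.
blocks = [make_block(d) for _ in range(3)]
y_before = stacked(blocks, x)
blocks.append(make_block(d, identity=True))
y_after = stacked(blocks, x)
assert np.allclose(y_before, y_after)
```

The same looping and identity-growth ideas carry over to transformer blocks; this sketch only makes the parameter-sharing and function-preserving-growth distinction concrete.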
More broadly, this project aims to assess whether growing and looping can serve not only as tools for efficiency, but also as useful inductive biases for visual representation learning. This proposal requests an extension of our existing Berzelius allocation. While the previous main project on part–whole hierarchies was followed up only briefly in the last period, the present extension will primarily support the new main project on growing and looping for vision, together with the continuation of the longer-term binder-design side project.

Side Project

Alongside the main project, we will continue the side project on protein–protein binder design that explicitly targets high-affinity binders. High-affinity protein–protein binders are crucial for real applications such as therapeutics and diagnostics, but today's pipelines rarely optimize directly for affinity: "success" is typically defined as obtaining a binder at all, while binding affinity is only weakly considered or neglected altogether. Furthermore, common in-silico scores such as iPTM, pLDDT, and iPAE do not predict binding affinity well, and available wet-lab binding datasets remain scarce (Danneskiold-Samsøe et al., 2024). Current methods, including iterative hallucination approaches such as BindCraft (Pacesa et al., 2024), diffusion-based design combined with inverse folding (Hayes et al., 2025), and co-generation all-atom models such as Latent-X and related approaches (Chen et al., 2025; Geffner et al., 2025), are promising, but they either do not yet design consistently high-affinity binders or lack open, affinity-focused validation. In this project, we will continue curating a dataset of wet-lab-validated binders with relative binding-affinity labels, and use it to steer a co-generation all-atom model that jointly designs binder sequence and structure, using supervised fine-tuning and, where beneficial, reinforcement-learning-style objectives.
This side project is a longer-term collaborative effort together with collaborators at TU Munich.