Self-supervised learning-Robust Foundation Models against Robustness and Corruptions
Title: Self-supervised learning-Robust Foundation Models against Robustness and Corruptions
DNr: Berzelius-2024-47
Project Type: LiU Berzelius
Principal Investigator: Rajkumar Saini
Affiliation: Luleå tekniska universitet
Duration: 2024-02-28 – 2024-09-01
Classification: 10207


This project aims to improve the robustness and generalization of foundational vision models across a diverse range of tasks. At its core is a novel training paradigm that systematically incorporates both distortions [1] (such as radial and perspective transformations) and various forms of corruption into the self-supervised learning process. The underlying hypothesis is that models exposed to a wider spectrum of visual challenges during training develop more versatile and resilient feature representations. The approach is guided by our recent theoretical findings, which suggest that a model’s ability to understand and adapt to such distortions and corruptions is crucial for its generalizability and robustness in real-world scenarios, where data often deviates from idealized conditions. The project will empirically validate these theoretical insights by testing models trained under the new paradigm on standard benchmarks and application-specific tasks. The expected outcome is substantially improved performance and reliability of computer vision models, especially in uncontrolled and variable environments, as a contribution to self-supervised learning in computer vision.

Self-supervised methods considered:
• I-JEPA [2]: The Image-based Joint-Embedding Predictive Architecture (I-JEPA) is a state-of-the-art self-supervised learning method that learns image representations by predicting, in embedding space, the representations of target regions of an image from a context region of the same image.
• SimCLR (Contrastive Learning) [3]: SimCLR, a Simple Framework for Contrastive Learning of Visual Representations, learns visual representations by maximizing agreement between differently augmented views of the same image in the latent space.
• DINO (Knowledge Distillation) [4]: DINO, short for self-distillation with no labels, trains a student network to match the output of a teacher network, both of which are Vision Transformers (ViTs), without using labelled data.
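
As an illustration of the contrastive objective mentioned for SimCLR, the following is a minimal sketch of the NT-Xent (normalized temperature-scaled cross-entropy) loss in plain Python. It is not the project's implementation: the function names `cosine` and `nt_xent`, the toy embeddings, and the temperature value are assumptions chosen for illustration. In the setting described above, the two views of each image could be distorted or corrupted versions of it passed through an encoder, which is omitted here.

```python
import math


def cosine(u, v):
    """Cosine similarity of two non-zero vectors given as lists of floats."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def nt_xent(z1, z2, temperature=0.5):
    """NT-Xent loss for a batch of paired embeddings.

    z1[i] and z2[i] are embeddings of two augmented views of image i.
    Every other embedding in the batch acts as a negative for view i.
    """
    z = z1 + z2                      # 2N embeddings in total
    n = len(z1)
    total = 0.0
    for i in range(2 * n):
        pos = (i + n) % (2 * n)      # index of the other view of the same image
        # Similarities to all embeddings except the anchor itself.
        logits = [cosine(z[i], z[j]) / temperature
                  for j in range(2 * n) if j != i]
        pos_logit = cosine(z[i], z[pos]) / temperature
        # Numerically stable log-sum-exp over the denominator terms.
        m = max(logits)
        log_denominator = m + math.log(sum(math.exp(l - m) for l in logits))
        total += log_denominator - pos_logit
    return total / (2 * n)


# Toy example: two images, 2-D embeddings (hypothetical values).
views_a = [[1.0, 0.0], [0.0, 1.0]]
views_b_aligned = [[1.0, 0.0], [0.0, 1.0]]   # each view matches its partner
views_b_swapped = [[0.0, 1.0], [1.0, 0.0]]   # each view matches the wrong image

loss_aligned = nt_xent(views_a, views_b_aligned)
loss_swapped = nt_xent(views_a, views_b_swapped)
print(loss_aligned < loss_swapped)
```

The loss is lower when the two views of the same image map to similar embeddings and higher when they do not, which is the sense in which the objective "maximizes agreement" between views.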