Self-supervised Transformer-based Representation Learning for Autonomous Vehicle Vision
DNr: Berzelius-2024-133
Project Type: LiU Berzelius
Principal Investigator: Fredrik Lindsten
Affiliation: Linköpings universitet
Duration: 2024-04-01 – 2024-10-01
Classification: 10207


Self-supervised learning (SSL) is the process of pre-training a model on unlabeled data using artificial labels generated from the data itself. Latent representations learned this way on large-scale datasets have been shown to transfer well to multiple downstream tasks. When such a model is fine-tuned with limited labeled training data, the results have surpassed fully supervised training on the same dataset. In vision, most research has focused on datasets in which each image contains a single dominant object, e.g., ImageNet. In autonomous vehicles (AV), large amounts of unlabeled data are readily available, but labeling data for specific tasks is expensive. Hence, self-supervised learning is a promising research direction for AV vision. However, existing SSL methods are not well suited to AV data, which contains complex scenes with multiple objects in each image. Recently, Vision Transformers have shown promising results on vision benchmarks and are inherently good at grouping similar content in images. Motivated by this, we plan to explore ways to extend DINO [1], an existing state-of-the-art Transformer-based SSL method, to handle complex scenes. The latent representations learned by this method will then be evaluated by transferring them to downstream tasks such as object detection, semantic segmentation, and image retrieval. This research has the potential to produce a state-of-the-art pre-trained model for AV vision that makes many AV vision models easier to train, with much less compute and far fewer labeled examples (thus lowering labeling costs). The research will be carried out in Python, with PyTorch as the primary deep learning framework.

[1] Emerging Properties in Self-Supervised Vision Transformers (
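
The abstract names DINO [1] as the starting point but does not describe its objective. As a rough illustration only, the self-distillation loss from [1] can be sketched as follows: a student network is trained to match a teacher's output distribution on a different view of the same image, where the teacher output is centered and sharpened with a low temperature. The sketch is written in plain Python so that it runs without dependencies, whereas the project itself would use PyTorch tensors. The function names, temperatures, momentum value, and toy logits are illustrative assumptions, not the project's code.

```python
import math


def log_softmax(logits, temperature):
    """Numerically stable log-softmax of temperature-scaled logits."""
    scaled = [x / temperature for x in logits]
    m = max(scaled)
    log_z = m + math.log(sum(math.exp(s - m) for s in scaled))
    return [s - log_z for s in scaled]


def dino_loss(student_logits, teacher_logits, center,
              student_temp=0.1, teacher_temp=0.04):
    """Cross-entropy between teacher and student distributions for one view pair.

    The teacher logits are centered (to discourage collapse onto one dimension)
    and sharpened with a lower temperature than the student's.
    Temperatures are illustrative values (assumption).
    """
    centered = [t - c for t, c in zip(teacher_logits, center)]
    teacher_probs = [math.exp(lp) for lp in log_softmax(centered, teacher_temp)]
    student_log_probs = log_softmax(student_logits, student_temp)
    return -sum(p * lp for p, lp in zip(teacher_probs, student_log_probs))


def ema_update(old, new, momentum):
    """Exponential moving average; the same rule updates the center
    (from teacher outputs) and the teacher weights (from student weights)."""
    return [momentum * o + (1.0 - momentum) * n for o, n in zip(old, new)]


# Toy example: logits for two augmented views of the same image (made-up numbers).
student_view = [2.0, 0.5, -1.0]
teacher_view = [1.8, 0.7, -0.9]
center = [0.0, 0.0, 0.0]

loss = dino_loss(student_view, teacher_view, center)
center = ema_update(center, teacher_view, momentum=0.9)
```

In this formulation no gradient flows through the teacher; its weights follow the student's through the same moving-average rule. Extending the method to complex multi-object scenes, as proposed above, would change how views or regions are paired, which this sketch does not attempt to show.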