Self-supervised Stage Visual Transformers
Title: Self-supervised Stage Visual Transformers
SNIC Project: Berzelius-2021-56
Project Type: LiU Berzelius
Principal Investigator: Hao Hu
Affiliation: Kungliga Tekniska högskolan
Duration: 2021-10-08 – 2022-02-01
Classification: 10207


Deep neural networks have achieved remarkable success in a wide range of computer vision tasks over the past decade. Although recent research indicates that attention-based models such as visual transformers have more potential than conventional convolutional neural networks (CNNs) to reach better performance, their need for large amounts of labeled training data is a major obstacle to wider application. This project aims to alleviate this problem by exploring ways of training attention-based models in self-supervised settings, so that networks can still achieve competitive performance with significantly less, or even very little, labeled training data.
In the big data era, it has become easy to collect large amounts of data from different sources, which makes it possible to train deep neural networks on large-scale datasets. However, it is much harder to acquire a comparable number of corresponding labels, for reasons such as expensive human labour and a lack of expertise. With these factors in mind, we can highlight the importance of our project from two aspects. First, as an emerging topic in AI, self-supervised learning has shown its potential to be an essential solution for leveraging large-scale unlabeled training data. At the same time, recently developed models such as visual transformers suggest that large data volumes remain the key to building powerful AI systems. Thus, there is a good chance that our project can make fundamental contributions to improving state-of-the-art deep learning models with significantly less labeled data. Second, considering the excellent transferability demonstrated by the latest self-supervised models, we believe the methodologies developed in our project will also generalize well, which means they can be easily adapted to diverse downstream tasks while maintaining performance. This brings additional value for specific tasks and real-world applications, where available labeled data is usually limited.
In our project, we will develop novel approaches in terms of both network architectures and self-supervised training schemes. More specifically, we are going to improve the architecture of the latest visual transformers by incorporating newly designed attention-based modules, and train the improved networks with new objectives suited to those attention modules. We will test the developed methods by training them without labels on several large-scale classification benchmarks, including ImageNet. We also plan to evaluate the transferability of the developed methods by fine-tuning them on fine-grained classification benchmarks such as CUB-birds. The evaluation will also include additional ablation studies. We expect that all of these experiments can be supported by Berzelius computational resources. We also expect to submit our methods and evaluation results to one or two top venues for publication at the end of the project. All the data used for evaluation will come from public benchmarks.