Self-supervised Visual Transformers with Resolution Consistency for Remote Sensing
Hao Hu <email@example.com>
Kungliga Tekniska högskolan
2022-03-07 – 2022-10-01
In the past decade, deep learning has achieved great success in computer vision, driven by the explosive growth of visual data. These achievements show how important large-scale datasets are for training data-hungry neural networks, which is problematic in certain domains. The challenges are twofold. First, collecting large amounts of data is prohibitively expensive for many real-world applications. Second, most deep learning methods rely on labeled data, whose acquisition can be costly and impractical. To overcome these challenges, researchers are turning their attention to transfer learning and self-supervised learning. The former adapts a neural network pretrained on a source domain to a target domain, so that the target data can benefit from the general knowledge learned from the source dataset, while the latter learns general data representations without the guidance of labels.
Motivated by this success in computer vision, the remote sensing community has increasingly adopted deep neural networks. Thanks to public satellites such as Sentinel-2, it has become possible to acquire significantly more Earth Observation (EO) images than ever before, accelerating research on applying deep learning-based approaches to the remote sensing domain. Because most remote sensing tasks require pixel-level outputs, building a large EO dataset for training networks with label supervision can be extremely costly. Combining transfer learning and self-supervised learning is therefore a feasible solution. However, EO data also brings its own challenges, one of which is image resolution. Since the pixels of EO data need to represent objects in sufficient detail, each image can have a very high resolution, which either prevents the images from fitting in GPU memory during training or leads to impractical training times.
In the context of self-supervised transfer learning, we tackle this challenge by using different resolutions for self-supervised pretraining and supervised fine-tuning. That is, we resize the images of the large-scale source dataset to a lower resolution for self-supervised pretraining so that they fit on affordable GPU hardware. We then fine-tune the pretrained network on the small target dataset at high resolution so that the pixel-wise output still contains enough detail. To achieve this, we propose two improvements over commonly used approaches. First, we use Visual Transformers (ViTs) as our neural network backbones to avoid the mismatch between pretraining and fine-tuning input resolutions. Second, we impose a consistency loss between the representations of low- and high-resolution images so that they represent the same global content. In our experiments, we will pretrain the ViT on the large-scale Functional Map of the World (fMoW) dataset using state-of-the-art self-supervised learning, then fine-tune it on the xView 2 dataset for building detection. We expect both improvements to lead to higher performance on the target dataset.
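The resolution consistency idea can be illustrated with a minimal NumPy sketch. It is not the project's implementation: the toy encoder (a linear patch embedding followed by mean pooling over tokens) only stands in for a ViT, and the cosine-distance form of the loss, the patch size, the embedding width, and all function names are assumptions made for illustration. The sketch shows that a patch-based encoder accepts both resolutions (only the number of tokens changes) and that the two global representations can be compared directly.

```python
import numpy as np

PATCH = 4        # hypothetical patch size
EMBED_DIM = 16   # hypothetical embedding width


def downsample(image, factor):
    """Average-pool an (H, W, C) image by an integer factor."""
    h, w, c = image.shape
    assert h % factor == 0 and w % factor == 0
    blocks = image.reshape(h // factor, factor, w // factor, factor, c)
    return blocks.mean(axis=(1, 3))


def patchify(image, patch=PATCH):
    """Split an (H, W, C) image into flattened non-overlapping patches."""
    h, w, c = image.shape
    assert h % patch == 0 and w % patch == 0
    grid = image.reshape(h // patch, patch, w // patch, patch, c)
    return grid.transpose(0, 2, 1, 3, 4).reshape(-1, patch * patch * c)


def encode(image, weights):
    """Toy stand-in for a ViT: embed each patch, then mean-pool the tokens.

    The number of tokens depends on the input resolution, but the pooled
    representation always has EMBED_DIM entries.
    """
    tokens = patchify(image) @ weights
    return tokens.mean(axis=0)


def consistency_loss(z_low, z_high, eps=1e-8):
    """Cosine distance between the two global representations (assumed form)."""
    denom = np.linalg.norm(z_low) * np.linalg.norm(z_high) + eps
    return 1.0 - float(z_low @ z_high) / denom


rng = np.random.default_rng(0)
weights = rng.normal(size=(PATCH * PATCH * 3, EMBED_DIM))

high = rng.random((32, 32, 3))   # stands in for a high-resolution EO image
low = downsample(high, 2)        # low-resolution view of the same scene

z_high = encode(high, weights)
z_low = encode(low, weights)
loss = consistency_loss(z_low, z_high)
print(f"tokens: {patchify(low).shape[0]} (low) vs {patchify(high).shape[0]} (high)")
print(f"consistency loss: {loss:.4f}")
```

In an actual training setup, a term of this kind would presumably be added to the self-supervised objective and minimized with respect to the encoder parameters; how the two are weighted is left open here.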