Closing the Modality Gap of CLIP embedding space
Peiyang Shi <firstname.lastname@example.org>
Kungliga Tekniska högskolan
2023-03-03 – 2023-10-01
CLIP, a popular pretrained model for joint text and image understanding, has been widely adopted in downstream tasks including diffusion models, dataset filtering, and video and audio understanding. However, recent studies have shown the existence of a modality gap in CLIP's embedding space, which limits its overall performance. In this work, we propose a method to efficiently close this modality gap. Our approach removes the positive pair from the set of negatives in the contrastive loss, which leads to a significant reduction in the pairwise distance between image and text embeddings. Preliminary results on a validation set indicate that our approach outperforms CLIP and converges faster. We further plan to train a downstream classifier to evaluate the quality of the learned representation. Overall, the proposed method has the potential to enhance the performance of CLIP and improve results across various downstream tasks.
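The loss modification described above can be sketched as follows. This is a minimal NumPy illustration, not the actual implementation: it assumes a batch of paired image/text embeddings whose cosine similarities form a square matrix with the positive pairs on the diagonal, and compares the standard image-to-text InfoNCE loss against a variant whose denominator excludes the positive term.

```python
import numpy as np

def clip_loss(sim, temperature=0.07, exclude_positive=False):
    """Image-to-text contrastive (InfoNCE-style) loss.

    sim: square similarity matrix (rows: images, cols: texts);
         the diagonal holds the positive image-text pairs.
    exclude_positive: if True, drop the positive pair from the
         set of negatives in the denominator (the proposed variant).
    """
    logits = sim / temperature
    if exclude_positive:
        # Mask the diagonal so the positive does not appear
        # among the negatives; exp(-inf) contributes 0 to the sum.
        mask = np.eye(len(sim), dtype=bool)
        logits_for_denom = np.where(mask, -np.inf, logits)
    else:
        logits_for_denom = logits
    denom = np.log(np.exp(logits_for_denom).sum(axis=1))
    pos = np.diag(logits)
    # Per-sample loss: -log( exp(pos) / denominator ), averaged.
    return float(np.mean(denom - pos))
```

Because removing a positive term strictly shrinks the sum inside the logarithm, the variant's loss is always smaller than the standard one for the same similarity matrix; a symmetric text-to-image term (transposing `sim`) would normally be averaged in as well.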