Towards Detailed Visual Understanding via Large-scale Vision-Language Models
Title: |
Towards Detailed Visual Understanding via Large-scale Vision-Language Models |
DNr: |
Berzelius-2024-300 |
Project Type: |
LiU Berzelius |
Principal Investigator: |
Fahad Khan <fahad.khan@liu.se> |
Affiliation: |
Linköpings universitet |
Duration: |
2024-09-01 – 2025-03-01 |
Classification: |
10207 |
Keywords: |
|
Abstract
The field of computer vision has recently seen significant advances driven by the development of foundational vision-language models. These models represent a major step towards general-purpose vision models capable of tackling various tasks simultaneously. In this context, conversational agents powered by Large Language Models (LLMs) provide a new way to interact with visual data.
While there have been initial attempts at image-based conversation models, this project addresses both the further advancement of image-based foundation models and the underexplored field of video-based multimodal foundation models. The aim is to develop novel approaches that efficiently combine the capabilities of LLMs with a pretrained visual encoder adapted for multimodal spatial or spatiotemporal representations. The project further aims to explore a semi-automatic annotation framework for generating high-quality instruction data for images, videos, and text. Moreover, the aim is to develop novel and efficient transformer-based architectures that serve as a key building block behind these recent large vision-language foundation models. The proposed methods are expected to have a profound impact on many real-world applications, ranging from healthcare to intelligent autonomous systems.
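As a rough illustration of the architectural pattern described above, the following minimal sketch shows one common way of coupling a frozen pretrained visual encoder with an LLM: visual features are mapped into the LLM's token embedding space by a learned projection and prepended to the text embeddings. All module names, dimensions, and design choices here are illustrative assumptions, not the project's actual method.

```python
# Minimal sketch (assumed design, not the project's method) of a
# LLaVA-style vision-language connector: project visual features into
# the LLM embedding space and prepend them to the text tokens.
import torch
import torch.nn as nn


class VisionLanguageConnector(nn.Module):
    def __init__(self, vision_dim=1024, llm_dim=4096):
        super().__init__()
        # Learned linear projection from the visual encoder's feature
        # space into the LLM's token embedding space.
        self.projector = nn.Linear(vision_dim, llm_dim)

    def forward(self, visual_feats, text_embeds):
        # visual_feats: (batch, num_patches, vision_dim) from a frozen encoder;
        # for video, per-frame features could be pooled over time first to
        # obtain a spatiotemporal representation.
        # text_embeds: (batch, seq_len, llm_dim) from the LLM's embedding layer.
        visual_tokens = self.projector(visual_feats)
        # Concatenate so the LLM attends to visual tokens as a prefix.
        return torch.cat([visual_tokens, text_embeds], dim=1)


# Toy usage with random tensors standing in for real encoder/LLM outputs.
connector = VisionLanguageConnector()
visual_feats = torch.randn(2, 256, 1024)  # e.g. ViT patch features
text_embeds = torch.randn(2, 32, 4096)    # e.g. LLM token embeddings
fused = connector(visual_feats, text_embeds)
print(fused.shape)  # torch.Size([2, 288, 4096])
```

In practice, such a connector is typically trained on image- or video-text instruction data while the visual encoder (and often the LLM) stays frozen, which keeps the approach computationally efficient.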