Towards Detailed Visual Understanding via Large-scale Vision-Language Models

Title:	Towards Detailed Visual Understanding via Large-scale Vision-Language Models
DNr:	Berzelius-2023-191
Project Type:	LiU Berzelius
Principal Investigator:	Fahad Khan <fahad.khan@liu.se>
Affiliation:	Linköpings universitet
Duration:	2023-08-28 – 2024-02-02
Classification:	10207
Keywords:

Abstract

Computer vision has recently seen significant advances driven by the development of many foundational vision-language models. These models represent a significant leap towards general-purpose vision models capable of tackling various tasks simultaneously. To this end, conversation agents fueled by Large Language Models (LLMs) are providing a new way to interact with visual data. While there have been initial attempts at image-based conversation models, this project addresses the underexplored field of video-based conversation models by developing novel approaches that combine the capabilities of LLMs with a pretrained visual encoder adapted for spatiotemporal video representations. The project further aims to explore a semi-automatic annotation framework for generating high-quality instruction data for images and videos. Moreover, the aim is to develop novel and efficient transformer-based architectures, which serve as a key building block of these recent vision-language models. The proposed methods are expected to have a profound impact on many real-world applications, ranging from healthcare to intelligent autonomous systems.

National Supercomputer Centre at Linköping University
