(WASP project) Beyond supervised learning for semantic analysis of visual data
Sebastian Bujwid <email@example.com>
Kungliga Tekniska högskolan
2023-01-03 – 2023-08-01
Deep neural networks have proven highly effective in domains such as image recognition, natural language processing, and speech recognition. However, such models, as well as the tasks they solve, are usually defined within a single modality (vision, language, speech, etc.).
Undeniably, different modalities complement each other: images can depict the objects that symbols in language refer to, while text can define abstract relations between observed objects (e.g. *dogs and cats are domestic animals*). Despite that, deep learning models are rarely designed to rely on information or knowledge from more than one modality.
We believe that connecting information from different modalities is an important building block for further progress in artificial intelligence. Therefore, this project focuses on knowledge transfer between language and vision, two of the most popular domains of deep learning.
The goal of the project is to study the potential benefits of connecting the information contained in text and image data. The plan is to demonstrate these benefits on tasks such as zero-shot learning, where we want to recognize images of classes unseen during training by relying on textual descriptions of those classes, since language can succinctly describe and define concepts. Models trained in this way would have the practical advantage of being able to reason about classes of objects they have not observed before, which could reduce the reliance of deep neural networks on large numbers of explicitly labeled samples.
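The zero-shot setup described above can be sketched in a few lines: embed the image and each textual class description into a shared space, then predict the class whose description is most similar to the image. The sketch below is illustrative only; the encoder functions are hypothetical placeholders (stubbed with random vectors), not the project's actual models.

```python
import numpy as np

rng = np.random.default_rng(0)
DIM = 64  # assumed dimensionality of the shared embedding space


def encode_image(image):
    # Placeholder: a real model (e.g. a pretrained CNN) would map pixels
    # to a DIM-dimensional vector.
    return rng.standard_normal(DIM)


def encode_text(description):
    # Placeholder: a real model (e.g. a pretrained language model) would
    # map text to a vector in the same space as the image encoder.
    return rng.standard_normal(DIM)


def zero_shot_classify(image, class_descriptions):
    """Return the class whose textual description is closest (by cosine
    similarity) to the image embedding. No labeled images of these
    classes are needed at training time."""
    img = encode_image(image)
    img = img / np.linalg.norm(img)
    best_class, best_score = None, -np.inf
    for name, description in class_descriptions.items():
        txt = encode_text(description)
        txt = txt / np.linalg.norm(txt)
        score = float(img @ txt)
        if score > best_score:
            best_class, best_score = name, score
    return best_class
```

With trained encoders, the same loop would let the model recognize, say, an okapi it has never seen, given only a textual description of what an okapi looks like.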
Another task we intend to study is visual grounding of language, where we want to use information in images to help natural language processing models learn better representations of language.
We hope that the visual appearance of objects, as well as the difference between textual and visual context, conveys information beyond the co-occurrence relations between symbols that most natural language processing models rely on.