Compositional Generalisation for Visual Question Answering - building dataset and training models
||Adam Dahlgren Lindström <email@example.com>|
||2023-06-06 – 2024-01-01|
Compositional generalisation is the capability to understand and produce an unbounded number of novel expressions from a finite set of components. It is a central aspect of generalisation, and one that deep learning models struggle with. To test this capability in models, we can construct training/test splits where certain combinations of components are held out from training, so that the test set probes exactly those novel combinations.
For language, this can mean that a certain word is never seen as the object of a sentence, only as the subject. For vision, it might mean that no red spheres appear in training, although other red objects and other spheres do. The splits are created such that good performance on the test set indicates that the model can compose, for example, object properties independently. Compositional generalisation is important in the effort to achieve more data-efficient and robust models.
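The split construction described above can be sketched as follows. This is a minimal illustration, assuming CLEVR-style scene annotations where each object has "color" and "shape" attributes; the held-out pair ("red", "sphere") and the function names are hypothetical examples, not the benchmark's actual configuration.

```python
# Hypothetical held-out attribute combination: red spheres never appear
# in training, but red objects and spheres each appear separately.
HELD_OUT = {("red", "sphere")}

def contains_held_out(scene):
    """True if any object in the scene realises a held-out combination."""
    return any(
        (obj["color"], obj["shape"]) in HELD_OUT
        for obj in scene["objects"]
    )

def compositional_split(scenes):
    """Route scenes containing held-out combinations to test, the rest to train."""
    train, test = [], []
    for scene in scenes:
        (test if contains_held_out(scene) else train).append(scene)
    return train, test

# Toy scenes: the red cube and blue sphere go to train,
# the red sphere goes to test.
scenes = [
    {"objects": [{"color": "red", "shape": "cube"}]},
    {"objects": [{"color": "blue", "shape": "sphere"}]},
    {"objects": [{"color": "red", "shape": "sphere"}]},
]
train, test = compositional_split(scenes)
```

A model that performs well on the resulting test set must compose "red" and "sphere" from their separate occurrences in training, which is the behaviour the benchmark is designed to measure.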
The project is building a benchmark dataset for compositional generalisation in visual question answering, using synthetic data from the CLEVR domain. Most of the benchmark has now been constructed, and the project is in the phase of comparing the performance of different multimodal transformers and neuro-symbolic methods.
Berzelius is needed to scale up the evaluation of the large transformer models.