Multimodal transformer for speech, text and images
Title: Multimodal transformer for speech, text and images
SNIC Project: Berzelius-2022-44
Project Type: LiU Berzelius
Principal Investigator: Birger Moell <bmoell@kth.se>
Affiliation: Kungliga Tekniska högskolan
Duration: 2022-04-25 – 2022-11-01
Classification: 10208
Homepage: https://www.kth.se/is/tmh/division-of-speech-music-and-hearing-1.780110
Keywords:

Abstract

Our goal is to train a Swedish data2vec model, a multimodal transformer model for text, speech and images, and to release it as open source. We believe that a multimodal transformer model can be a next step towards more generalisable models and can be useful to the Swedish research community and the general public. The data2vec paper is available at https://arxiv.org/abs/2202.03555