Title: Robotic Foundation Models: Vision-Language-Action Frameworks for Generalist Robotics
DNr: Berzelius-2025-139
Project Type: LiU Berzelius
Principal Investigator: Martin Magnusson <martin.magnusson@oru.se>
Affiliation: Örebro universitet
Duration: 2025-04-07 – 2025-11-01
Classification: 20201
Keywords:

Abstract

In this project, we propose to develop robotic foundation models that integrate vision, language, and action (VLA) modalities within a unified, generalist robotics framework. The primary objective is to enable robots to generalize acquired skills efficiently across diverse manipulation tasks and environments, building on recent advances in multimodal representation learning and large-scale pretraining.

Traditional robotic manipulation often requires specialized controllers or extensive task-specific fine-tuning, which severely limits the scalability and adaptability of robotic systems. Recent Vision-Language-Action models have shown promising results: they typically fine-tune pretrained vision-language models on large-scale demonstration datasets, enabling robots to perform tasks guided by natural language instructions across diverse scenarios. A significant limitation remains, however: their heavy dependence on large amounts of optimal teleoperation demonstration data.

To address this challenge, the project explores a novel Vision-Language-Action framework designed to improve both the learning efficiency and the generalization capabilities of robotic foundation models. Our approach retains the strengths of VLA models, in particular a unified representation that interprets visual inputs, comprehends natural language commands, and executes complex actions, while significantly reducing reliance on extensive demonstration data.

We will leverage large-scale demonstration datasets to pretrain foundation models on the Berzelius infrastructure, enabling scalable and effective representation learning. After pretraining, the models will be fine-tuned on a carefully selected set of robotic manipulation tasks; this stage is crucial for assessing and validating their generalization performance and data efficiency. This methodology aims to produce robust, versatile robotic models capable of rapid adaptation to novel situations with minimal additional training data.

Achieving these objectives promises substantial advances in robotics research and industrial applications, accelerating the deployment of robots that can perform varied tasks in complex, dynamic real-world environments. Ultimately, the outcomes of this research will lay essential groundwork for the next generation of versatile, intelligent robotic systems.
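To make the pretrain-then-fine-tune pipeline concrete, the toy sketch below illustrates the behavior-cloning setup typical of VLA fine-tuning: an image and a tokenized language instruction are encoded, fused, and regressed onto a demonstrated action. This is a minimal, self-contained illustration only; the module names, network sizes, action dimensionality, and synthetic batch are illustrative assumptions and do not describe the project's actual architecture, which would build on a pretrained vision-language backbone and real teleoperation data.

```python
import torch
import torch.nn as nn

class ToyVLAPolicy(nn.Module):
    """Illustrative vision-language-action policy: encode an image and a
    tokenized instruction, fuse the features, and regress a continuous
    action. All sizes here are toy stand-ins, not the project's design."""
    def __init__(self, vocab_size=1000, embed_dim=64, action_dim=7):
        super().__init__()
        # Vision encoder: a small CNN standing in for a pretrained VLM backbone.
        self.vision = nn.Sequential(
            nn.Conv2d(3, 16, 5, stride=2), nn.ReLU(),
            nn.Conv2d(16, 32, 5, stride=2), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
            nn.Linear(32, embed_dim),
        )
        # Language encoder: mean-pooled token embeddings as a stand-in
        # for a pretrained language model.
        self.lang = nn.Embedding(vocab_size, embed_dim)
        # Action head: maps fused features to an action vector
        # (e.g. a 6-DoF end-effector delta plus a gripper command).
        self.head = nn.Sequential(
            nn.Linear(2 * embed_dim, 128), nn.ReLU(),
            nn.Linear(128, action_dim),
        )

    def forward(self, image, tokens):
        v = self.vision(image)                  # (B, embed_dim)
        l = self.lang(tokens).mean(dim=1)       # (B, embed_dim)
        return self.head(torch.cat([v, l], dim=-1))

# One behavior-cloning step on a synthetic batch; real fine-tuning would
# draw (image, instruction, action) triples from teleoperation demonstrations.
policy = ToyVLAPolicy()
opt = torch.optim.AdamW(policy.parameters(), lr=1e-4)
image = torch.randn(8, 3, 64, 64)            # camera observations
tokens = torch.randint(0, 1000, (8, 12))     # tokenized instructions
action = torch.randn(8, 7)                   # demonstrated actions
loss = nn.functional.mse_loss(policy(image, tokens), action)
opt.zero_grad()
loss.backward()
opt.step()
print(f"behavior-cloning loss: {loss.item():.4f}")
```

In this setup, reducing reliance on demonstration data amounts to making the pretrained encoders carry more of the burden, so that the action head can be adapted to new tasks from far fewer (image, instruction, action) triples.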