Representation Learning for Machine Learning and Robotics
Abstract
Our project aims to explore the challenges and possibilities of modern representation learning, with an emphasis on large-scale computation and data requirements. In particular, it is organized along two tracks: (I) general machine learning tasks; and (II) robotics applications.
Regarding (I), we plan to explore three different aspects of representation learning for machine learning tasks: (a) the type of data; (b) the model architecture; and (c) the downstream application. For (I-a) we focus on the problem posed by the rapid advancement of models for multimodal data and the computational and environmental cost of training such pipelines. In particular, we address the question of how to efficiently update modules of pretrained task pipelines without degrading their quality. To this end, we plan to explore several recently proposed methods for sample-efficient adaptation and representation alignment of large-scale models, such as LoRA [1], relative representations [2], and semantic alignment [3], expanding their use to the multimodal setting. For (I-b) we will explore novel classes of deep generative models, such as Idempotent Generative Models (IGM [4]), deepening the understanding of their training dynamics and mitigating their mode collapse issues. For (I-c) we plan to explore the use of large language models (LLMs) and other pretrained large-scale models for Brain-Computer Interface (BCI) applications. In particular, we will focus on adapting large pretrained models to better predict human perceptual experience (e.g., odor prediction), and on evaluating the alignment between the representations encoded by such models and human experience when both are presented with the same stimuli.
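As a purely illustrative aside, the minimal sketch below shows, under our assumptions about how the techniques cited above are typically applied, (i) a LoRA-style [1] low-rank update attached to a frozen pretrained linear layer and (ii) a relative-representation [2] projection onto a set of anchor embeddings. All names here (LoRALinear, relative_representation, r, alpha) are hypothetical and chosen only for this example, not part of the project's codebase.

```python
# Hedged sketch (illustration only): two of the adaptation techniques cited above.
import torch
import torch.nn as nn
import torch.nn.functional as F

class LoRALinear(nn.Module):
    """Frozen pretrained linear layer plus a trainable low-rank update,
    in the spirit of LoRA [1]: y = W x + (alpha / r) * B A x."""
    def __init__(self, pretrained: nn.Linear, r: int = 8, alpha: float = 16.0):
        super().__init__()
        self.base = pretrained
        for p in self.base.parameters():
            p.requires_grad = False                 # keep the pretrained weights fixed
        self.A = nn.Parameter(torch.randn(r, pretrained.in_features) * 0.01)
        self.B = nn.Parameter(torch.zeros(pretrained.out_features, r))  # zero init: no change at start
        self.scale = alpha / r

    def forward(self, x):
        return self.base(x) + self.scale * F.linear(F.linear(x, self.A), self.B)

def relative_representation(z, anchors):
    """Relative representations [2]: describe each sample by its cosine
    similarities to a fixed set of anchor embeddings, so that latent spaces
    of independently trained encoders become comparable."""
    z = F.normalize(z, dim=-1)
    anchors = F.normalize(anchors, dim=-1)
    return z @ anchors.T                            # (N, d) x (K, d)^T -> (N, K)
```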
Regarding (II), we will focus on three components of robotic learning: (a) multimodal sensory inputs, (b) embodiment transfer, and (c) robustness. For (II-a), we plan to adapt large pretrained foundation models such as [5,6] to process multimodal sensory inputs such as videos, text, images, language, sound, and force-torque data. Force-torque data in particular have been largely ignored, despite being a vital input modality for many robotic tasks such as cutting, wiping, and scooping. For (II-b), we plan to extend pretrained robotics foundation models to new types of hardware, including dexterous robotic hands and our in-house grippers. To achieve embodiment transfer, we will first collect new large-scale datasets with these grippers and then adapt the current foundation models on them in a sample-efficient manner, e.g., via LoRA [1]. For (II-c), we plan to explore new network structures for training robotic policies that are robust to distractors and to changes in hand-eye configurations. We will first disentangle the roles of high-level, global, semantically driven macro-motion and of fine-motion adjustments grounded in detailed, local interactions between the robot and objects. Subsequently, we will design new algorithms that effectively integrate these macro- and fine-motion strategies into a cohesive policy-generation framework.
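As a hedged, illustrative sketch only (not the project's actual design), the code below shows one possible way to fuse force-torque readings with visual features and to combine a semantic macro-motion head with a local fine-motion correction head into a single policy. All module and parameter names (ForceTorqueEncoder, MacroFinePolicy, feat_dim, action_dim) are hypothetical and introduced solely for this example.

```python
# Hedged sketch (illustration only): force-torque fusion plus macro/fine-motion heads.
import torch
import torch.nn as nn

class ForceTorqueEncoder(nn.Module):
    """Embed a 6-D force-torque reading into the same space as visual features."""
    def __init__(self, feat_dim: int = 256):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(6, 64), nn.ReLU(), nn.Linear(64, feat_dim))

    def forward(self, ft):                          # ft: (B, 6)
        return self.net(ft)

class MacroFinePolicy(nn.Module):
    """Macro head predicts a coarse action from global visual context;
    fine head predicts a scaled residual correction from local contact cues."""
    def __init__(self, feat_dim: int = 256, action_dim: int = 7):
        super().__init__()
        self.ft_enc = ForceTorqueEncoder(feat_dim)
        self.macro_head = nn.Sequential(nn.Linear(feat_dim, 128), nn.ReLU(),
                                        nn.Linear(128, action_dim))
        self.fine_head = nn.Sequential(nn.Linear(2 * feat_dim, 128), nn.ReLU(),
                                       nn.Linear(128, action_dim))

    def forward(self, visual_feat, ft):             # visual_feat: (B, feat_dim), ft: (B, 6)
        ft_feat = self.ft_enc(ft)
        macro = self.macro_head(visual_feat)        # global, semantics-driven motion
        fine = 0.1 * self.fine_head(torch.cat([visual_feat, ft_feat], dim=-1))  # small local correction
        return macro + fine

# Example call: policy(torch.randn(4, 256), torch.randn(4, 6)) -> (4, 7) actions.
```

The additive macro-plus-residual decomposition is only one design choice; gating or attention between the two heads is an equally plausible alternative that we would compare against.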