Environment-Agnostic Autotelic Goal-Conditioned Reinforcement Learning
Abstract
This project aims to develop new methods for environment‑agnostic autotelic goal‑conditioned reinforcement learning (GCRL). Autotelic agents—those capable of generating and pursuing self‑selected goals—are a promising direction summarized in Autotelic Agents with Intrinsically Motivated Goal‑Conditioned Reinforcement Learning: A Short Survey. However, existing methods remain limited by environment‑specific representations, unstable large‑scale goal generation, and poor cross‑domain generalization. Addressing these limitations is central to advancing general‑purpose autonomous learning, aligning with core WASP research priorities.
The project builds on recent contrastive RL work—including Contrastive Learning as GCRL, Accelerating Goal‑Conditioned RL, and 1000‑Layer Networks for Self‑Supervised RL (NeurIPS 2025 Best Paper). These methods demonstrate the potential of contrastive objectives and deeper architectures, but they lack the stable autotelic mechanisms and environment‑agnostic representations needed for generalizable skill acquisition.
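The core idea shared by these contrastive RL methods can be illustrated with a minimal sketch: a state-action representation is scored against goal representations, with future states from the same trajectory as in-batch positives and other goals as negatives (an InfoNCE-style objective). The function and variable names below are illustrative, not taken from any of the cited papers.

```python
import jax
import jax.numpy as jnp

def infonce_loss(sa_repr, goal_repr):
    """Batch-contrastive GCRL loss (sketch): sa_repr[i] should score
    highest against goal_repr[i], a future state of the same trajectory,
    and low against the other goals in the batch."""
    logits = sa_repr @ goal_repr.T          # (B, B) similarity matrix
    # positives sit on the diagonal; cross-entropy with in-batch negatives
    log_probs = jax.nn.log_softmax(logits, axis=1)
    return -jnp.mean(jnp.diag(log_probs))

# toy check with random representations used for both sides
key = jax.random.PRNGKey(0)
sa = jax.random.normal(key, (32, 64))
loss = infonce_loss(sa, sa)   # aligned pairs give a small, non-negative loss
```

In the full algorithms the two representations come from separate state-action and goal encoders; this sketch only shows the shape of the objective.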
The goal is to develop a scalable, autotelic GCRL framework that (1) learns environment‑invariant latent spaces, (2) autonomously proposes and regulates its own goals, and (3) integrates efficiently with vectorized simulation and lightweight contrastive architectures (4–8 layers, 256–512 units). Early‑stage experiments will use these smaller models to iterate rapidly on algorithmic components and training stability before scaling to deeper architectures in later project stages. The primary target publication venue is RLC 2026 (deadline 5 March), with CoRL 2026 (deadline 29 May) and NeurIPS 2026 (expected deadline 15 May) as secondary targets.
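An encoder in the stated size range can be sketched as a plain JAX MLP; the layer widths and depth below are one point in the 4–8 layer, 256–512 unit range, and all names are illustrative rather than part of any existing codebase.

```python
import jax
import jax.numpy as jnp

def init_mlp(key, sizes):
    """Initialize a small MLP; e.g. sizes = [obs_dim, 256, 256, 256, 256, latent_dim]
    gives four 256-unit hidden layers (a sketch of the lightweight range above)."""
    params = []
    keys = jax.random.split(key, len(sizes) - 1)
    for k, d_in, d_out in zip(keys, sizes[:-1], sizes[1:]):
        w = jax.random.normal(k, (d_in, d_out)) * jnp.sqrt(2.0 / d_in)  # He init
        params.append((w, jnp.zeros(d_out)))
    return params

def mlp(params, x):
    for w, b in params[:-1]:
        x = jax.nn.relu(x @ w + b)
    w, b = params[-1]
    return x @ w + b   # latent representation, no final nonlinearity

# hypothetical dimensions: 17-d observations mapped to 64-d latents
params = init_mlp(jax.random.PRNGKey(0), [17, 256, 256, 256, 256, 64])
z = mlp(params, jnp.ones((8, 17)))   # batch of 8 observations -> (8, 64) latents
```

At these sizes a forward pass is cheap enough that many encoder variants can be ablated in parallel on a single GPU, which is the rationale for the early-stage design.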
Fast access to Berzelius, specifically an initial fast‑track allocation on Ampere GPUs, is essential for rapid prototyping, ablation studies, and baseline reproduction. JAX‑based vectorized training runs 4096 simulated environments per GPU and supports multiple concurrent experiments within modest GPU memory, making this early phase highly compute‑efficient. The project is led by a WASP‑funded PhD student under a WASP‑affiliated PI and involves method development with no sensitive data, in full alignment with KAW requirements.
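The per-GPU environment count rests on JAX's `vmap`/`jit` transformations, which step many environment instances in lockstep on one device. The toy point-mass dynamics below stand in for a real simulator; the function and constant names are illustrative.

```python
import jax
import jax.numpy as jnp

N_ENVS = 4096  # the per-GPU environment count cited above

def step(state, action):
    """Toy point-mass dynamics standing in for a real simulator (sketch)."""
    pos, vel = state
    vel = vel + 0.01 * action
    pos = pos + 0.01 * vel
    reward = -jnp.sum(pos ** 2)   # reward for staying near the origin
    return (pos, vel), reward

# one jit-compiled call advances all 4096 environments simultaneously
batched_step = jax.jit(jax.vmap(step))

states = (jnp.zeros((N_ENVS, 2)), jnp.zeros((N_ENVS, 2)))   # positions, velocities
actions = jnp.ones((N_ENVS, 2))
states, rewards = batched_step(states, actions)             # rewards: (4096,)
```

Because the batched step is a single fused kernel rather than 4096 Python-level calls, environment throughput scales with device memory rather than host-side overhead, which is what makes the early phase compute-efficient.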