Offline2Online Reinforcement Learning
DNr: Berzelius-2024-86
Project Type: LiU Berzelius
Principal Investigator: Ruoqi Zhang
Affiliation: Uppsala universitet
Duration: 2024-04-01 – 2024-10-01
Classification: 20202


Offline-to-online reinforcement learning (RL) is a paradigm that combines an offline and an online phase. In the offline phase, the agent is trained on a pre-existing dataset of experiences or interactions, without any further interaction with the environment. The agent is then fine-tuned through online interaction, which allows it to adapt to changes in the environment, correct biases or inaccuracies inherited from the offline data, and improve its performance by learning from new experience. While offline RL enables learning from previously collected data without additional environment interaction, using this data efficiently is challenging. Moreover, once the policy transitions to the online phase, it must adapt quickly to the environment using as few interactions as possible to avoid costly mistakes. Achieving high sample efficiency in both phases is therefore crucial but difficult.

Our idea is to train a distributional, actor-free agent whose value function is convex with respect to the action, so that the action to take can be obtained by solving a convex optimization problem. By adjusting a parameter over the learned quantiles, the policy can be shifted smoothly from a conservative to an optimistic stance. This built-in adaptability improves the agent's sample efficiency, particularly in the transition to online learning.
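As a rough illustration of the idea above (not the project's actual implementation), the sketch below builds a distributional, actor-free value function whose quantile heads are each concave in the action (equivalently, the negated value is convex), so that action selection reduces to a convex optimization solved here by gradient ascent. The quadratic heads, the `risk_weights` scheme, and all function names are illustrative assumptions.

```python
import numpy as np

def quantile_heads(state, n_quantiles=5, seed=0):
    """Hypothetical quantile value heads: Q_i(s, a) = -0.5 * ||a - mu_i(s)||^2 + b_i.
    Each head is concave in a, so any nonnegative combination is concave too,
    and maximizing over a is a convex optimization problem."""
    rng = np.random.default_rng(seed)
    mus = state + 0.1 * rng.standard_normal((n_quantiles, state.size))
    bs = np.linspace(-1.0, 1.0, n_quantiles)  # ordered quantile offsets
    return mus, bs

def risk_weights(n_quantiles, tau, sharpness=4.0):
    """tau in [0, 1] tilts mass from low quantiles (conservative, tau near 0)
    toward high quantiles (optimistic, tau near 1)."""
    logits = sharpness * (tau - 0.5) * np.arange(n_quantiles)
    w = np.exp(logits - logits.max())
    return w / w.sum()

def select_action(mus, bs, tau, lr=0.5, steps=200):
    """Maximize the risk-weighted value by gradient ascent; because the
    objective is concave in a, this converges to the global optimum."""
    w = risk_weights(len(bs), tau)
    a = np.zeros(mus.shape[1])
    for _ in range(steps):
        grad = -(w[:, None] * (a - mus)).sum(axis=0)  # dQ/da for quadratic heads
        a = a + lr * grad
    return a, w

state = np.array([0.3, -0.2])
mus, bs = quantile_heads(state)
a_cons, w_cons = select_action(mus, bs, tau=0.1)  # conservative action
a_opt, w_opt = select_action(mus, bs, tau=0.9)    # optimistic action
```

With the quadratic heads used here, the optimum is the risk-weighted mean of the head centers, so changing `tau` alone moves the chosen action between conservative and optimistic behavior without retraining an actor network.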