
Learning Off-Policy with Online Planning

Harshit Sikchi, Wenxuan Zhou, and David Held
Workshop Paper, ICML '20 Inductive Biases, Invariances and Generalization in Reinforcement Learning Workshop, July 2020

Abstract

We propose Learning Off-Policy with Online Planning (LOOP), which combines techniques from model-based and model-free reinforcement learning. The agent learns a model of the environment and then uses trajectory optimization with the learned model to select actions. To sidestep the myopic effect of fixed-horizon trajectory optimization, a value function is attached to the end of the planning horizon. This value function is learned through off-policy reinforcement learning, using trajectory optimization as its behavior policy. Furthermore, we introduce "actor-guided" trajectory optimization to mitigate the actor-divergence issue in the proposed method. We benchmark our method on continuous control tasks and demonstrate that it offers a significant improvement over the underlying model-based and model-free algorithms.
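
To make the planning step concrete, below is a minimal sketch of the kind of procedure the abstract describes: candidate action sequences are sampled around an actor's proposals ("actor-guided" sampling, assumed here as Gaussian noise around the actor's action), rolled out for a fixed horizon through a learned dynamics model, and scored by the predicted rewards plus a terminal value estimate. All names (dynamics_model, reward_model, value_fn, actor) are illustrative assumptions, not the authors' implementation.

import numpy as np

def plan_action(state, dynamics_model, reward_model, value_fn, actor,
                horizon=10, num_candidates=500, noise_std=0.1, gamma=0.99):
    """Return the first action of the highest-scoring H-step action sequence.

    Hypothetical sketch: each candidate sequence is generated by perturbing
    the actor's proposed actions, rolled out through the learned model, and
    scored by discounted predicted rewards plus a terminal value bootstrap.
    """
    best_return, best_first_action = -np.inf, None
    for _ in range(num_candidates):
        s = state
        total, discount = 0.0, 1.0
        first_action = None
        for t in range(horizon):
            # Actor-guided proposal: Gaussian noise around the actor's action.
            mean = actor(s)
            a = mean + noise_std * np.random.randn(*np.shape(mean))
            if t == 0:
                first_action = a
            total += discount * reward_model(s, a)   # predicted reward
            s = dynamics_model(s, a)                 # predicted next state
            discount *= gamma
        total += discount * value_fn(s)              # terminal value beyond horizon
        if total > best_return:
            best_return, best_first_action = total, first_action
    return best_first_action

In this sketch the learned value function supplies the return beyond the planning horizon, which is what lets a short-horizon planner avoid purely myopic behavior.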

BibTeX

@workshop{Sikchi-2020-125595,
author = {Harshit Sikchi and Wenxuan Zhou and David Held},
title = {Learning Off-Policy with Online Planning},
booktitle = {Proceedings of ICML '20 Inductive Biases, Invariances and Generalization in Reinforcement Learning Workshop},
year = {2020},
month = {July},
keywords = {Online Planning, Model-based Reinforcement Learning, Trajectory Optimization, Reinforcement Learning},
}