
Dual Policy Iteration

Wen Sun, Geoff Gordon, Byron Boots and Drew Bagnell
Tech. Report, CMU-RI-TR-18-08, April, 2018




Recently, a novel class of Approximate Policy Iteration (API) algorithms has demonstrated impressive practical performance (e.g., ExIt from Anthony et al. (2017)). This new family of algorithms maintains, and alternately optimizes, two policies: a fast, reactive policy (e.g., a deep neural network) deployed at test time, and a slow, non-reactive policy (e.g., tree search) that can plan multiple steps ahead but is only available during training. The reactive policy is updated under supervision from the non-reactive policy, while the non-reactive policy is improved with guidance from the reactive policy. In this work we study this Dual Policy Iteration (DPI) strategy in an alternating optimization framework and provide a convergence analysis that extends existing API theory. We also develop a special instance of this framework which reduces the update of non-reactive policies to model-based optimal control using learned local models, and provides a theoretically sound way of unifying model-free and model-based RL approaches for applications with unknown dynamics. We demonstrate the efficacy of our approach on various continuous control Markov Decision Processes (MDPs).
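The alternating scheme in the abstract can be illustrated with a toy sketch. Everything below is an illustrative assumption, not the paper's algorithm: a tiny 1-D chain MDP with known dynamics stands in for learned local models, a tabular policy plays the fast reactive policy, and a multi-step lookahead search plays the slow non-reactive policy. The two directions of the loop are visible: the fast policy is trained by imitating the planner, while the planner is only queried on states the fast policy actually visits.

```python
"""Toy sketch of the Dual Policy Iteration loop on a 1-D chain MDP.

Illustrative only: the MDP, horizon, and names (slow_policy, train_dpi)
are assumptions for this sketch, not taken from the paper. The paper's
instance learns local models; here the true dynamics are used instead.
"""

GOAL, N_STATES, ACTIONS = 4, 5, (-1, +1)

def step(s, a):
    # Deterministic chain dynamics, clipped to the state space.
    return min(max(s + a, 0), N_STATES - 1)

def reward(s):
    return -abs(s - GOAL)

def slow_policy(s, depth=3):
    """Non-reactive policy: multi-step lookahead (tree search) from s."""
    def value(s, d):
        if d == 0:
            return reward(s)
        return reward(s) + max(value(step(s, a), d - 1) for a in ACTIONS)
    return max(ACTIONS, key=lambda a: value(step(s, a), depth - 1))

def train_dpi(iters=5, horizon=6):
    # Fast, reactive policy: a table mapping state -> action.
    fast = {s: ACTIONS[0] for s in range(N_STATES)}
    for _ in range(iters):
        # 1) Guidance from the fast policy: roll it out to see which
        #    states the planner should be asked about.
        s, visited = 0, []
        for _ in range(horizon):
            visited.append(s)
            s = step(s, fast[s])
        # 2) Supervision from the slow policy: imitate the planner's
        #    actions on the visited states.
        for s in visited:
            fast[s] = slow_policy(s)
    return fast

fast = train_dpi()
print([fast[s] for s in range(N_STATES)])  # → [1, 1, 1, 1, 1]
```

Each iteration the reactive policy's reach extends one state further toward the goal, so after a few alternations it moves right everywhere, matching the planner without any search at test time.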

BibTeX Reference
@techreport{sun2018dualpolicyiteration,
  author = {Wen Sun and Geoff Gordon and Byron Boots and Drew Bagnell},
  title = {Dual Policy Iteration},
  year = {2018},
  month = {April},
  institution = {Carnegie Mellon University},
  address = {Pittsburgh, PA},
  number = {CMU-RI-TR-18-08},
  keywords = {Reinforcement Learning, Imitation Learning, Model-Based Control},
}