The Beta Policy for Continuous Control Reinforcement Learning

Master's Thesis, Tech. Report CMU-RI-TR-17-38, Robotics Institute, Carnegie Mellon University, June 2017

Abstract

Recently, reinforcement learning with deep neural networks has achieved great success in challenging continuous control problems such as 3D locomotion and robotic manipulation. However, in real-world control problems, the actions one can take are bounded by physical constraints, which introduces a bias when the standard Gaussian distribution, whose support is unbounded, is used as the stochastic policy. In this work, we propose the Beta distribution as an alternative and analyze the bias and variance of the policy gradients of both policies. We show that the Beta policy is bias-free and that it converges significantly faster and achieves higher scores than the Gaussian policy when both are used with trust region policy optimization (TRPO) and actor-critic with experience replay (ACER), the state-of-the-art on- and off-policy stochastic methods, respectively, on OpenAI Gym's and MuJoCo's continuous control environments.
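To make the contrast in the abstract concrete, the following is a minimal sketch, not the thesis's implementation. It draws bounded actions from a Beta distribution rescaled to an action interval, evaluates the corresponding log-density, and measures how much probability mass a Gaussian policy places outside that interval. The function names, the `softplus(x) + 1` parameterization of the shape parameters, and the example bounds and inputs are all assumptions chosen for illustration.

```python
import math
import random


def softplus(x):
    """Numerically stable log(1 + exp(x))."""
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))


def beta_params(raw_alpha, raw_beta):
    # Assumption: map unconstrained network outputs to shape parameters
    # greater than 1, so the Beta density is unimodal.
    return softplus(raw_alpha) + 1.0, softplus(raw_beta) + 1.0


def sample_action(alpha, beta, low, high, rng):
    # A Beta sample lies in [0, 1]; rescale it to the action bounds.
    u = rng.betavariate(alpha, beta)
    return low + (high - low) * u


def log_prob(action, alpha, beta, low, high):
    # Log-density of the rescaled Beta at an action strictly inside
    # (low, high), including the change-of-variables term -log(high - low).
    u = (action - low) / (high - low)
    log_norm = (math.lgamma(alpha + beta)
                - math.lgamma(alpha) - math.lgamma(beta))
    log_density_u = (log_norm
                     + (alpha - 1.0) * math.log(u)
                     + (beta - 1.0) * math.log(1.0 - u))
    return log_density_u - math.log(high - low)


def gaussian_mass_outside(mu, sigma, low, high):
    # Probability that a N(mu, sigma^2) sample falls outside [low, high].
    def cdf(x):
        return 0.5 * (1.0 + math.erf((x - mu) / (sigma * math.sqrt(2.0))))
    return cdf(low) + (1.0 - cdf(high))


# Hypothetical example: actions bounded to [-1, 1].
rng = random.Random(0)
alpha, beta = beta_params(0.3, -0.2)
actions = [sample_action(alpha, beta, -1.0, 1.0, rng) for _ in range(1000)]
outside = gaussian_mass_outside(0.9, 0.5, -1.0, 1.0)
```

Every Beta sample lands inside the bounds by construction, whereas a Gaussian whose mean sits near a bound places a substantial share of its mass outside the valid range. Those out-of-range samples must be clipped or otherwise handled by the environment, which is the boundary effect the abstract identifies as the source of bias.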

BibTeX

@mastersthesis{Chou-2017-26161,
author = {Po-Wei Chou},
title = {The Beta Policy for Continuous Control Reinforcement Learning},
year = {2017},
month = {June},
school = {Carnegie Mellon University},
address = {Pittsburgh, PA},
number = {CMU-RI-TR-17-38},
}
