Sample Efficient Bayesian Optimization for Policy Search: Case Studies in Robotics and Education

Rika Antonova
Master's Thesis, Tech. Report, CMU-RI-TR-16-40, Robotics Institute, Carnegie Mellon University, August, 2016


Copyright notice: This material is presented to ensure timely dissemination of scholarly and technical work. Copyright and all rights therein are retained by authors or by other copyright holders. All persons copying this information are expected to adhere to the terms and constraints invoked by each author's copyright. These works may not be reposted without the explicit permission of the copyright holder.


In this work we investigate the problem of learning adaptive strategies, called policies, in domains where evaluating different policies is costly. We formalize the problem as direct policy search: searching the space of policy parameters to identify policies that perform well with respect to a given objective. Bayesian Optimization is one method suitable for such settings when sample/data efficiency is desired. We use this method as a starting point and present approaches that further improve sample efficiency.

To take advantage of domain knowledge, we propose an approach for constructing a domain-specific kernel. This construction utilizes a simulator of the underlying model dynamics, but does not require the simulator to capture the dynamics perfectly. We demonstrate the success of this approach on the case of learning bipedal locomotion policies and outline the conditions necessary for a similar approach to be useful in other domains.

In some settings, model-based approaches are weakened by the lack of domain knowledge. However, the task structure can still be used to improve sample efficiency in a model-free way. We demonstrate how this can be achieved for the case of learning optimal stopping policies. We propose a resampling approach that reuses samples/trajectories for model-free off-policy evaluation. “Off-policy” means we can reuse previously collected samples/trajectories to evaluate new policies. This allows us to significantly reduce the number of costly samples collected from the environment during optimization, while providing a way to evaluate a large number of stopping policies.

For our experiments we consider two domains where policy evaluation is costly: bipedal locomotion and intelligent tutoring systems. For locomotion, we construct a domain-specific kernel and use it when optimizing control parameters for a recently developed neuromuscular model in simulation.
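To illustrate the general idea of a simulator-informed kernel (a minimal sketch, not the thesis's exact construction), policy parameters can be compared by the trajectories they produce in a possibly imperfect simulator rather than by raw parameter distance. The `simulate` feature map below is a hypothetical stand-in:

```python
import numpy as np

def simulate(theta):
    # Hypothetical stand-in for a dynamics simulator: maps policy
    # parameters to a trajectory-feature vector. A real system would
    # roll out the model and summarize the resulting behavior.
    t = np.linspace(0.0, 1.0, 20)
    return np.sin(theta[0] * t) + theta[1] * t

def trajectory_kernel(theta_a, theta_b, lengthscale=1.0):
    # Squared-exponential kernel in trajectory-feature space: policies
    # that behave similarly in simulation get high covariance, even if
    # their raw parameters are far apart.
    diff = simulate(theta_a) - simulate(theta_b)
    return float(np.exp(-0.5 * np.dot(diff, diff) / lengthscale**2))

# Parameter settings that produce similar simulated behavior score
# higher than settings that behave very differently.
k_close = trajectory_kernel(np.array([1.0, 0.5]), np.array([1.01, 0.5]))
k_far = trajectory_kernel(np.array([1.0, 0.5]), np.array([3.0, -0.5]))
print(k_close, k_far)
```

A kernel of this form can be dropped into a standard Gaussian-process surrogate for Bayesian Optimization; because the simulator only shapes the similarity measure, it does not need to predict real-world performance exactly.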
We demonstrate that our approach substantially reduces the number of costly trials. To evaluate our model-free resampling approach, we turn to the domain of education. Here we consider the problem of inferring how many instructional exercises are enough to achieve a learning objective. We show that resampling offers improvements over standard Bayesian Optimization in simulation, and is also effective in real interactions with students/participants.
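The trajectory-reuse idea behind off-policy evaluation of stopping policies can be sketched as follows (illustrative code under simplifying assumptions, not the thesis implementation; `collect_trajectory`, the mastery signal, and the reward values are hypothetical). If each recorded trajectory is run out to the full horizon, any threshold-style stopping policy can later be scored on that same data by finding the first step at which it would have stopped:

```python
import random

def collect_trajectory(horizon=10, seed=0):
    # Behavior policy: never stop early; record per-step observations
    # and rewards up to the horizon. Observations here are a noisy
    # synthetic 'mastery' signal that grows with practice.
    rng = random.Random(seed)
    return [(step * 0.1 + rng.uniform(-0.05, 0.05), -1.0)
            for step in range(horizon)]

def evaluate_stopping_policy(threshold, trajectories, terminal_bonus=10.0):
    # Off-policy evaluation by reuse: replay each stored trajectory and
    # stop at the first observation exceeding the threshold. Scoring a
    # new policy requires no fresh samples from the environment.
    total = 0.0
    for traj in trajectories:
        ret = 0.0
        for obs, reward in traj:
            ret += reward  # per-exercise cost
            if obs >= threshold:
                ret += terminal_bonus  # learning objective reached
                break
        total += ret
    return total / len(trajectories)

trajs = [collect_trajectory(seed=s) for s in range(50)]
# Score many candidate stopping thresholds against the same stored data.
scores = {t: evaluate_stopping_policy(t, trajs) for t in (0.2, 0.5, 0.8)}
print(scores)
```

Because every candidate policy is evaluated on the same stored trajectories, an optimizer can sweep a large set of stopping rules while the number of costly environment interactions stays fixed.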

author = {Rika Antonova},
title = {Sample Efficient Bayesian Optimization for Policy Search: Case Studies in Robotics and Education},
year = {2016},
month = {August},
school = {Carnegie Mellon University},
address = {Pittsburgh, PA},
number = {CMU-RI-TR-16-40},
}