Exploring Safe Reinforcement Learning for Sequential Decision Making

Master's Thesis, Tech. Report, CMU-RI-TR-23-22, June, 2023


Abstract

Safe Reinforcement Learning (RL) focuses on the problem of training a policy to maximize the reward while ensuring safety. It is an important step towards applying RL to safety-critical real-world applications. However, safe RL is challenging due to the trade-off between the two objectives of maximizing the reward and satisfying the safety constraints, which could lead to unstable training and over-conservative behaviors.

In this thesis, we propose two methods that address these issues in safe RL:
(1) We propose Self-paced Safe Reinforcement Learning, which combines a self-paced curriculum on the safety objective with a base safe RL algorithm, PPO-Lagrangian. During training, the policy starts with easy safety constraints, and the difficulty of the constraints is gradually increased until the desired constraints are satisfied. We evaluate our algorithm on the Safety Gym benchmark and demonstrate that the curriculum helps the underlying safe RL algorithm avoid local optima and improves performance on both the reward and safety objectives.
(2) We propose to learn a policy in a modified MDP in which the safety constraints are embedded into the action space. In this "safety-embedded MDP," the output of the RL agent is transformed into a sequence of actions using a trajectory optimizer that is guaranteed to be safe, under the assumption that the robot is currently in a safe and quasi-static configuration.
We evaluate our method in the Safety Gym benchmark and show that we achieve significantly higher rewards and fewer safety violations during training than previous work; further, we have no safety violations during inference.
We also evaluate our method on a real robot box-pushing task and demonstrate that our method can be safely deployed in the real world.
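
The curriculum idea in (1) can be illustrated with a minimal sketch. It is not the thesis's implementation: the class `SelfPacedCostLimit`, the function `update_lagrange_multiplier`, the fixed tightening step, and the "tighten only when the current limit is met" pacing rule are all assumptions made for illustration. The multiplier update is the standard projected dual ascent step used in Lagrangian-based safe RL; the policy update itself (PPO) is omitted.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class SelfPacedCostLimit:
    """Hypothetical curriculum over the cost limit of a constrained MDP.

    Starts from a loose (easy) limit and tightens it toward the target
    limit whenever the policy satisfies the current limit.
    """
    start_limit: float            # easy initial constraint (assumed value)
    target_limit: float           # desired final constraint
    step: float                   # assumed fixed tightening per stage
    current: Optional[float] = None

    def __post_init__(self) -> None:
        if self.current is None:
            self.current = self.start_limit

    def update(self, mean_episode_cost: float) -> float:
        # Tighten only if the policy already meets the current limit.
        if mean_episode_cost <= self.current:
            self.current = max(self.target_limit, self.current - self.step)
        return self.current


def update_lagrange_multiplier(lam: float, mean_episode_cost: float,
                               cost_limit: float, lr: float) -> float:
    """Projected dual ascent: raise lam when the cost exceeds the limit,
    lower it otherwise, and keep it non-negative."""
    return max(0.0, lam + lr * (mean_episode_cost - cost_limit))


if __name__ == "__main__":
    schedule = SelfPacedCostLimit(start_limit=100.0, target_limit=25.0, step=25.0)
    lam = 0.0
    # Made-up per-iteration mean episode costs, for demonstration only.
    for cost in [120.0, 80.0, 60.0, 40.0, 20.0]:
        limit = schedule.update(cost)
        lam = update_lagrange_multiplier(lam, cost, limit, lr=0.01)
        print(f"cost={cost:6.1f}  limit={limit:6.1f}  lambda={lam:.3f}")
```

In this sketch the multiplier is always updated against the current curriculum limit rather than the final target, so early in training the penalty on cost stays small and the policy can first learn to collect reward.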

BibTeX

@mastersthesis{Yang-2023-136124,
author = {Fan Yang},
title = {Exploring Safe Reinforcement Learning for Sequential Decision Making},
year = {2023},
month = {June},
school = {Carnegie Mellon University},
address = {Pittsburgh, PA},
number = {CMU-RI-TR-23-22},
keywords = {Robotics, Reinforcement Learning, Safe Reinforcement Learning, Safety, Machine Learning, Curriculum, Trajectory Optimization, Constraint},
}
