
A Study of Reinforcement Learning in the Continuous Case by the Means of Viscosity Solutions

Remi Munos
Journal Article, Machine Learning, Vol. 40, No. 3, pp. 265-299, September 2000

Abstract

This paper proposes a study of Reinforcement Learning (RL) for continuous state-space and time control problems, based on the theoretical framework of viscosity solutions (VSs). We use the method of dynamic programming (DP), which introduces the value function (VF), the expectation of the best future cumulative reinforcement. In the continuous case, the value function satisfies a non-linear first-order (or second-order, depending on whether the process is deterministic or stochastic) differential equation called the Hamilton-Jacobi-Bellman (HJB) equation. It is well known that this equation admits infinitely many generalized solutions (differentiable almost everywhere) other than the VF. We show that gradient-descent methods may converge to one of these generalized solutions, thus failing to find the optimal control. To solve the HJB equation, we use the powerful framework of viscosity solutions and present the main result that there exists a unique viscosity solution to the HJB equation, which is the value function. Then, we use another main result on VSs (their stability when passing to the limit) to prove the convergence of numerical approximation schemes based on finite difference (FD) and finite element (FE) methods. These methods discretize, at some resolution, the HJB equation into the DP equation of a Markov Decision Process (MDP), which could be solved by DP methods (thanks to a "strong" contraction property) if all the initial data (the state dynamics and the reinforcement function) were perfectly known. However, in the RL approach we consider a system interacting with an a priori (at least partially) unknown environment and learning "from experience", so the initial data are not perfectly known but have to be approximated during learning. The main contribution of this work is a general convergence theorem for RL algorithms that use only "approximations" (in the sense of satisfying some "weak" contraction property) of the initial data. This result applies to model-based or model-free RL algorithms, with off-line or on-line updating methods, for deterministic or stochastic state dynamics (though the latter case is not described here), and based on FE or FD discretization methods. It is illustrated with several RL algorithms and a numerical simulation of the "Car on the Hill" problem.
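
As a pointer for readers new to this setting, here is a minimal sketch of the two equations the abstract refers to, written in generic notation (dynamics f, reinforcement r, discount rate \lambda, value function V) that is assumed rather than taken from the paper. For a deterministic, discounted continuous-time control problem, the first-order HJB equation reads

\lambda V(x) = \sup_{u \in U} \left[ \nabla V(x) \cdot f(x, u) + r(x, u) \right],

and a discretization at some resolution \delta replaces it by the Bellman (DP) equation of a finite MDP with states \xi, transition probabilities p(\eta \mid \xi, u), and discount factor \gamma \in (0, 1):

V_\delta(\xi) = \max_{u} \left[ r(\xi, u) + \gamma \sum_{\eta} p(\eta \mid \xi, u) \, V_\delta(\eta) \right].

The right-hand side of the latter defines a \gamma-contraction in the sup norm, which is the "strong" contraction property that lets DP methods solve the discretized problem when the initial data are known exactly.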

BibTeX

@article{Munos-2000-16684,
author = {Remi Munos},
title = {A Study of Reinforcement Learning in the Continuous Case by the Means of Viscosity Solutions},
journal = {Machine Learning},
year = {2000},
month = {September},
volume = {40},
number = {3},
pages = {265--299},
}