
Planning to Optimize and Learn Reward in Navigation Tasks in Structured Environments with Time Constraints

PhD Thesis, Tech. Report CMU-RI-TR-21-53, Robotics Institute, Carnegie Mellon University, August 2021

Abstract

Planning problems in which an agent must perform tasks for reward by navigating its environment, constrained by time and location, have a wide variety of applications in robotics. Many of the real-world environments in which such problems arise, such as office buildings or city streets, are highly structured: they consist of passages, such as hallways or streets, with notable locations along them, and those locations may be grouped into clusters, such as the floors of a building, separated by bottlenecks such as elevators.

In this thesis, we introduce algorithms for an agent that performs tasks for reward while constrained by time and location in a structured environment. Our goal is to exploit the structure of the environment during planning in order to improve the reward received, and we present three algorithms that do so in different ways. First, we present the Task Graph algorithm, which groups the actions the agent can perform into larger "tasks" and builds its plan as a sequence of tasks rather than of individual actions. We test the algorithm on a problem we call "Exploration Scheduling," in which an agent gathers information during the free time between mandatory tasks in order to improve the reward received from those tasks, and find that it yields strong results.
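As a concrete illustration of the task-grouping idea, consider the minimal Python sketch below. It is not the thesis implementation: the names Task and plan_task_sequence, and the greedy reward-rate heuristic, are our own assumptions standing in for the actual Task Graph planner.

from dataclasses import dataclass

@dataclass
class Task:
    # A group of primitive actions treated as a single planning unit.
    name: str
    duration: float  # total time needed to execute the grouped actions
    reward: float    # estimated reward for completing the task

def plan_task_sequence(tasks, time_budget):
    # Greedily pick tasks by reward rate until the time budget is exhausted.
    plan, remaining = [], time_budget
    for task in sorted(tasks, key=lambda t: t.reward / t.duration, reverse=True):
        if task.duration <= remaining:
            plan.append(task)
            remaining -= task.duration
    return plan

# Example: choosing exploration tasks for a 30-unit gap between mandatory tasks.
candidates = [Task("scan_hallway", 10, 4.0),
              Task("visit_office", 12, 6.0),
              Task("check_lab", 8, 2.5)]
print([t.name for t in plan_task_sequence(candidates, 30.0)])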

Second, we present the RegionPlan algorithm, which divides the environment into regions based on its structure and creates a separate plan for each region, from which the agent chooses the best. We test RegionPlan on a variant of the orienteering problem in simulated structured environments, and demonstrate environment structures and reward distributions in which the region-based approach allows the agent to avoid getting stuck in certain locally optimal solutions.
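The region-decomposition idea can be sketched as follows, assuming any single-region planner is available. The names region_plan and plan_region are illustrative, not the thesis's API.

def region_plan(regions, time_budget, plan_region):
    # regions: dict mapping a region id to the locations it contains.
    # plan_region: any single-region planner returning (plan, reward).
    # Planning each region separately and keeping the best plan lets the
    # agent compare structurally different solutions instead of settling
    # into a single local optimum.
    best_plan, best_reward = None, float("-inf")
    for locations in regions.values():
        plan, reward = plan_region(locations, time_budget)
        if reward > best_reward:
            best_plan, best_reward = plan, reward
    return best_plan, best_reward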

Third, we present Route-Based Multi-Level Variable Neighborhood Search (Route-MLVNS), a variable neighborhood search algorithm that represents solutions as routes the agent takes through the environment rather than as sequences of actions. When evaluating a solution, Route-MLVNS therefore accounts for the reward of every location the agent passes as it travels, rather than only the reward of the single location it is traveling to. We test Route-MLVNS in a simulation of the orienteering problem in structured environments, comparing it to an MLVNS algorithm from the literature that treats solutions as ordinary plans of tasks, and find that Route-MLVNS yields superior results. We also demonstrate a capability unique to Route-MLVNS: solving the orienteering problem for a continuous range of time constraints simultaneously, rather than for only a single time constraint or a discrete set of them.
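The route-based evaluation can be sketched in Python as below. This is an illustration under our own assumptions: passes_by and reward_of stand in for whatever path and reward models the environment provides.

def route_reward(route, passes_by, reward_of):
    # route: sequence of waypoints the agent travels between.
    # passes_by(a, b): locations lying on the path from a to b.
    # Scoring the union of destination and passed-by locations is what
    # distinguishes a route-based solution from a plain task sequence.
    visited = set(route)
    for a, b in zip(route, route[1:]):
        visited.update(passes_by(a, b))
    return sum(reward_of(loc) for loc in visited)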

Finally, we consider scenarios in which an agent solves repeated instances of the orienteering problem in a structured environment with stochastic rewards that are initially unknown. We present algorithms that treat locations in the environment as arms of a multi-armed bandit, assign each a value based on both its expected reward and the potential value of the knowledge gained by visiting it, and use planners based on the environment structure to find a plan that maximizes the total assigned value. Our experiments demonstrate the algorithms' effectiveness at both accumulating reward over time and learning an effective model. Additionally, we analyze the relationship between the structure of the environment and the distribution of the data the agent receives.
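One standard way to combine expected reward with the value of information is a UCB1-style exploration bonus. The sketch below shows this kind of valuation as an example; it is an assumption on our part, not necessarily the exact formula used in the thesis.

import math

def location_value(mean_reward, visits, total_visits, c=1.0):
    # Unvisited locations get an infinite value so they are tried first.
    if visits == 0:
        return float("inf")
    # The exploration bonus shrinks as a location is sampled more often,
    # so well-understood locations are valued mostly by expected reward.
    bonus = c * math.sqrt(2.0 * math.log(total_visits) / visits)
    return mean_reward + bonus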

Overall, our work demonstrates the benefit of an agent considering and exploiting the structure of its environment when planning tasks, and shows how that structure affects the data the agent can gather in environments with unknown stochastic rewards.

BibTeX

@phdthesis{Korein-2021-129420,
author = {Max Korein},
title = {Planning to Optimize and Learn Reward in Navigation Tasks in Structured Environments with Time Constraints},
year = {2021},
month = {August},
school = {Carnegie Mellon University},
address = {Pittsburgh, PA},
number = {CMU-RI-TR-21-53},
keywords = {Task Planning, Orienteering Problem, Structured Environments, Reinforcement Learning},
}