Visual Imitation Learning for Robot Manipulation - The Robotics Institute Carnegie Mellon University

Visual Imitation Learning for Robot Manipulation

Master's Thesis, Tech. Report, CMU-RI-TR-19-07, Robotics Institute, Carnegie Mellon University, May, 2019


Imitation learning has been successfully applied to solve a variety of tasks in complex domains where an explicit reward function is not available. However, most imitation learning methods require access to the robot's actions during demonstration. This stands in stark contrast to how we humans imitate: we acquire new skills by simply observing other humans perform a task, mostly relying on the visual information of the scene.

In this thesis, we examine how we can endow a robotic agent with this capability, i.e., how to acquire a new skill via visual imitation learning. A key challenge in learning from raw visual inputs is to extract meaningful information from the input scene and to enable the robotic agent to learn the demonstrated skill from the input data available to it. We present a framework that encodes the visual input of a scene into a factorized graph representation, casting one-shot visual imitation of manipulation skills as a visual correspondence learning problem. We show how to detect corresponding visual entities of various granularities in both the demonstration video and the visual input of the learner during the imitation process. We build upon multi-view self-supervised visual feature learning, data augmentation by synthesis, and category-agnostic object detection to learn scene-specific visual entity detectors. We then measure perceptual similarity between demonstration and imitation by matching the spatial arrangements of corresponding detected visual entities, encoded as a dynamic graph during demonstration and imitation.
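
As a rough illustration of measuring perceptual similarity by matching spatial arrangements, the following is a minimal sketch and not the formulation used in the thesis. It assumes each frame has already been reduced to named 2D entity positions, with correspondence given by shared names. The names `edge_offsets` and `perceptual_reward` and the squared-difference cost are hypothetical choices made only for this example.

```python
from itertools import combinations
from typing import Dict, Tuple

Point = Tuple[float, float]


def edge_offsets(nodes: Dict[str, Point]) -> Dict[Tuple[str, str], Point]:
    """Encode one frame as a graph: one edge per entity pair,
    labelled with the relative offset between the two entities."""
    edges = {}
    for a, b in combinations(sorted(nodes), 2):
        (ax, ay), (bx, by) = nodes[a], nodes[b]
        edges[(a, b)] = (bx - ax, by - ay)
    return edges


def perceptual_reward(demo: Dict[str, Point], imit: Dict[str, Point]) -> float:
    """Negative summed squared difference between corresponding edge
    offsets (assumed cost; higher reward means a closer arrangement)."""
    # Only entities detected in both frames can be compared.
    shared = set(demo) & set(imit)
    demo_edges = edge_offsets({k: demo[k] for k in shared})
    imit_edges = edge_offsets({k: imit[k] for k in shared})
    cost = 0.0
    for key, (dx, dy) in demo_edges.items():
        ix, iy = imit_edges[key]
        cost += (dx - ix) ** 2 + (dy - iy) ** 2
    return -cost


# Hypothetical entity positions for one demonstration frame.
demo_frame = {"hand": (0.0, 0.0), "cup": (1.0, 0.0)}
# Same arrangement, shifted as a whole: relative offsets are unchanged.
shifted_frame = {"hand": (5.0, 5.0), "cup": (6.0, 5.0)}
# Cup displaced relative to the hand: the arrangement differs.
moved_frame = {"hand": (0.0, 0.0), "cup": (2.0, 0.0)}
```

Because only relative offsets enter the cost, `perceptual_reward(demo_frame, shifted_frame)` is zero, while `perceptual_reward(demo_frame, moved_frame)` is negative. In a full pipeline, a per-frame score of this kind would be computed over the whole demonstration and imitation sequences.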

Using different visual entities such as human keypoints and object-centric pixel features, we show how these relational inductive biases regarding object fixations and arrangements can provide accurate perceptual rewards for visual imitation. We show how the proposed image graph encoding drives successful imitation of a variety of manipulation skills within minutes, using a single demonstration and without any environment instrumentation.


author = {Maximilian Sieb},
title = {Visual Imitation Learning for Robot Manipulation},
year = {2019},
month = {May},
school = {Carnegie Mellon University},
address = {Pittsburgh, PA},
number = {CMU-RI-TR-19-07},
keywords = {Visual Imitation Learning, Representation Learning, Reinforcement Learning, Robot Learning},