Visual Imitation Learning for Robot Manipulation

Maximilian Sieb
Master's Thesis, Tech. Report, CMU-RI-TR-19-07, May, 2019

Copyright notice: This material is presented to ensure timely dissemination of scholarly and technical work. Copyright and all rights therein are retained by authors or by other copyright holders. All persons copying this information are expected to adhere to the terms and constraints invoked by each author's copyright. These works may not be reposted without the explicit permission of the copyright holder.


Imitation learning has been successfully applied to a variety of tasks in complex domains where an explicit reward function is not available. However, most imitation learning methods require access to the robot’s actions during demonstration. This stands in stark contrast to how humans imitate: we acquire new skills simply by observing other humans perform a task, relying mostly on the visual information of the scene.

In this thesis, we examine how we can endow a robotic agent with this capability, i.e., how it can acquire a new skill via visual imitation learning. A key challenge in learning from raw visual inputs is to extract meaningful information from the input scene and to enable the robotic agent to learn the demonstrated skill from the accessible input data. We present a framework that encodes the visual input of a scene into a factorized graph representation, casting one-shot visual imitation of manipulation skills as a visual correspondence learning problem. We show how to detect corresponding visual entities of various granularities in both the demonstration video and the visual input of the learner during the imitation process. We build upon multi-view self-supervised visual feature learning, data augmentation by synthesis, and category-agnostic object detection to learn scene-specific visual entity detectors. We then measure perceptual similarity between demonstration and imitation by matching the spatial arrangements of corresponding detected visual entities, encoded as a dynamic graph during demonstration and imitation.

Using different visual entities such as human keypoints and object-centric pixel features, we show how these relational inductive biases regarding object fixations and arrangements can provide accurate perceptual rewards for visual imitation. We show how the proposed image graph encoding drives successful imitation of a variety of manipulation skills within minutes, using a single demonstration and without any environment instrumentation.
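As a rough illustration of the idea of comparing spatial arrangements of corresponding entities, the following minimal sketch computes a toy graph-matching cost and uses its negative as a perceptual reward. It is not the formulation from the thesis: the function names (`pairwise_offsets`, `graph_matching_cost`, `perceptual_reward`), the squared-error edge cost, and the use of 2D entity positions are all assumptions made for this example. Entity detection and correspondence are assumed to be already solved, so row i of both inputs refers to the same entity.

```python
import numpy as np


def pairwise_offsets(points: np.ndarray) -> np.ndarray:
    """Return an (N, N, D) array whose [i, j] entry is points[j] - points[i]."""
    return points[None, :, :] - points[:, None, :]


def graph_matching_cost(demo_points, learner_points) -> float:
    """Toy cost (assumed, not from the thesis): squared difference between
    the relative offsets of every pair of corresponding entities."""
    demo = np.asarray(demo_points, dtype=float)
    learner = np.asarray(learner_points, dtype=float)
    if demo.shape != learner.shape:
        raise ValueError("demo and learner must contain the same entities")
    diff = pairwise_offsets(demo) - pairwise_offsets(learner)
    # Every unordered pair appears twice (i->j and j->i), so halve the sum.
    return 0.5 * float(np.sum(diff ** 2))


def perceptual_reward(demo_points, learner_points) -> float:
    """Higher reward means the learner's arrangement is closer to the demo's."""
    return -graph_matching_cost(demo_points, learner_points)


# Hypothetical 2D positions of three corresponding entities in one frame.
demo_frame = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 2.0]])
shifted_frame = demo_frame + np.array([5.0, 5.0])   # same arrangement, moved
perturbed_frame = demo_frame.copy()
perturbed_frame[1] = [2.0, 0.0]                      # one entity displaced

print(perceptual_reward(demo_frame, shifted_frame))
print(perceptual_reward(demo_frame, perturbed_frame))
```

Because this toy cost is built only from relative offsets between entities, shifting the whole scene leaves the reward unchanged, while displacing a single entity lowers it. In a time-varying setting, one would evaluate such a cost per frame against the corresponding demonstration frame.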

author = {Maximilian Sieb},
title = {Visual Imitation Learning for Robot Manipulation},
year = {2019},
month = {May},
school = {Carnegie Mellon University},
address = {Pittsburgh, PA},
number = {CMU-RI-TR-19-07},
keywords = {Visual Imitation Learning, Representation Learning, Reinforcement Learning, Robot Learning},
}