Watch, Practice, Improve: Towards In-the-wild Manipulation

PhD Thesis, Tech. Report, CMU-RI-TR-24-04, February 2024

Abstract

The longstanding dream of many roboticists is to see robots perform diverse tasks in diverse environments. To build robots that can operate anywhere, many methods train on robotic interaction data. While these approaches have led to significant advances, they rely on heavily engineered setups or large amounts of supervision, neither of which is scalable. How can we move towards training robots that operate autonomously, in the wild? Unlike computer vision and natural language processing, where a staggering amount of data is available on the internet, robotics faces a chicken-and-egg problem: to train robots to work in diverse scenarios, we need a large amount of robot data from diverse environments, but to collect this kind of data, we need robots to be deployed widely, which is feasible only if they are already proficient. How can we break this deadlock?
The proposed solution, and the goal of my thesis, is to use an omnipresent source of rich interaction data: humans. Fortunately, there are plenty of real-world human interaction videos on the internet, which can help bootstrap robot learning by sidestepping the expensive aspects of the data collection and training loop. To this end, we first aim to learn manipulation by watching humans perform various tasks. We circumvent the embodiment gap by imitating the effect the human has on the environment rather than the exact actions: we obtain interaction priors and subsequently practice directly in the real world to improve. To move beyond explicit human supervision, the second work in the thesis predicts robot-centric visual affordances, namely where to interact and how to move post-interaction, directly from offline human video datasets. We show that this model can be seamlessly integrated into any robot learning paradigm. The third part of the thesis focuses on how to build general-purpose policies by leveraging human data. We show that world models are a strong mechanism for sharing representations across human and robot data from many different environments, and we use a structured, affordance-based action space to train multitask policies, which greatly boosts performance. In the fourth work of the thesis, we investigate how to use human data to build actionable representations for control. Our key insight is to move beyond traditional training of visual encoders and use human actions and affordances to improve the model. We find that this approach can improve real-world imitation learning performance for almost any pre-trained model, across multiple challenging tasks. Finally, visual affordances may struggle to capture complex action spaces, especially for high-degree-of-freedom robots such as dexterous hands. Thus, in the final works of the thesis, we explore how to learn more explicit, physically grounded action priors from human videos, mainly in the context of dexterous manipulation.
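
To make the affordance interface above concrete, below is a minimal, hypothetical PyTorch sketch of a model that maps a single RGB frame to a contact-point heatmap ("where to interact") and a short post-contact waypoint trajectory ("how to move post-interaction"). The architecture, layer sizes, and names such as AffordancePredictor and num_waypoints are illustrative assumptions, not the implementation used in the thesis.

# Hypothetical sketch of a visual-affordance predictor: given an RGB frame,
# predict (a) a contact-point heatmap over the image and (b) a short
# post-contact waypoint trajectory. Names and architecture are illustrative.
import torch
import torch.nn as nn
from torchvision.models import resnet18  # any pre-trained visual encoder could be swapped in

class AffordancePredictor(nn.Module):
    def __init__(self, num_waypoints: int = 5):
        super().__init__()
        backbone = resnet18(weights=None)
        # Keep the convolutional trunk so spatial features are preserved.
        self.encoder = nn.Sequential(*list(backbone.children())[:-2])
        # "Where": 1x1 convolution producing contact-heatmap logits over the feature grid.
        self.heatmap_head = nn.Conv2d(512, 1, kernel_size=1)
        # "How": regress 2-D image-plane waypoints describing post-contact motion.
        self.traj_head = nn.Sequential(
            nn.AdaptiveAvgPool2d(1),
            nn.Flatten(),
            nn.Linear(512, 256),
            nn.ReLU(),
            nn.Linear(256, num_waypoints * 2),
        )
        self.num_waypoints = num_waypoints

    def forward(self, rgb: torch.Tensor):
        feats = self.encoder(rgb)                  # (B, 512, H/32, W/32)
        heatmap = self.heatmap_head(feats)         # contact-heatmap logits
        waypoints = self.traj_head(feats)          # (B, num_waypoints * 2)
        return heatmap, waypoints.view(-1, self.num_waypoints, 2)

# Example forward pass on a dummy frame. In practice, labels for both heads
# would be extracted automatically from human videos (e.g. hand-object contact
# points and post-contact hand motion), rather than hand-annotated.
model = AffordancePredictor()
heatmap, waypoints = model(torch.randn(1, 3, 224, 224))
print(heatmap.shape, waypoints.shape)  # torch.Size([1, 1, 7, 7]) torch.Size([1, 5, 2])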

BibTeX

@phdthesis{Bahl-2024-139937,
author = {Shikhar Bahl},
title = {Watch, Practice, Improve: Towards In-the-wild Manipulation},
year = {2024},
month = {February},
school = {Carnegie Mellon University},
address = {Pittsburgh, PA},
number = {CMU-RI-TR-24-04},
keywords = {Robot Learning, Manipulation, Perception},
}