
Seeing in 3D: Towards Generalizable 3D Visual Representations for Robotic Manipulation

Master's Thesis, Tech. Report CMU-RI-TR-23-16, May 2023

Abstract

Despite recent progress in computer vision and deep learning, robot perception remains a tremendous challenge due to the variation in objects and scenes encountered in manipulation tasks. Ideally, a robot trying to manipulate a new object should be able to reason about the object’s geometric, physical, and topological properties. In this thesis, we investigate strategies that enable a robot to reason about objects from 3D visual signals in a generalizable manner.
In the first project, we propose a vision-based system, FlowBot 3D, that learns to predict the potential motions of the parts of a variety of articulated objects in order to guide downstream motion planning for articulating them. To predict these motions, we train a neural network to output a dense vector field representing the point-wise motion direction of each point in the point cloud under articulation. An analytical motion planner then acts on this vector field to achieve a provably optimal policy. We train a single vision model entirely in simulation across all object categories, and we demonstrate that the system generalizes to unseen object instances and novel categories in both simulation and the real world, deploying the policy on a Sawyer robot with this one trained model.
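
To make the flow-guided policy concrete, below is a minimal sketch in Python (with NumPy) of the greedy step it implies: contact the point whose predicted motion is largest and move along that point's flow direction. The names here (select_action, flow_net, robot.move_along) are illustrative assumptions, not the thesis implementation.

import numpy as np

def select_action(points, flow):
    """Greedy flow-following step.

    points, flow: (N, 3) arrays; `flow` is the network's predicted
    per-point motion direction for the point cloud `points`.
    Returns the contact point with the largest predicted motion and
    its normalized flow direction.
    """
    magnitudes = np.linalg.norm(flow, axis=1)            # (N,)
    best = int(np.argmax(magnitudes))                    # most mobile point
    direction = flow[best] / (magnitudes[best] + 1e-8)   # unit motion direction
    return points[best], direction

# Hypothetical closed-loop use: re-predict flow after each small step.
# points = observe_point_cloud()
# flow = flow_net(points)
# contact, direction = select_action(points, flow)
# robot.move_along(contact, direction, step_size=0.01)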
In the second project, we focus on the task-specific pose relationship between relevant parts of interacting objects. We conjecture that this relationship is a generalizable notion of a manipulation task that can transfer to new objects in the same category; examples include the pose of a pan relative to an oven or the pose of a mug relative to a mug rack. We call this task-specific pose relationship “cross-pose”. We propose a vision-based system, TAX-Pose, that learns to estimate the cross-pose between two objects for a given manipulation task using learned cross-object correspondences. The estimated cross-pose then guides a downstream motion planner to bring the objects into the desired pose relationship. We demonstrate our method’s capability to generalize to unseen objects in the real world.
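
Both TAX-Pose’s correspondence-based estimation and the weighted SVD used below to combine the two systems reduce to a weighted Procrustes alignment: given corresponded point sets and per-correspondence weights, recover the least-squares rigid transform. The sketch below assumes the correspondences and weights have already been predicted; it is a generic weighted-SVD solver, not the exact code from the thesis.

import numpy as np

def weighted_svd_transform(P, Q, w):
    """Least-squares rigid transform (R, t) aligning P to Q under weights w.

    P, Q: (N, 3) corresponded points (e.g., object points and their
    predicted cross-object correspondences); w: (N,) non-negative
    per-correspondence weights. Standard weighted Procrustes/Kabsch step.
    """
    w = w / w.sum()
    p_bar = (w[:, None] * P).sum(axis=0)           # weighted centroids
    q_bar = (w[:, None] * Q).sum(axis=0)
    X = (P - p_bar) * w[:, None]                   # weighted, centered source
    Y = Q - q_bar                                  # centered target
    U, _, Vt = np.linalg.svd(X.T @ Y)              # 3x3 cross-covariance
    d = np.sign(np.linalg.det(Vt.T @ U.T))         # guard against reflections
    R = Vt.T @ np.diag([1.0, 1.0, d]) @ U.T
    t = q_bar - R @ p_bar
    return R, t                                    # apply as R @ p + t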
We also demonstrate that the two systems can be combined, using weighted SVD, to perform more complex manipulation tasks that involve both articulated and free-floating objects. By fine-tuning pretrained FlowBot 3D and TAX-Pose models, we show that the combined system generalizes to a wider variety of manipulation tasks and even to planning.

BibTeX

@mastersthesis{Zhang-2023-135801,
author = {Haolun (Harry) Zhang},
title = {Seeing in 3D: Towards Generalizable 3D Visual Representations for Robotic Manipulation},
year = {2023},
month = {May},
school = {Carnegie Mellon University},
address = {Pittsburgh, PA},
number = {CMU-RI-TR-23-16},
keywords = {Object-centric representations, robot learning, 3D vision},
}