View Generalizable Manipulation Policies via Sim-to-Real Transfer - Robotics Institute Carnegie Mellon University
MSR Thesis Presentation

Maxwell Mino Nakura-Fan
MSR Student, Robotics Institute, Carnegie Mellon University
Monday, July 6
10:00 am to 11:00 am
Newell-Simon Hall 4305
View Generalizable Manipulation Policies via Sim-to-Real Transfer
Abstract: Visual imitation learning is a promising approach to training robot manipulation policies capable of completing a wide variety of tasks. A key requirement for these manipulation policies is to exhibit robust generalization capabilities when deployed in the real world, where the objects, scenes, and sensors a robot encounters differ from those seen during training. In practice, learned policies often remain brittle to these changes, which limits their usefulness beyond the narrow conditions in which they were trained.

In this thesis, we study manipulation policies that remain performant under camera viewpoint shifts, so that a single policy can be deployed across various camera poses in the real world. We approach this by grounding the policy in the robot frame, so that it reasons about the scene and the robot’s actions in a shared frame rather than relative to a particular camera. In the first part, we present ArticuBot, in which a single learned policy enables a robotic system to open diverse categories of unseen articulated objects in the real world. The policy operates on point clouds in the robot frame, and we find that it remains robust under camera viewpoint changes, including camera poses not seen during training, while generalizing across objects that vary widely in geometry, size, and articulation. By generating a large number of demonstrations in physics-based simulation and distilling them into a hierarchical, point cloud-based neural policy via imitation learning, we demonstrate an effective policy learning approach that also achieves object-level generalization.

In the second part, we bring this robot-frame reasoning to image-based policies, which benefit from large-scale pretraining and scalability that point cloud policies lack. We present VGP, an image-based policy that encodes the scene with geometry-aware visual features and grounds its visual, proprioception, and action tokens in a shared robot frame. This grounding allows the policy to remain robust across a wide range of camera poses, outperforming 2D and 3D baselines while matching fixed-camera baselines. As a practical consequence, our policy transfers zero-shot from simulation to the real world under random camera configurations.

Across these two parts, we show how large-scale simulation and imitation learning, together with grounding the policy in the robot frame, can be used to train manipulation policies that remain robust as the camera viewpoint changes and transfer to the real world.
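
As a rough illustration of why a robot-frame representation is insensitive to camera placement, the sketch below maps points observed by two differently posed cameras into a common robot base frame using each camera's extrinsic transform. It is not the ArticuBot or VGP implementation. The function names, camera poses, and scene points are hypothetical, and the rotations are restricted to yaw for brevity.

```python
import math

def make_pose(yaw, translation):
    """Hypothetical camera-to-robot transform: yaw rotation about z plus a translation (4x4)."""
    c, s = math.cos(yaw), math.sin(yaw)
    tx, ty, tz = translation
    return [
        [c, -s, 0.0, tx],
        [s, c, 0.0, ty],
        [0.0, 0.0, 1.0, tz],
        [0.0, 0.0, 0.0, 1.0],
    ]

def invert_pose(T):
    """Invert a rigid transform: rotation becomes R^T, translation becomes -R^T t."""
    R_t = [[T[j][i] for j in range(3)] for i in range(3)]
    t = [T[i][3] for i in range(3)]
    new_t = [-sum(R_t[i][k] * t[k] for k in range(3)) for i in range(3)]
    return [R_t[i] + [new_t[i]] for i in range(3)] + [[0.0, 0.0, 0.0, 1.0]]

def transform_points(T, points):
    """Apply a 4x4 rigid transform to a list of (x, y, z) points."""
    out = []
    for x, y, z in points:
        p = (x, y, z, 1.0)
        out.append(tuple(sum(T[i][k] * p[k] for k in range(4)) for i in range(3)))
    return out

# Assumed scene points, expressed in the robot base frame (meters).
points_robot = [(0.5, 0.0, 0.2), (0.6, -0.1, 0.3)]

# Two assumed camera poses, each given as a camera-to-robot transform.
cam_a = make_pose(0.3, (1.0, 0.5, 0.8))
cam_b = make_pose(-1.2, (0.2, -0.9, 0.6))

# What each camera would observe: the same points in its own camera frame.
obs_a = transform_points(invert_pose(cam_a), points_robot)
obs_b = transform_points(invert_pose(cam_b), points_robot)

# Mapping each observation back with that camera's extrinsics recovers
# the same robot-frame points, regardless of where the camera was placed.
rec_a = transform_points(cam_a, obs_a)
rec_b = transform_points(cam_b, obs_b)
```

The camera-frame observations `obs_a` and `obs_b` differ, while `rec_a` and `rec_b` agree up to floating-point error. A policy that consumes robot-frame inputs therefore sees the same geometry from either viewpoint, provided the extrinsics are known.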

Committee: 
Prof. David Held (co-chair)
Prof. Zackory Erickson (co-chair)
Prof. Shubham Tulsiani
Krishna Suresh