Unconstrained Perception for Scalable Robot Manipulation

PhD Thesis Proposal

Bardienus Duisterhof, PhD Student, Robotics Institute, Carnegie Mellon University
Monday, October 20
3:00 pm to 4:30 pm
Newell-Simon Hall 4305

Abstract: Advances in visual imitation learning, driven by large-scale data and expressive policy architectures, have yielded impressive progress on long-horizon, dexterous tasks. However, current success rates remain insufficient for industrial deployment, which demands near-perfect reliability on novel tasks. Compared to fields such as NLP and CV, robotics has several orders of magnitude less data available. This raises the question: how can we most effectively leverage priors from large-scale offline data? In this thesis, I contribute methods to infer strong geometric and dynamic priors for robot manipulation.

First, geometric camera calibration is a critical prerequisite for real-world vision systems. I will discuss our work on MASt3R-SfM, which performs unconstrained structure-from-motion (SfM) on any image collection with linear complexity. Next, I discuss how DeformGS uses a large set of calibrated cameras to build photorealistic digital twins with millimeter-accurate tracking of deformable cloth. To remove the need for costly multi-camera systems, I introduce RaySt3R, a method that generates complete object geometry from a single RGB-D image.
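A shared building block behind single-view methods like RaySt3R is lifting an RGB-D image into a 3D point cloud with the camera intrinsics. The sketch below is a generic pinhole back-projection utility, not code from RaySt3R itself; the intrinsics and depth values are toy assumptions for illustration.

```python
import numpy as np

def backproject_depth(depth, K):
    """Lift a depth map of shape (H, W) to a point cloud of shape
    (H*W, 3) using pinhole intrinsics K (3x3). Generic utility,
    not the actual RaySt3R implementation."""
    H, W = depth.shape
    u, v = np.meshgrid(np.arange(W), np.arange(H))  # pixel coordinates
    z = depth.ravel()
    # Standard pinhole model: x = (u - cx) * z / fx, y = (v - cy) * z / fy
    x = (u.ravel() - K[0, 2]) * z / K[0, 0]
    y = (v.ravel() - K[1, 2]) * z / K[1, 1]
    return np.stack([x, y, z], axis=1)

# Toy example: a 2x2 depth map at unit depth, made-up intrinsics.
K = np.array([[100.0, 0.0, 1.0],
              [0.0, 100.0, 1.0],
              [0.0, 0.0, 1.0]])
depth = np.ones((2, 2))
pts = backproject_depth(depth, K)  # (4, 3) point cloud
```

A shape-completion model then consumes such a partial, view-dependent point cloud and predicts the unobserved geometry on the far side of the object.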

Building on these works, I will introduce our ongoing work on Flow2Flow, a flexible end-to-end feedforward architecture for zero-shot dynamics prediction. Many challenging tasks involve manipulating unseen articulated, deformable, and cluttered rigid objects; prior approaches rely on pre-trained VLMs, large-scale 2D point tracking, or previous interactions with the scene to inject priors. We instead cast dynamics prediction as a scene flow completion problem from a single RGB-D image, and propose an optional two-stage adaptation procedure for unseen dynamics. We further study scene flow completion as a 3D pretraining objective for multi-task learning, and propose scaling up training on real-world data to build the first benchmark for dynamics prediction from a single image.

Thesis Committee Members:
Jeffrey Ichnowski (Chair)
Deva Ramanan
Shubham Tulsiani
Abhishek Gupta (University of Washington)

Thesis Proposal Draft