Unified 3D Perception and Generative Control for Generalist Robots
Abstract:
To build robot generalists, we need models that can operate across diverse tasks, scenes, and embodiments. While recent efforts scale data and model capacity and incorporate expressive generative objectives, most still rely on 2D inputs to predict inherently 3D actions—introducing a mismatch between perception and control. In my thesis, I explore how unifying 3D spatial representations with generative models enables policies that are expressive, multimodal, and grounded in the physical world.
My early work established foundations in 3D vision and policy learning, leading to the development of 3D Diffuser Actor (3DDA)—the first 3D diffusion policy for general robotic manipulation. 3DDA demonstrated that combining 2D foundational representations with 3D-aware attention in a generative framework enables multimodal behavior and strong performance across tasks.
In this talk, I will introduce 3D Flow Actor (3DFA), a versatile generalization of 3DDA that supports single-arm, bimanual, and dexterous manipulation, while offering up to 20× faster training and inference by integrating recent advances in generative modeling. 3DFA achieves state-of-the-art results across simulation benchmarks and demonstrates robust real-world performance on the bimanual ALOHA platform, outperforming contemporary policies with 1000× more parameters.
Next, I will present key design choices for effectively scaling 3D policies in size. Our 3DFA-VLA model integrates a vision-language backbone with carefully designed feature upsampling layers to construct a billion-parameter policy that preserves explicit 3D token grounding. This results in strong performance and improved data efficiency compared to 2D vision-language-action models.
I will conclude by outlining future directions enabled by our work, including large-scale real-world training on automatically calibrated RGB-D data.
——————
Thesis committee:
Katerina Fragkiadaki (chair)
Yonatan Bisk
Shubham Tulsiani
Abhishek Gupta (Univ. of Washington)
Link to thesis draft:
Meeting ID: 607 608 9211
Passcode: 689168
