Unified 3D Perception and Generative Control for Generalist Robots
Abstract
To enable robot generalists that can operate across tasks, scenes, and embodiments, we need policies that are expressive, multimodal, and grounded in 3D spatial understanding. This thesis explores how unifying 3D perception with generative policy learning advances cross-domain generalization in intelligent robot agents.
We begin by developing foundational 3D perception systems: an open-vocabulary detector adaptable to both 2D and 3D scenes, a unified segmentation model that bridges 2D-3D visual domains, and memory-prompted networks that discover 3D correspondences across scenes without supervision. These advances establish the architectural groundwork for general-purpose 3D manipulation.
Building on these foundations, we present a suite of generative 3D manipulation models (including goal generators, planners, and equivariant policies) that progressively scale in complexity and task versatility. This culminates in 3D Diffuser Actor (3DDA), the first 3D diffusion policy for robotic manipulation, enabling multimodal behaviors and strong task performance. We further generalize this to 3D Flow Actor (3DFA), a versatile policy architecture that supports both single-arm and bimanual manipulation, trains and runs inference 20 times faster, and outperforms contemporary models that are even 1000 times larger.
Finally, we show that, when coupled with key architectural choices, scaling 3D policies to billions of parameters can preserve their spatial grounding and improve data efficiency. This thesis demonstrates that 3D-centric generative policies not only unlock robust and versatile robot behaviors but also pave the way for scalable policy design, advancing the vision of generalist robots.
BibTeX
@phdthesis{Gkanatsios-2025-148225,
author = {Nikolaos Gkanatsios},
title = {Unified 3D Perception and Generative Control for Generalist Robots},
year = {2025},
month = {August},
school = {Carnegie Mellon University},
address = {Pittsburgh, PA},
number = {CMU-RI-TR-25-76},
keywords = {Robot learning, Manipulation, 3D representation learning, 3D vision, Generative models, Diffusion, Flow matching},
}