From Pixels to Physical Intelligence: Semantic 3D Data Generation at Internet Scale - Robotics Institute Carnegie Mellon University

PhD Thesis Proposal

Mosamkumar Dabhi, PhD Student, Robotics Institute, Carnegie Mellon University
Friday, May 2
8:30 am to 10:00 am
GHC 4405

Abstract:
Modern AI won't achieve physical intelligence until it can extract rich, semantic spatial knowledge from the wild ocean of internet video, not just from curated motion-capture datasets or expensive 3D scans. This thesis proposes a self-bootstrapping pipeline for converting raw pixels into large-scale 3D and 4D spatial understanding. It begins with multi-view bootstrapping: using just two handheld videos and ~1% of 2D keypoints to produce dense 2D and 3D keypoints, with no calibration and no 3D ground truth required. This sets the stage for geometry-only supervision at scale. Next, a category-agnostic 3D lifting transformer generalizes from a single RGB frame or keypoint set to full 3D shape and pose across dozens of object classes, zero-shot to unseen categories. Then, label-free mixers, a lightweight MLP architecture, rival transformer accuracy in unsupervised 2D-to-3D lifting, showing that with the right inductive biases, 3D supervision becomes unnecessary. Finally, template-free 4D rigging reanimates articulated objects with dynamic dense meshes and skeletal motion, removing the need for SMPL-style priors.

Building on these validated components, this thesis will contribute: (1) a unified framework integrating these approaches into a continuous learning pipeline, (2) extensive evaluation across diverse domains, and (3) demonstrations of physical intelligence capabilities in downstream robotics and AR/VR applications. Preliminary results show that this integrated approach expands coverage of the articulated-data distribution while reducing annotation costs compared to traditional methods. The completed thesis will provide a scalable, geometry-grounded foundation for embodied AI, enabling robots, AR/VR agents, and multimodal systems to perceive, reason, and act robustly in the complex spatial world.

Thesis Committee Members:
Laszlo Jeni (Chair)
Simon Lucey (Co-chair, University of Adelaide)
Katerina Fragkiadaki
Jason Saragih (Meta AI)
