This thesis addresses these challenges through three complementary research directions focused on leveraging large-scale pretrained priors for 3D and 4D reconstruction. The first direction tackles relighting of static content. MaterialFusion introduces a 2D material diffusion prior trained on high-quality PBR assets to guide inverse rendering, enabling accurate disentanglement of geometry, materials, and lighting from multi-view images under unknown illumination. Building on this foundation, LightSwitch presents a multi-view-consistent relighting diffusion framework that leverages inferred material cues to relight objects in as little as 2 minutes, matching or exceeding the quality of optimization-based methods that take hours.
The second direction addresses 4D reconstruction in the wild. Lift4D presents a test-time optimization framework for complete 4D reconstruction from monocular video, using causally conditioned image-to-3D priors and occlusion-aware supervision to handle large deformations and severe occlusions that challenge existing methods. We propose to extend this work by training a feedforward 4D reconstruction model on Lift4D outputs, enabling real-time 4D capture without test-time optimization.
The third direction focuses on video editing. EditCtrl introduces a disentangled editing framework with local and global control that achieves a 10x speedup over state-of-the-art methods while improving editing quality, enabling real-time video editing for applications such as augmented reality. We propose to extend this with an action-conditioned autoregressive framework that treats edited content as an embodied agent, enabling spatially aware generation that responds to scene context and user actions.
Thesis Committee:
Shubham Tulsiani (Co-chair),
Fernando De la Torre (Co-chair),
Kris Kitani,
Christian Richardt, Meta Reality Labs
