
Abstract:
With the growing popularity of Virtual Reality (VR), Augmented Reality (AR), and other 3D applications, developing methods that let everyday users capture and create their own 3D content has become increasingly essential. However, current 3D creation pipelines often require either tedious manual effort or specialized capture setups. Additionally, the resulting assets often suffer from baked-in lighting, inconsistent representations, and a lack of physical plausibility, making them incompatible with downstream applications.
My research addresses these challenges by leveraging priors from other modalities, datasets, and large-scale diffusion models to reduce the required user input to casually captured photos, videos, simple sketches, and text. We first show how depth priors can enable users to digitize 3D scenes without dense data capture, and discuss how to enable interactive 3D editing and generation through 2D user inputs such as sketches. Moreover, we discuss how data and diffusion model priors can be used to generate relightable textures on meshes from text input, ensuring that generated 3D objects are functional in downstream production workflows. For shape generation, we propose an octree-based adaptive tokenization scheme that allocates representational capacity based on shape complexity, enabling higher-fidelity and more efficient reconstruction and generation of 3D shapes. Finally, to ground digital designs in reality, we introduce BrickGPT, which incorporates manufacturing and physics constraints to generate physically stable and buildable toy brick structures from text prompts. Collectively, these contributions bridge the gap between high-level user intent and the creation of editable, functional, and physically realizable 3D content.
Thesis Committee Members:
Jun-Yan Zhu (Co-chair)
Deva Ramanan (Co-chair)
Shubham Tulsiani
Maneesh Agrawala (Stanford)
Noah Snavely (Cornell Tech & Google)