A 3D world is a visual representation that can be rendered from any viewpoint at any moment in time. Creating such representations from minimal input (a single image, a text prompt, or a monocular video) is a fundamental goal in computer vision and graphics. An emerging and promising route to this goal is multi-view generation. However, multi-view generation introduces its own challenges: maintaining geometric consistency across views, achieving practical inference speed, and extending to dynamic scenes. This thesis addresses these three challenges and presents a path toward creating 3D worlds via multi-view generation.
We first present MVD-Fusion, which tackles consistency by introducing depth-guided cross-view attention for multi-view RGB-D generation from a single image. Intermediate depth estimates enable reprojection-based feature aggregation, enforcing geometric consistency and yielding direct 3D reconstruction without costly optimization. We then address efficiency with Turbo3D, which generates 3D Gaussian Splatting assets from text in under one second. A dual-teacher distillation framework compresses a multi-step multi-view diffusion model into a 4-step generator, while a latent-space reconstructor eliminates image decoding overhead. Finally, we tackle dynamics with GeoVideo4D, a framework for camera-controllable multi-view video generation that simultaneously produces synchronized RGB videos and aligned depth maps through a joint video diffusion process, with a hybrid training strategy unifying static 3D, monocular video, and multi-view video data.
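To make the idea of depth-based reprojection concrete, here is a minimal sketch of the standard pinhole-camera warp that maps each pixel of one view into another view using its depth. It is not the MVD-Fusion implementation. The function name `reproject_pixels`, the shared intrinsics `K`, and the source-to-target transform `T_src_to_tgt` are hypothetical names introduced only for illustration.

```python
import numpy as np

def reproject_pixels(depth, K, T_src_to_tgt):
    """Map every source pixel into a target view using its depth.

    depth:        (H, W) depth along the source camera's z-axis.
    K:            (3, 3) pinhole intrinsics, assumed shared by both views.
    T_src_to_tgt: (4, 4) rigid transform from source to target camera frame.
    Returns (H, W, 2) pixel coordinates in the target view and (H, W) target depth.
    Points with zero target depth are not handled in this sketch.
    """
    H, W = depth.shape
    # u indexes columns, v indexes rows; both have shape (H, W).
    u, v = np.meshgrid(np.arange(W), np.arange(H))
    pix = np.stack([u, v, np.ones_like(u)], axis=-1).astype(np.float64)

    # Back-project: pixel -> camera ray -> 3D point in the source frame.
    rays = pix @ np.linalg.inv(K).T
    pts_src = rays * depth[..., None]

    # Move the points into the target camera frame.
    R, t = T_src_to_tgt[:3, :3], T_src_to_tgt[:3, 3]
    pts_tgt = pts_src @ R.T + t

    # Project into the target image plane.
    proj = pts_tgt @ K.T
    z = proj[..., 2]
    uv = proj[..., :2] / z[..., None]
    return uv, z
```

The returned coordinates could then be used to sample features from the target view at the locations each source pixel lands on, which is the general shape of reprojection-based aggregation. How the features are actually sampled and fused in the thesis is not specified here.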
Looking ahead, we outline two directions. The first is to unify 3D reconstruction and generation by jointly training both tasks in a single model in which cameras are learned in a self-supervised manner, enabling training on large-scale unannotated data. The second is to extend video generation to long temporal horizons, supporting sustained, coherent 3D world generation beyond the short clips produced by current methods.
