Name: RI PhD Thesis Proposal – Hanzhe Hu
Start: 2026-04-17T13:00:00-04:00
End: 2026-04-17T14:30:00-04:00
Location: NSH 3305

Hanzhe Hu, PhD Student
Robotics Institute, Carnegie Mellon University

Date: April 17, 2026

Time: 1 PM-2:30 PM

Location: NSH 3305
Zoom link

Type: RI PhD Thesis Proposal

Who: Hanzhe Hu

Title: Learning to Create 3D Worlds via Multi-View Generation

Abstract:
A 3D world is a visual representation that can be rendered from any viewpoint at any moment in time. Creating such representations from minimal input (a single image, a text prompt, or a monocular video) is a fundamental goal in computer vision and graphics. An emerging and promising approach is multi-view generation. However, multi-view generation introduces its own challenges: maintaining geometric consistency across views, achieving practical inference speed, and extending to dynamic scenes. This thesis addresses these three challenges and presents a path toward creating 3D worlds via multi-view generation.

We first present MVD-Fusion, which tackles consistency by introducing depth-guided cross-view attention for multi-view RGB-D generation from a single image. Intermediate depth estimates enable reprojection-based feature aggregation, enforcing geometric consistency and yielding direct 3D reconstruction without costly optimization. We then address efficiency with Turbo3D, which generates 3D Gaussian Splatting assets from text in under one second. A dual-teacher distillation framework compresses a multi-step multi-view diffusion model into a 4-step generator, while a latent-space reconstructor eliminates image decoding overhead. Finally, we tackle dynamics with GeoVideo4D, a framework for camera-controllable multi-view video generation that simultaneously produces synchronized RGB videos and aligned depth maps through a joint video diffusion process. A hybrid training strategy unifies static 3D, monocular video, and multi-view video data.

Looking ahead, we outline two directions. First, we propose unifying 3D reconstruction and generation by jointly training both tasks in a single model in which cameras are learned in a self-supervised manner, enabling training on large-scale unannotated data. Second, we propose extending video generation to long temporal horizons to support sustained, coherent 3D world generation beyond the short clips produced by current methods.

Thesis Committee:

Shubham Tulsiani (chair)
Deva Ramanan
Jun-Yan Zhu
Jiajun Wu (Stanford University)

Thesis Draft
