
PhD Thesis Defense

Zhipeng Bao
PhD Student, Robotics Institute
Carnegie Mellon University
Monday, November 3
4:00 pm to 5:30 pm
Newell-Simon Hall 4305
Unifying Perception and Creation with Generative Models

Abstract:

Recent advances in large-scale generative modeling have reshaped our understanding of visual intelligence. While models such as diffusion models and autoregressive transformers have achieved remarkable success in image and video synthesis, their potential for visual perception and understanding remains underexplored. This thesis investigates how generative models can serve as powerful visual learners, bridging the long-standing divide between generative and discriminative paradigms.

We begin with REM (Refer Everything Models), a framework for referring video segmentation built upon text-to-video diffusion models. By preserving generative representations and fine-tuning on narrow-domain datasets, REM achieves state-of-the-art results on standard benchmarks and demonstrates strong generalization to unseen domains.

Building on this foundation, we introduce a unified perceptual–generative framework that repurposes a single diffusion model across a broad spectrum of computer vision and image restoration tasks. Through joint training and systematic evaluation across 15 tasks spanning perception and synthesis, we show that diffusion-based models deliver performance superior or comparable to that of their discriminative counterparts, revealing their intrinsic ability to encode rich, multi-modal world representations.

Finally, we extend our exploration to visual autoregressive (VAR) models, presenting the first unified architecture capable of efficiently solving the same 15 tasks within a single framework. We show that latent-variable designs, particularly those leveraging variational autoencoders, are key to achieving coherent multi-modal understanding and consistent generation. Compared with their diffusion counterparts, VAR-based models offer substantial improvements in latency, scalability, and output consistency.

Collectively, these studies offer a cohesive perspective on unifying perception and synthesis through generative modeling, charting a path toward general-purpose visual foundation models that seamlessly integrate understanding, reasoning, and creation.

Thesis Committee:

Martial Hebert, Chair

Deva Ramanan

Jun-Yan Zhu

Alexei Efros, University of California, Berkeley

Yu-Xiong Wang, University of Illinois Urbana-Champaign

Pavel Tokmakov, Toyota Research Institute
