Abstract:
Recent advances in large-scale generative modeling have reshaped our understanding of visual intelligence. While diffusion models and autoregressive transformers have achieved remarkable success in image and video synthesis, their potential for visual perception and understanding remains underexplored. This thesis investigates how generative models can serve as powerful visual learners, bridging the long-standing divide between generative and discriminative paradigms.
We begin with REM (Refer Everything Models), a framework for referring video segmentation built upon text-to-video diffusion models. By preserving generative representations and fine-tuning on narrow-domain datasets, REM achieves state-of-the-art results on standard benchmarks and demonstrates strong generalization to unseen domains.
Building on this foundation, we introduce a unified perceptual–generative framework that repurposes a single diffusion model across a broad spectrum of computer vision and image restoration tasks. Through joint training and systematic evaluation across 15 tasks spanning perception and synthesis, we show that diffusion-based models match or surpass their discriminative counterparts, revealing their intrinsic ability to encode rich, multi-modal world representations.
Finally, we extend our exploration to visual autoregressive (VAR) models, presenting the first unified architecture capable of efficiently solving the same 15 tasks within a single framework. We show that latent-variable designs, particularly those leveraging variational autoencoders, are key to achieving coherent multi-modal understanding and consistent generation. Compared with diffusion counterparts, VAR-based models offer substantial gains in latency, scalability, and output consistency.
Collectively, these studies offer a cohesive perspective on unifying perception and synthesis through generative modeling, charting a path toward general-purpose visual foundation models that seamlessly integrate understanding, reasoning, and creation.
Thesis Committee:
Martial Hebert, Chair
Deva Ramanan
Jun-Yan Zhu
Alexei Efros, University of California, Berkeley
Yu-Xiong Wang, University of Illinois Urbana-Champaign
Pavel Tokmakov, Toyota Research Institute
