Unifying Perception and Creation with Generative Models
Abstract
The field of generative models has made remarkable progress in synthesizing photorealistic visual content such as images and videos, as well as text. However, the potential of these powerful generative models, such as diffusion-based and auto-regressive transformers, for visual perception and understanding remains underexplored. This thesis investigates how generative models can serve as powerful visual learners -- bridging the long-standing divide between generative and discriminative paradigms.
In the first work, we introduce a unified, versatile, diffusion-based framework, Diff-2-in-1, that can simultaneously handle both multi-modal data generation and dense visual perception, through a unique exploitation of the diffusion-denoising process.
Within this framework, we further enhance discriminative visual perception via multi-modal generation by utilizing the denoising network to create multi-modal data that mirrors the distribution of the original training set. Importantly, Diff-2-in-1 makes full use of this diverse and faithful created data through a novel self-improving learning mechanism.
In the second work, we extend such a diffusion-based framework from image understanding to video understanding. Specifically, we explore the potential of diffusion models for video understanding by analyzing the feature representations learned by both image- and video-based diffusion models, alongside non-generative, self-supervised approaches. Our findings reveal that video diffusion models consistently rank among the top performers, particularly excelling at modeling temporal dynamics and scene structure. This observation not only sets them apart from image-based diffusion models but also opens a new direction for advancing video understanding, offering a fresh alternative to traditional discriminative pre-training objectives.
In the third work, beyond merely leveraging video diffusion models as feature extractors, we present REM, a framework for segmenting a wide range of concepts in video that can be described through natural language, by repurposing text-to-video diffusion models for referral video segmentation. A key insight of our approach is preserving as much of the generative model's original representation as possible, while fine-tuning it on narrow-domain referral object segmentation datasets. As a result, our framework can accurately segment and track rare and unseen objects, despite being trained on object masks from a limited set of categories. Our experiments show that REM performs on par with state-of-the-art approaches on in-domain datasets such as Ref-DAVIS, while outperforming them by up to 12 points in region similarity on out-of-domain data, leveraging the power of Internet-scale pre-training.
Finally, building on this foundation, we introduce a unified perceptual–generative framework that repurposes generative models, spanning both diffusion and visual auto-regressive (VAR) architectures, across a broad spectrum of perception, restoration, and editing tasks. For the diffusion variant, we show that diffusion-based models deliver superior or comparable performance to discriminative counterparts, revealing their intrinsic ability to encode rich, multi-modal world representations. For the VAR variant, we present the first unified architecture capable of efficiently solving 15 tasks within a single framework. We show that latent-variable designs, particularly those leveraging variational autoencoders, are key to achieving coherent multi-modal understanding and consistent generation. Compared with their diffusion counterparts, VAR-based models offer substantial gains in latency, scalability, and output consistency.
Collectively, these studies offer a cohesive perspective on unifying perception and synthesis through generative modeling, charting a path toward general-purpose visual foundation models that seamlessly integrate understanding, reasoning, and creation.
BibTeX
@phdthesis{Bao-2025-149638,
author = {Zhipeng Bao},
title = {Unifying Perception and Creation with Generative Models},
year = {2025},
month = {November},
school = {Carnegie Mellon University},
address = {Pittsburgh, PA},
number = {CMU-RI-TR-25-99},
keywords = {Generative Models, Diffusion Models, Unified Models},
}