Customizing Text-to-Image Diffusion Models - Robotics Institute Carnegie Mellon University

PhD Thesis Proposal

Nupur Kumari, PhD Student, Robotics Institute, Carnegie Mellon University
Thursday, October 9
9:30 am to 11:00 am
Newell-Simon Hall 4305
Customizing Text-to-Image Diffusion Models
Abstract: With the rapid advancement of generative models, their potential to transform creative content creation is increasingly evident. However, most large-scale generative models are primarily text-conditioned, given the availability of large-scale paired text–image datasets. In contrast, for most practical applications, creators often begin from an existing asset and wish to generate variations or modify it in specific ways. For images, this may involve placing an object in a new context, adjusting local attributes, or altering visual style. My research focuses on customizing pre-trained generative models, primarily text-to-image diffusion models, to facilitate such downstream tasks. A central challenge here is the lack of paired input–output data for these tasks.

To address this, I explore three complementary directions:

Part I: I study few-shot learning methods, which are computationally efficient but require fine-tuning for each new task instance. This limitation motivates the second direction.

Part II: I construct synthetic paired datasets, using the capabilities of pre-trained generative models themselves, to train feed-forward models in a supervised manner. However, constructing such datasets requires careful curation and filtering, and the datasets risk becoming outdated as base pre-trained models evolve. Building on these insights, my thesis proposes a third paradigm.

Part III: I propose customizing generative models without paired supervision. Instead of relying on paired data, we plan to leverage vision–language models to evaluate task success and provide direct gradient-based feedback to the generative model. This approach has the potential to create a scalable and robust framework for efficient customization of generative models for downstream tasks without relying on synthetic datasets.

 
Thesis Committee:
Jun-Yan Zhu (Chair)
Deva Ramanan
Shubham Tulsiani
Phillip Isola (MIT)