Unified Vision-Language Modeling - Robotics Institute Carnegie Mellon University

MSR Thesis Defense

Alexander Swerdlow, MSR Student, Robotics Institute, Carnegie Mellon University
Thursday, April 24
1:00 pm to 2:00 pm
GHC 4405
Unified Vision-Language Modeling
Abstract:
Recent advances in large-scale language modeling have demonstrated significant success across various tasks, prompting efforts to extend these capabilities to other modalities, including 2D and 3D vision. However, this effort has been met with a variety of challenges due to fundamental differences in data representations, task-specific requirements, and the relative scarcity of large, high-quality annotated datasets for modalities beyond text.

In this thesis, we present two approaches for addressing these challenges. First, we explore discrete diffusion models as a unified generative formulation in the joint text and image domain and demonstrate their advantages over autoregressive models, including improved control over the trade-off between quality and diversity, joint multimodal inpainting, and greater controllability in generation through guidance. Second, we develop a method to jointly train 2D and 3D vision-language models, allowing for knowledge transfer from abundant 2D datasets to comparatively limited 3D tasks. By employing a shared architecture, this approach significantly improves performance on various 3D vision-language tasks.

Committee:
Katerina Fragkiadaki (advisor)
Shubham Tulsiani
Ayush Jain