Unified Vision-Language Modeling

Master's Thesis, Tech. Report CMU-RI-TR-25-23, April 2025

Abstract

Recent developments in language modeling on large-scale datasets have yielded profound success across a variety of tasks. Many researchers have therefore sought to extend these capabilities to other modalities, such as 2D or 3D vision. This effort, however, faces several challenges stemming from differences in 1) the underlying data representations, 2) the desired tasks for each modality, and 3) the availability of large, high-quality datasets. In this thesis, we present two approaches to these challenges. First, we explore discrete diffusion models as a unified generative formulation in the joint text and image domain and demonstrate their advantages over autoregressive models, including finer control over the quality-versus-diversity trade-off, joint multimodal inpainting, and greater controllability through guidance. Second, we present a method for unifying the training of 2D and 3D vision-language models in a single model, enabling large-scale 2D datasets to be leveraged for 3D tasks.
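To make the first contribution concrete: in absorbing-state (masked) discrete diffusion, generation starts from a fully masked token sequence and iteratively reveals tokens, rather than decoding left to right as an autoregressive model does. The sketch below is a minimal toy illustration of that sampling loop, not the thesis's actual model; the denoiser here is a random stand-in for a learned network, and all names and the reveal schedule are illustrative assumptions.

```python
import random

MASK = -1  # placeholder id for the absorbing "mask" state

def toy_denoiser(tokens, vocab_size):
    """Stand-in for a learned denoiser: proposes a token id for every
    masked position. A real model would condition jointly on the
    already-revealed text and image tokens."""
    return [random.randrange(vocab_size) if t == MASK else t
            for t in tokens]

def sample_masked_diffusion(seq_len, vocab_size, num_steps, seed=0):
    """Absorbing-state discrete diffusion sampling: begin fully masked,
    then at each step commit predictions for a growing fraction of
    positions, so every position is revealed by the final step."""
    random.seed(seed)
    tokens = [MASK] * seq_len
    order = list(range(seq_len))
    random.shuffle(order)  # order in which positions get revealed
    for step in range(1, num_steps + 1):
        preds = toy_denoiser(tokens, vocab_size)
        # linear reveal schedule: fraction step/num_steps of positions
        n_reveal = (seq_len * step) // num_steps
        for i in order[:n_reveal]:
            tokens[i] = preds[i]
    return tokens
```

Because any subset of positions can be masked and re-predicted, the same loop supports joint multimodal inpainting: fix the known text or image tokens and run the reveal schedule only over the masked region.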

BibTeX

@mastersthesis{Swerdlow-2025-146523,
  author  = {Alexander Swerdlow},
  title   = {Unified Vision-Language Modeling},
  year    = {2025},
  month   = {April},
  school  = {Carnegie Mellon University},
  address = {Pittsburgh, PA},
  number  = {CMU-RI-TR-25-23},
}