In this thesis, we present two approaches to addressing these challenges. First, we explore discrete diffusion models as a unified generative formulation over the joint text and image domain, and we demonstrate their advantages over autoregressive models, including finer control over the quality-diversity trade-off, joint multimodal inpainting, and greater controllability of generation through guidance. Second, we develop a method to jointly train 2D and 3D vision-language models, enabling knowledge transfer from abundant 2D datasets to comparatively data-scarce 3D tasks. By sharing a single architecture across both settings, this approach significantly improves performance on a range of 3D vision-language tasks.
Committee:
Katerina Fragkiadaki (advisor)
Shubham Tulsiani
Ayush Jain
