To support semantic understanding, we introduce VLA-3D and IRef-VLA, a dataset and benchmark for vision-language alignment and referential grounding in 3D scenes. We also propose SORT3D, a method that leverages the reasoning abilities of pretrained vision-language models for 3D tasks. Additionally, we examine how features from 2D foundation models can bootstrap semantic understanding in 3D environments. For geometric reasoning, we highlight the role of camera models in 3D understanding and present VOLNet, a learning-based visual-LiDAR odometry model that demonstrates how multimodal grounding between 2D and 3D can enhance geometric reasoning. Finally, we explore emerging 3D foundation models and their potential to unify and advance these diverse 3D reasoning capabilities. Through comprehensive evaluations, we show that our datasets and methods advance 3D reasoning and help bridge the gap between abstract understanding and real-world physical environments.
Committee:
Prof. Ji Zhang (advisor)
Prof. Shubham Tulsiani
Brian Yang
