Advancing 3D Semantic and Geometric Reasoning
Abstract
To operate effectively in the physical world, AI agents must understand both the structure and meaning of their environments, reasoning in three dimensions about object identity, spatial layout, and affordances. While recent advances in foundation models have enabled strong generalization in language and vision tasks, these systems remain limited in their ability to perform grounded reasoning in 3D space. We address this gap by introducing new datasets, models, and insights for 3D semantic and geometric understanding. We present VLA-3D, a benchmark for vision-language alignment in 3D scenes, and IRef-VLA, a dataset for referential expression resolution in 3D environments. These datasets support fine-grained evaluation of open-vocabulary, spatially grounded language understanding. Building on them, we propose SORT3D, a method that adapts pretrained vision-language and language models for 3D tasks. Without explicit 3D supervision, SORT3D achieves strong results in language-driven segmentation and spatial referencing by leveraging structural context. We further demonstrate that incorporating 2D visual features improves 3D semantic reasoning, especially under partial observations. Additionally, we explore the effects of camera models on 3D geometric understanding and introduce VOLNet, a learning-based visual-LiDAR odometry model that shows how careful geometric representation and supervision can enable robust spatial understanding. Together, these contributions represent a step toward embodied agents and 3D foundation models capable of reasoning and acting within complex, real-world 3D environments.
BibTeX
@mastersthesis{Kachana-2025-146402,
  author   = {Pujith Kachana},
  title    = {Advancing 3D Semantic and Geometric Reasoning},
  year     = {2025},
  month    = {May},
  school   = {Carnegie Mellon University},
  address  = {Pittsburgh, PA},
  number   = {CMU-RI-TR-25-18},
  keywords = {3D Learning, Spatial Intelligence, Embodied AI, Foundation Models},
}