Object-Centric Grounding for Deployable and Interactive Vision-Language Navigation Agents

Master's Thesis, Tech. Report, CMU-RI-TR-25-86, September 2025


Abstract

Robots that operate in human-centric environments must integrate perception, reasoning, and action across multiple modalities to complete tasks according to user instructions. An important capability for these robots is navigating according to a natural language instruction about the environment, which requires 3D spatial reasoning, semantic scene understanding, and the ability to handle vague or ambiguous instructions. Additionally, the diverse and noisy nature of real-world environments motivates the need for vision-language navigation (VLN) systems that are robust and generalize adaptively.

This thesis makes two main contributions toward robust, interactive vision-language robotic systems by focusing on the underlying task of 3D object-centric grounding. First, we introduce IRef-VLA, a large-scale 3D benchmark with millions of referential statements and semantic relations, including imperfect language, to support the evaluation of models for 3D scene understanding. Second, we propose SORT3D, a modular framework for grounding object-referential language in 3D by leveraging large vision and language models, heuristic spatial reasoning, and 2D features, achieving zero-shot generalization to unseen environments and real-time operation on autonomous ground vehicle systems. Furthermore, we explore future directions for interactive, dialogue-enabled vision-language navigation by formulating the problem, examining existing benchmarks, and laying the groundwork for future work in this area. Together, these contributions advance general navigation systems that are capable of semantic scene understanding and of communicating with human users in complex, real-world settings.

BibTeX

@mastersthesis{Zhang-2025-148906,
author = {Haochen Zhang},
title = {Object-Centric Grounding for Deployable and Interactive Vision-Language Navigation Agents},
year = {2025},
month = {September},
school = {Carnegie Mellon University},
address = {Pittsburgh, PA},
number = {CMU-RI-TR-25-86},
keywords = {Vision Language Navigation, 3D Scene Understanding, Vision-Language Grounding},
}

Copyright notice: This material is presented to ensure timely dissemination of scholarly and technical work. Copyright and all rights therein are retained by authors or by other copyright holders. All persons copying this information are expected to adhere to the terms and constraints invoked by each author's copyright. These works may not be reposted without the explicit permission of the copyright holder.