Lowering Barriers to Human-Robot Communication
Abstract
For robots to be intuitive partners in learning and collaboration, they must interpret multimodal cues (visual gestures, natural language instructions, environmental context) and translate them into effective actions. However, existing robot policies typically rely on structured language goals and static visual observations, limiting both the richness of sensory understanding and the flexibility of human task specification. This thesis addresses three core challenges in developing embodied intent-awareness: (i) training policies to capture explicit human intent, (ii) understanding implicit intent, and (iii) building scalable benchmarks for these capabilities.
First, I develop policies that infer explicit human intent, which people commonly express through visual demonstrations and verbal instructions. For visual cues, I extract task semantics from a single video prompt, using cross-attention transformers to guide simulated dish-loading and general manipulation tasks. For verbal instructions, I exploit the hierarchical structure of language for goal specification, decomposing commands into a sequence of interaction points and relative waypoints for sample efficiency.
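As a minimal sketch of this video-prompt conditioning idea (the module names, token shapes, and pooling choice below are illustrative assumptions, not the thesis's actual architecture):

```python
# Sketch (PyTorch): conditioning a policy on a single prompt video via
# cross-attention. All names and shapes here are placeholder assumptions.
import torch
import torch.nn as nn

class VideoConditionedPolicy(nn.Module):
    def __init__(self, dim=256, n_heads=8, action_dim=7):
        super().__init__()
        # Queries come from the robot's current observation tokens;
        # keys/values come from tokens extracted from the prompt video.
        self.cross_attn = nn.MultiheadAttention(dim, n_heads, batch_first=True)
        self.action_head = nn.Sequential(
            nn.LayerNorm(dim), nn.Linear(dim, action_dim)
        )

    def forward(self, obs_tokens, prompt_tokens):
        # obs_tokens:    (B, T_obs, dim)  features of the current scene
        # prompt_tokens: (B, T_vid, dim)  features of the prompt video
        fused, _ = self.cross_attn(obs_tokens, prompt_tokens, prompt_tokens)
        # Pool the fused tokens and predict an action.
        return self.action_head(fused.mean(dim=1))

policy = VideoConditionedPolicy()
action = policy(torch.randn(1, 16, 256), torch.randn(1, 32, 256))
```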
Second, I incorporate implicit social and environmental constraints, such as predicting the robot's noise as heard at a listener's location and adding it to planning costs to minimize disturbance, ensuring comfortable human–robot coexistence.
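A hedged sketch of this cost-augmentation idea follows; predict_noise_db, its distance-attenuation model, and the weight w_noise are all placeholder assumptions, not the thesis's actual noise model:

```python
# Sketch: augmenting a motion planner's cost with predicted noise at the
# listener's position. The noise model and weighting are assumptions.
import numpy as np

def predict_noise_db(action, listener_pos, robot_pos):
    # Placeholder model: larger actions are louder, and loudness
    # attenuates with distance to the listener.
    base_db = 40.0 + 10.0 * np.linalg.norm(action)
    dist = np.linalg.norm(listener_pos - robot_pos)
    return base_db - 20.0 * np.log10(max(dist, 0.1))

def plan_cost(path, actions, listener_pos, w_noise=0.5):
    # Standard path-length cost plus a social (noise) penalty term.
    length = sum(np.linalg.norm(b - a) for a, b in zip(path, path[1:]))
    noise = sum(predict_noise_db(u, listener_pos, p)
                for u, p in zip(actions, path))
    return length + w_noise * noise
```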
Lastly, I look beyond model development toward scalable evaluation benchmarks that expand notions of generalization in robot manipulation and surface open challenges for robotics foundation models. I present ActVQ-Arena, an embodied visual-querying environment where agents must locate fine-grained information (e.g., “find the expiry date” or “check ingredients”) under partial observability, varying poses, and diverse embodiments.
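To make the setup concrete, a hypothetical episode loop is sketched below; the reset/step/answer API is invented purely for illustration and is not the benchmark's actual interface:

```python
# Hypothetical evaluation loop for an embodied visual-querying task in the
# spirit of ActVQ-Arena. The env API shown here is an assumption.
def run_episode(env, agent, query="find the expiry date", max_steps=200):
    obs = env.reset(query=query)          # partial view of the scene
    for _ in range(max_steps):
        action = agent.act(obs, query)    # move, look, or grasp to reveal text
        obs, done = env.step(action)
        if done:                          # agent commits to an answer
            break
    return env.answer()                   # e.g., the located expiry date
```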
Together, these contributions advance robot autonomy by bridging human expressive modalities and practical control, paving the way for future work on multi‐objective pragmatic instruction following, rapid adaptation from multimodal cues, and personalization to individual preferences.
BibTeX
@phdthesis{Jain-2025-148409,
  author   = {Vidhi Jain},
  title    = {Lowering Barriers to Human-Robot Communication},
  year     = {2025},
  month    = {August},
  school   = {Carnegie Mellon University},
  address  = {Pittsburgh, PA},
  number   = {CMU-RI-TR-25-83},
  keywords = {multimodal learning, home robots},
}