Lowering Barriers in Human-Robot Communication
Abstract:
For robots to collaborate naturally in homes, they must interpret diverse forms of human expression, including visual gestures, natural language instructions, and environmental context, and translate them into actions. Existing robot policies typically rely on structured language goals and static visual observations, which restricts both the sensory context and the ways users can specify tasks. In this thesis, I tackle three fundamental challenges to bring robots closer to intuitive, human-centered collaboration: (i) learning to act on explicit intent, (ii) understanding implicit constraints, and (iii) scaling evaluations for intent-aware policies.
First, I show how to train policies that infer explicit human intent from visual or verbal demonstrations. Using cross-attention transformers, we extract task semantics from a single video prompt to guide simulated dish-loading tasks (TTP, CoRL'22) and then general manipulation tasks (Vid2Robot, RSS'24). We also leverage hierarchical language structure for goal specification, decomposing verbal commands into a sequence of interaction points and relative waypoints to enable sample-efficient task execution (SLAP, CoRL'23).
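To make the conditioning mechanism concrete, the sketch below shows the general pattern behind a video-conditioned policy with cross-attention: tokens from the robot's current observation attend to tokens from the prompt video, and the fused features are decoded into an action. The module layout, dimensions, and mean-pooling here are illustrative assumptions, not the published TTP or Vid2Robot architectures.

```python
import torch
import torch.nn as nn

class VideoPromptedPolicy(nn.Module):
    """Toy video-conditioned policy: robot observation tokens
    cross-attend to tokens from a prompt video of the task."""

    def __init__(self, dim=256, n_heads=8, action_dim=7):
        super().__init__()
        self.cross_attn = nn.MultiheadAttention(dim, n_heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)
        self.action_head = nn.Sequential(
            nn.Linear(dim, dim), nn.ReLU(), nn.Linear(dim, action_dim)
        )

    def forward(self, robot_tokens, prompt_tokens):
        # robot_tokens:  (B, N_r, dim) tokens from the current observation
        # prompt_tokens: (B, N_p, dim) tokens from the demonstration video
        fused, _ = self.cross_attn(
            query=robot_tokens, key=prompt_tokens, value=prompt_tokens
        )
        fused = self.norm(robot_tokens + fused)  # residual connection
        # Pool over tokens and predict an action (e.g., end-effector delta)
        return self.action_head(fused.mean(dim=1))

policy = VideoPromptedPolicy()
action = policy(torch.randn(1, 16, 256), torch.randn(1, 64, 256))
print(action.shape)  # torch.Size([1, 7])
```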
Second, I demonstrate how to incorporate implicit social and environmental constraints. By predicting how loud the robot will sound at a listener's location and folding that estimate into planning costs, robots can act quietly, ensuring comfortable human-robot coexistence (ANAVI, CoRL'24).
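The sketch below illustrates this pattern of noise-aware planning in a toy setting: each step pays a penalty proportional to the noise a listener would hear from that cell, so the planner detours around them. The grid world, inverse-square noise model, and weight lam are stand-ins for ANAVI's learned acoustic predictor and actual planner, not its implementation.

```python
import heapq

def plan_quiet_path(grid, start, goal, listener, lam=5.0):
    """Dijkstra over a 2D occupancy grid where each step's cost is
    distance plus a penalty for predicted noise at the listener."""
    rows, cols = len(grid), len(grid[0])

    def noise(cell):
        # Stand-in for a learned predictor: inverse-square falloff,
        # so cells near the listener are "louder" (higher penalty).
        d2 = (cell[0] - listener[0]) ** 2 + (cell[1] - listener[1]) ** 2
        return 1.0 / (1.0 + d2)

    frontier = [(0.0, start, [start])]
    seen = set()
    while frontier:
        cost, cell, path = heapq.heappop(frontier)
        if cell == goal:
            return path
        if cell in seen:
            continue
        seen.add(cell)
        r, c = cell
        for nr, nc in ((r + 1, c), (r - 1, c), (r, c + 1), (r, c - 1)):
            if 0 <= nr < rows and 0 <= nc < cols and grid[nr][nc] == 0:
                step = 1.0 + lam * noise((nr, nc))  # distance + noise penalty
                heapq.heappush(frontier, (cost + step, (nr, nc), path + [(nr, nc)]))
    return None

grid = [[0] * 6 for _ in range(4)]
print(plan_quiet_path(grid, (0, 0), (3, 5), listener=(0, 3)))
```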
Third, I shift the focus beyond model development toward scalable evaluation benchmarks and open challenges for robotic foundation models. Building on my contributions to HomeRobot (CoRL'23), I create ActVQ-Arena, an embodied visual-query environment where agents must locate fine-grained information (e.g., "find the expiry date" or "check the ingredients") on real object scans under partial observability, varying object poses, and diverse embodiments.
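As an illustration of the evaluation setup, a hypothetical episode specification like the one below captures the variables such a benchmark would control per trial. All field names and values here are assumptions for exposition, not the actual ActVQ-Arena API.

```python
from dataclasses import dataclass

@dataclass
class VisualQueryEpisode:
    """Hypothetical episode spec for an embodied visual-query benchmark:
    the agent must move and look until the queried detail is legible."""
    scene_id: str                # scanned environment to load
    object_id: str               # real object scan holding the answer
    query: str                   # e.g., "find the expiry date"
    answer: str                  # ground-truth string for evaluation
    object_pose: tuple           # (x, y, z, yaw), randomized per episode
    embodiment: str = "stretch"  # robot platform under test
    max_steps: int = 500

episode = VisualQueryEpisode(
    scene_id="kitchen_03",
    object_id="cereal_box_12",
    query="check the ingredients",
    answer="oats, sugar, salt",
    object_pose=(1.2, 0.4, 0.9, 90.0),
)
```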
Together, these contributions bridge expressive human modalities and practical robot control, laying the groundwork for multi-objective pragmatic instruction following, rapid multimodal adaptation, and personalized robot assistance.
Thesis Committee Members:
Yonatan Bisk (Chair)
Oliver Kroemer
Henny Admoni
Dieter Fox (University of Washington)
