
PhD Thesis Defense

Friday, March 30
9:00 am to 10:00 am
GHC 4405
Learning Multi-Modal Navigation for Unmanned Ground Vehicles
The event has been postponed.

Abstract:
A robot that operates efficiently in a team with a human in an unstructured outdoor environment must be able to translate commands from a modality that is intuitive to its operator into actions. This capability is especially important as robots become ubiquitous and interact with untrained users. For this to happen, the robot must be able to perceive the world as humans do, so that the nuances of natural language and human perception are appropriately reflected in the actions taken by the robot. Traditionally, this has been done with separate perception, language processing, and planning blocks unified by a grounding system. The grounding system relates abstract symbols in the command to concrete representations in perception that can be placed into a metric or topological map, over which a planner is then executed. These modules are trained separately, often to different performance specifications, and are connected by restrictive interfaces (e.g., point objects with discrete attributes) that ease development and debugging but also limit the kinds of information one module can pass to another.
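
To make the restrictive-interface point concrete, here is a minimal Python sketch of such a modular pipeline. It is not the thesis's actual code: the PointObject type and the function names are hypothetical, and the stubs only illustrate how much perceptual information is discarded when objects cross the module boundaries.

```python
from dataclasses import dataclass

# Hypothetical "point object" interface: each detected object is reduced to a
# metric position plus a few discrete attributes before it crosses the
# boundary between perception, grounding, and planning.
@dataclass
class PointObject:
    x: float          # position in the metric map
    y: float
    category: str     # e.g. "tree", "building"
    color: str        # discrete attribute kept for symbol grounding

def perceive(image) -> list[PointObject]:
    """Perception module: detects objects and discards everything else
    about the scene (texture, terrain, context)."""
    ...

def ground(command: str, objects: list[PointObject]) -> PointObject:
    """Grounding module: resolves a symbol in the command
    (e.g. 'the red building') to one of the point objects."""
    ...

def plan(goal: PointObject, metric_map) -> list[tuple[float, float]]:
    """Planner: produces a path to the grounded goal on the metric map."""
    ...
```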

The tremendous success of deep learning has revolutionized traditional lines of research in computer vision, such as object detection and scene labeling. The latest work goes even further, bringing together state-of-the-art techniques in natural language processing and image understanding in what is called visual question answering, or VQA. Symbol grounding, multi-step reasoning, and comprehension of spatial relations are already elements of these systems, all contained in a single differentiable deep learning architecture, eliminating the need for well-defined interfaces between modules and the simplifying assumptions that go with them.
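
As a rough illustration of what "a single differentiable deep learning architecture" means here, the PyTorch-style sketch below fuses a recurrent question encoding with convolutional image features. The layer sizes, the multiplicative fusion, and the class name are illustrative assumptions, not the architecture presented in the thesis.

```python
import torch
import torch.nn as nn

class VQAStyleFusion(nn.Module):
    """Minimal sketch of one differentiable network that combines language
    and image understanding, in the spirit of VQA models."""

    def __init__(self, vocab_size=1000, embed_dim=128, hidden_dim=256, num_answers=28):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim)
        self.lstm = nn.LSTM(embed_dim, hidden_dim, batch_first=True)
        self.cnn = nn.Sequential(                       # toy image encoder
            nn.Conv2d(3, 32, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(32, hidden_dim, 3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
        )
        self.head = nn.Linear(hidden_dim, num_answers)

    def forward(self, tokens, image):
        _, (h, _) = self.lstm(self.embed(tokens))       # question encoding
        q = h[-1]                                       # (batch, hidden_dim)
        v = self.cnn(image)                             # (batch, hidden_dim)
        return self.head(q * v)                         # simple multiplicative fusion
```

Because every stage is a differentiable tensor operation, gradients flow from the answer loss back through both the language and image encoders, so no hand-designed interface is needed between them.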

Building upon this work, we introduce a technique to transform a natural language command and a static aerial image into a cost map suitable for planning. With this technique, we take a step towards unifying language, perception, and planning in a single, end-to-end trainable system. Further, we propose a synthetic benchmark based upon the CLEVR dataset, which can be used to compare the strengths and weaknesses of the comprehension abilities of various planning algorithms in the context of an unbiased environment with virtually unlimited data. Finally, we propose some extensions to the system as steps towards practical robotics applications.
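
A cost map produced this way can be consumed by any standard grid planner. As an illustrative sketch of that final step (not the planner used in the thesis), the code below runs Dijkstra's algorithm over a per-cell cost map, where the array stands in for the network's output:

```python
import heapq
import numpy as np

def plan_on_cost_map(cost_map: np.ndarray, start, goal):
    """Dijkstra over a per-cell cost map (higher cost = harder to traverse).
    start and goal are (row, col) cells; returns the minimum-cost path."""
    h, w = cost_map.shape
    dist = np.full((h, w), np.inf)
    prev = {}
    dist[start] = 0.0
    frontier = [(0.0, start)]
    while frontier:
        d, cell = heapq.heappop(frontier)
        if cell == goal:
            break
        if d > dist[cell]:              # stale queue entry
            continue
        r, c = cell
        for nr, nc in ((r + 1, c), (r - 1, c), (r, c + 1), (r, c - 1)):
            if 0 <= nr < h and 0 <= nc < w:
                nd = d + cost_map[nr, nc]
                if nd < dist[nr, nc]:
                    dist[nr, nc] = nd
                    prev[(nr, nc)] = cell
                    heapq.heappush(frontier, (nd, (nr, nc)))
    # Walk predecessors back from the goal (raises KeyError if unreachable).
    path, cell = [goal], goal
    while cell != start:
        cell = prev[cell]
        path.append(cell)
    return path[::-1]
```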


Thesis Committee Members:
Martial Hebert, Chair
Kris Kitani
Jean Oh
Junsong Yuan, State University of New York at Buffalo