
A Generalized Model for Multimodal Perception

Sz-Rung Shiang, Anatole Gershman, and Jean Hyaejin Oh
Conference Paper, Proceedings of AAAI '17 Fall Symposium, November, 2017

Abstract

In order for autonomous robots and humans to effectively collaborate on a task, robots need to be able to perceive their environments in a way that is accurate and consistent with their human teammates. To develop such cohesive perception, robots further need to be able to digest human teammates’ descriptions of an environment and combine those with what they have perceived through computer vision systems. In this context, we develop a graphical model for fusing object recognition results from two different modalities: computer vision and verbal descriptions. In this paper, we specifically focus on three types of verbal descriptions, namely, egocentric positions, relative positions using a landmark, and numeric constraints. We develop a Conditional Random Field (CRF) based approach to fuse visual and verbal modalities, modeling n-ary relations (or descriptions) as factor functions. We hypothesize that human descriptions of an environment will improve a robot’s recognition if the information can be properly fused. To verify our hypothesis, we apply our model to the object recognition problem and evaluate our approach on the NYU Depth V2 and Visual Genome datasets. We report results from sets of experiments demonstrating the significant advantage of multimodal perception, and discuss potential real-world applications of our approach.
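To make the fusion idea concrete, the following is a minimal sketch, not the authors' implementation: vision detections supply unary label scores, and a verbal description ("there is a chair on the left") is encoded as an additional factor that rewards consistent assignments. The label set, detection names, and weights are illustrative assumptions; the paper's CRF would use proper inference over larger factor graphs.

```python
# Minimal sketch of fusing vision scores with a verbal-description factor.
# All names, scores, and weights below are hypothetical illustrations.
import itertools
import numpy as np

LABELS = ["chair", "table", "background"]

# Unary (vision) factors: per-detection label scores from a vision system.
vision_scores = {
    "det_0": np.array([2.0, 0.5, 0.1]),   # leans toward "chair"
    "det_1": np.array([0.8, 1.5, 0.2]),   # leans toward "table"
}

# Verbal factor for an egocentric description such as
# "there is a chair on the left": reward assignments in which a
# detection on the left side of the image is labeled "chair".
detection_is_left = {"det_0": True, "det_1": False}
VERBAL_WEIGHT = 1.0

def verbal_factor(assignment):
    score = 0.0
    for det, label in assignment.items():
        if detection_is_left[det] and label == "chair":
            score += VERBAL_WEIGHT
    return score

def joint_score(assignment):
    # Sum of vision unary scores plus the verbal-description factor.
    score = sum(vision_scores[det][LABELS.index(label)]
                for det, label in assignment.items())
    return score + verbal_factor(assignment)

# Exhaustive MAP inference, which is feasible only at this toy scale.
best = max(
    (dict(zip(vision_scores, combo))
     for combo in itertools.product(LABELS, repeat=len(vision_scores))),
    key=joint_score,
)
print(best)  # e.g. {'det_0': 'chair', 'det_1': 'table'}
```

In this sketch the verbal description acts exactly like any other factor function: it shifts the joint score toward assignments that agree with the human teammate's statement without overriding strong visual evidence.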

BibTeX

@conference{Oh-2017-103004,
author = {Sz-Rung Shiang and Anatole Gershman and Jean Hyaejin Oh},
title = {A Generalized Model for Multimodal Perception},
booktitle = {Proceedings of AAAI '17 Fall Symposium},
year = {2017},
month = {November},
keywords = {multimodal perception, vision-language, conditional random fields, object recognition},
}