A Generalized Model for Multimodal Perception

Sz-Rung Shiang, Anatole Gershman and Jean Hyaejin Oh
Conference Paper, AAAI Fall Symposium, November, 2017


In order for autonomous robots and humans to effectively collaborate on a task, robots need to be able to perceive their environments in a way that is accurate and consistent with their human teammates. To develop such cohesive perception, robots further need to be able to digest human teammates' descriptions of an environment and combine those with what they have perceived through computer vision systems. In this context, we develop a graphical model for fusing object recognition results from two different modalities: computer vision and verbal descriptions. In this paper, we specifically focus on three types of verbal descriptions, namely, egocentric positions, relative positions using a landmark, and numeric constraints. We develop a Conditional Random Fields (CRF) based approach to fuse the visual and verbal modalities, where we model n-ary relations (or descriptions) as factor functions. We hypothesize that human descriptions of an environment will improve a robot's recognition if the information can be properly fused. To verify our hypothesis, we apply our model to the object recognition problem and evaluate our approach on the NYU Depth V2 and Visual Genome datasets. We report the results of a set of experiments demonstrating the significant advantage of multimodal perception, and discuss potential real-world applications of our approach.
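To make the fusion idea concrete, the following is a minimal sketch of the kind of CRF-style combination the abstract describes; the label space, potential values, and the two verbal factor functions (an egocentric-position factor and a numeric-constraint factor) are all hypothetical illustrations, not the paper's actual model or parameters.

```python
# Minimal CRF-style fusion sketch (hypothetical potentials, not the paper's
# exact model): each image region gets a unary score from a vision system,
# and verbal descriptions enter as factor functions over label assignments.

# Hypothetical vision confidences P(label | region) for two regions.
vision_unary = [
    {"cup": 0.6, "bowl": 0.4},   # region 0
    {"cup": 0.3, "table": 0.7},  # region 1
]

def egocentric_factor(labels, description):
    """Verbal factor for e.g. 'there is a cup on the left';
    region 0 is assumed to lie on the speaker's left (illustrative only)."""
    target, region = description
    return 2.0 if labels[region] == target else 1.0

def numeric_factor(labels, description):
    """Verbal factor for a count constraint, e.g. 'there is exactly one cup'."""
    target, count = description
    return 2.0 if sum(l == target for l in labels) == count else 0.5

def score(labels):
    """Unnormalized CRF score: product of unary potentials and verbal factors."""
    s = 1.0
    for region, label in enumerate(labels):
        s *= vision_unary[region][label]
    s *= egocentric_factor(labels, ("cup", 0))   # "a cup on the left"
    s *= numeric_factor(labels, ("cup", 1))      # "exactly one cup"
    return s

# Exhaustive MAP inference over the tiny joint label space; a real system
# would use message passing or another approximate-inference method.
candidates = [(a, b) for a in vision_unary[0] for b in vision_unary[1]]
best = max(candidates, key=score)
print(best)  # the verbal constraints favor labeling region 1 as "table"
```

Note how the numeric constraint overrides the vision prior alone: without it, assigning "cup" to both regions would be competitive, but the "exactly one cup" factor penalizes that joint assignment.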

BibTeX Reference

@inproceedings{shiang2017generalized,
  author    = {Sz-Rung Shiang and Anatole Gershman and Jean Hyaejin Oh},
  title     = {A Generalized Model for Multimodal Perception},
  booktitle = {AAAI Fall Symposium},
  year      = {2017},
  month     = {November},
  keywords  = {multimodal perception, vision-language, conditional random fields, object recognition},
}