Vision-Language Fusion for Object Recognition

Sz-Rung Shiang, Stephanie Rosenthal, Anatole Gershman, Jaime Carbonell and Jean Hyaejin Oh
Conference Paper, Proceedings of the Thirty-First AAAI Conference on Artificial Intelligence (AAAI), February 2017

Copyright notice: This material is presented to ensure timely dissemination of scholarly and technical work. Copyright and all rights therein are retained by authors or by other copyright holders. All persons copying this information are expected to adhere to the terms and constraints invoked by each author’s copyright. These works may not be reposted without the explicit permission of the copyright holder.


While recent advances in computer vision have caused object recognition rates to spike, there is still much room for improvement. In this paper, we develop an algorithm to improve object recognition by integrating human-generated contextual information with vision algorithms. Specifically, we examine how interactive systems such as robots can utilize two types of context information: verbal descriptions of an environment and human-labeled datasets. We propose a re-ranking schema, MultiRank, for object recognition that can efficiently combine such information with the computer vision results. In our experiments, we achieve up to 9.4% and 16.6% accuracy improvements over the vision-only recognizers using the oracle and the detected bounding boxes, respectively. We conclude that our algorithm has the ability to make a significant impact on object recognition in robotics and beyond.

Associated Lab - 3D Vision and Intelligent Systems Group, Associated Lab - BYOB Intelligence Group, Associated Project - WebMate

BibTeX Reference
author = {Sz-Rung Shiang and Stephanie Rosenthal and Anatole Gershman and Jaime Carbonell and Jean Hyaejin Oh},
title = {Vision-Language Fusion for Object Recognition},
booktitle = {Proceedings of the Thirty-First AAAI Conference on Artificial Intelligence (AAAI)},
year = {2017},
month = {February},
publisher = {AAAI},
keywords = {vision-language, multimodal perception, random walk},