Leveraging Inexpensive Supervision Signals for Visual Learning

Senthil Purushwalkam Shiva Prakash
Master's Thesis, Tech. Report, CMU-RI-TR-17-13, Robotics Institute, Carnegie Mellon University, May, 2017

The success of deep learning-based methods for computer vision comes at a cost: most deep neural network models require a large corpus of annotated data for supervision. Obtaining such data is often time-consuming and expensive; for example, collecting a single bounding box annotation takes 26-42 seconds. This requirement hinders extending these methods to novel domains. In this thesis, we explore techniques for leveraging inexpensive forms of supervision for visual learning. Specifically, we first propose an approach to learn a pose-encoding visual representation from videos of human actions without any human supervision. We show that the learned representation improves performance on pose estimation and action recognition tasks compared to randomly initialized models. Next, we propose an approach that uses freely available web data and inexpensive image-level labels to learn object detectors. We show that web data, while highly noisy and biased, can be used effectively to improve object localization in the weakly supervised setting.

BibTeX Reference

@mastersthesis{purushwalkamshivaprakash2017,
  author   = {Senthil Purushwalkam Shiva Prakash},
  title    = {Leveraging Inexpensive Supervision Signals for Visual Learning},
  year     = {2017},
  month    = {May},
  school   = {Carnegie Mellon University},
  address  = {Pittsburgh, PA},
  number   = {CMU-RI-TR-17-13},
  keywords = {Unsupervised, Weakly Supervised, Object Detection, Pose Estimation, Action Recognition},
}