Visual Learning with Minimal Human Supervision - The Robotics Institute Carnegie Mellon University


PhD Thesis, Tech. Report, CMU-RI-TR-18-40, Robotics Institute, Carnegie Mellon University, August, 2018


Machine learning models have led to remarkable progress in visual recognition. A key driving factor for this progress is the abundance of labeled data. Over the years, researchers have spent considerable effort curating visual data and carefully labeling it. Moving forward, however, it seems impossible to annotate the vast amounts of visual data with everything we care about. This reliance on exhaustive labeling is a key limitation to the rapid deployment of computer vision systems in the real world. Our current systems also scale poorly to large numbers of concepts and are passively spoon-fed supervision and data.

In this thesis, we explore methods that enable visual learning without exhaustive supervision. Our core idea is to build the natural regularity and repetition of the visual world into our learning algorithms as an inductive bias. We observe recurring patterns in the visual world: a person always lifts their foot before taking a step; dogs are more similar to other furry creatures than to furniture. This natural regularity in visual data also imposes regularities on the semantic tasks and models that operate on it: a dog classifier must be more similar to classifiers of furry animals than to furniture classifiers. We exploit this abundant natural structure, or `supervision', in the visual world in the form of self-supervision for our models, modeling relationships between tasks and labels, and modeling relationships in the space of classifiers. We show the effectiveness of these methods on both static images and videos across varied tasks such as image classification, object detection, action recognition, and human pose estimation. However, all of these methods are still passively fed supervision and thus lack agency: the ability to decide what information they need and how to get it. To this end, we propose interactive learners that ask for supervision when needed and decide which samples they want to learn from.
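One concrete form of the "free" supervision described above is temporal order in video: frames of a real video arrive in a regular order (the foot lifts before the step), so a model can be trained to verify whether a frame sequence is temporally ordered, with labels derived from the video itself rather than from human annotators. The sketch below illustrates this kind of self-supervised label generation; the function name and setup are illustrative assumptions for exposition, not the thesis's exact pipeline.

```python
import random


def make_order_verification_sample(frames, shuffle_prob=0.5, rng=random):
    """Create one (sequence, label) pair for temporal order verification.

    frames: a list of distinct frame identifiers in their true temporal order.
    Returns (sequence, label), where label is 1 if the sequence preserves
    temporal order and 0 if it was shuffled. The label comes for free from
    the structure of the video, so no human annotation is required.
    """
    seq = list(frames)
    if rng.random() < shuffle_prob:
        # Shuffle until the order actually differs from the original
        # (assumes the frame identifiers are distinct).
        while seq == list(frames):
            rng.shuffle(seq)
        return seq, 0  # negative sample: temporally scrambled
    return seq, 1  # positive sample: true temporal order


# Example: build a tiny self-labeled training set from one clip.
rng = random.Random(0)
clip = ["frame0", "frame1", "frame2", "frame3"]
dataset = [make_order_verification_sample(clip, rng=rng) for _ in range(8)]
```

Each `(sequence, label)` pair can then train an ordinary binary classifier; the representation it learns in the process is the actual product of interest, later transferred to tasks such as action recognition or pose estimation.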


@phdthesis{misra2018visual,
  author = {Ishan Misra},
  title = {Visual Learning with Minimal Human Supervision},
  year = {2018},
  month = {August},
  school = {Carnegie Mellon University},
  address = {Pittsburgh, PA},
  number = {CMU-RI-TR-18-40},
  keywords = {computer vision; machine learning; semi-supervised learning; self-supervised learning; unsupervised learning; noisy labels; interactive learning; image tagging; object recognition; visual recognition; natural language processing},
}