Visual Learning with Minimal Human Supervision

Ishan Misra
PhD Thesis, Tech. Report, CMU-RI-TR-18-40, August, 2018

Copyright notice: This material is presented to ensure timely dissemination of scholarly and technical work. Copyright and all rights therein are retained by authors or by other copyright holders. All persons copying this information are expected to adhere to the terms and constraints invoked by each author's copyright. These works may not be reposted without the explicit permission of the copyright holder.


Machine learning models have led to remarkable progress in visual recognition. A key driving factor for this progress is the abundance of labeled data. Over the years, researchers have spent a lot of effort curating visual data and carefully labeling it. However, moving forward, it seems impossible to annotate the vast amounts of visual data with everything that we care about. This reliance on exhaustive labeling is a key limitation in the rapid deployment of computer vision systems in the real world. Our current systems also scale poorly to the large number of concepts and are passively spoon-fed supervision and data.

In this thesis, we explore methods that enable visual learning without exhaustive supervision. Our core idea is to model the natural regularity and repetition in the visual world in our learning algorithms as their inductive bias. We observe recurring patterns in the visual world – a person always lifts their foot before taking a step, dogs are more similar to other furry creatures than to furniture, etc. This natural regularity in visual data also imposes regularities on the semantic tasks and models that operate on it – a dog classifier must be more similar to classifiers of furry animals than to furniture classifiers. We exploit this abundant natural structure or 'supervision' in the visual world in the form of self-supervision for our models, modeling relationships between tasks and labels, and modeling relationships in the space of classifiers. We show the effectiveness of these methods on both static images and videos across varied tasks such as image classification, object detection, action recognition, and human pose estimation. However, all these methods are still passively fed supervision and thus lack agency: the ability to decide what information they need and how to get it. To this end, we propose interactive learners that ask for supervision when needed and can also decide what samples they want to learn from.

author = {Ishan Misra},
title = {Visual Learning with Minimal Human Supervision},
year = {2018},
month = {August},
school = {Carnegie Mellon University},
address = {Pittsburgh, PA},
number = {CMU-RI-TR-18-40},
keywords = {computer vision; machine learning; semi-supervised learning; self-supervised learning; unsupervised learning; noisy labels; interactive learning; image tagging; object recognition; visual recognition; natural language processing},
}