Visual Representation and Recognition without Human Supervision

Senthil Purushwalkam
PhD Thesis, Tech. Report CMU-RI-TR-22-21, Robotics Institute, Carnegie Mellon University, May 2022

Abstract

The advent of deep learning based artificial perception models has revolutionized the field of computer vision. These methods take advantage of the ever-growing computational capacity of machines and the abundance of human-annotated data to build supervised learners for a wide range of visual tasks. However, the reliance on human-annotated data is also a bottleneck for the scalability and generalizability of these methods. We argue that in order to build more general learners (akin to an infant), it is crucial to develop methods that learn without human supervision. In this thesis, we present our research on minimizing the role of human supervision for two key problems: Representation and Recognition.

Recent self-supervised representation learning (SSL) methods have demonstrated impressive generalization capabilities on numerous downstream tasks. In this thesis, we investigate these approaches and demonstrate that they still heavily rely on the availability of clean, curated and structured datasets. We experimentally demonstrate that these learning capabilities fail to extend to data collected ``in-the-wild'' and hence expose the need for better benchmarks in self-supervised learning. We also propose novel SSL approaches that minimize this dependence on curated data.
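The abstract does not specify which SSL objective is used; as context for readers unfamiliar with the family of methods it refers to, the following is a minimal sketch of a contrastive objective (an InfoNCE-style loss, as popularized by methods like SimCLR), which learns representations by pulling embeddings of two augmented views of the same image together while pushing apart embeddings of different images. The function name and setup here are illustrative, not taken from the thesis.

```python
import numpy as np

def info_nce_loss(z1, z2, temperature=0.1):
    """Contrastive InfoNCE loss over two batches of embeddings.

    z1[i] and z2[i] are embeddings of two augmented views of the same
    image; every other pairing in the batch serves as a negative.
    This is a generic sketch, not the thesis's specific objective.
    """
    # L2-normalize so dot products are cosine similarities.
    z1 = z1 / np.linalg.norm(z1, axis=1, keepdims=True)
    z2 = z2 / np.linalg.norm(z2, axis=1, keepdims=True)
    logits = z1 @ z2.T / temperature  # (N, N) similarity matrix
    # Row i's positive is column i; the rest of the row are negatives.
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return -np.mean(np.diag(log_probs))

rng = np.random.default_rng(0)
views = rng.normal(size=(8, 32))
# Loss is low when the two views are nearly identical embeddings,
# and higher when the second batch is unrelated noise.
aligned_loss = info_nce_loss(views, views + 0.01 * rng.normal(size=(8, 32)))
random_loss = info_nce_loss(views, rng.normal(size=(8, 32)))
```

Note that the loss depends only on image-level augmentation pairs, which is part of why such objectives implicitly assume a curated, object-centric dataset: each image is treated as a single coherent "instance".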

Since exhaustively collecting annotations for all visual concepts is infeasible, methods that generalize beyond the available supervision are crucial for building scalable recognition models. We present a novel neural network architecture that takes advantage of the compositional nature of visual concepts to construct image classifiers for unseen concepts. For domains where collecting dense annotations is infeasible, we present an ``understanding via associations'' paradigm, which reformulates the recognition problem as the identification of correspondences. We apply this to videos and show that a video can be densely described by identifying spatio-temporal correspondences to other similar videos. Finally, to explore the human ability to generalize beyond semantic categories, we introduce the ``Functional Correspondence Problem'' and demonstrate that representations that encode functional properties of objects can be used to recognize novel objects more efficiently.
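The ``understanding via associations'' paradigm can be illustrated with a toy sketch: instead of training a classifier per concept, each query patch is described by transferring the label of its nearest labeled reference patch in a learned feature space. This is a simplified illustration of the general idea, assuming patch features are already extracted; it is not the thesis's actual model.

```python
import numpy as np

def transfer_labels(query_feats, ref_feats, ref_labels):
    """Label each query patch via its nearest reference patch.

    query_feats: (num_query, d) features of unlabeled patches.
    ref_feats:   (num_ref, d) features of densely labeled patches.
    ref_labels:  (num_ref,) labels of the reference patches.
    Returns the transferred label for each query patch.
    """
    # Cosine similarity between every query and reference patch.
    q = query_feats / np.linalg.norm(query_feats, axis=1, keepdims=True)
    r = ref_feats / np.linalg.norm(ref_feats, axis=1, keepdims=True)
    sim = q @ r.T                    # (num_query, num_ref)
    nearest = sim.argmax(axis=1)     # index of best-matching reference
    return ref_labels[nearest]

# Two reference patches with distinct features and labels; queries
# near each cluster should inherit the corresponding label.
refs = np.array([[1.0, 0.0], [0.0, 1.0]])
labels = np.array([0, 1])
queries = np.array([[0.9, 0.1], [0.2, 0.8]])
transferred = transfer_labels(queries, refs, labels)
```

The appeal of this formulation is that describing a new video requires no new annotation: dense labels propagate from whichever reference videos the correspondences land on.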

BibTeX

@phdthesis{Purushwalkam-2022-131658,
author = {Senthil Purushwalkam},
title = {Visual Representation and Recognition without Human Supervision},
year = {2022},
month = {May},
school = {Carnegie Mellon University},
address = {Pittsburgh, PA},
number = {CMU-RI-TR-22-21},
keywords = {representation; recognition; self-supervised; zero-shot; functional; correspondence; invariances;},
}