Bridging the Gap Between Human Vision and Computer Vision - Robotics Institute Carnegie Mellon University

Nadine Chang
PhD Thesis, Tech. Report, CMU-RI-TR-23-05, May, 2023

Abstract

Computer vision models have proven tremendously capable of recognizing and detecting many real-world objects, such as cars, people, and pets. However, the best-performing classes are those with abundant examples in today's large-scale datasets, and obscured or small objects remain challenging. In short, computer vision still falls far short of its gold standard, human perception. Humans can learn novel categories quickly regardless of the amount of data and can classify objects that are far away, obscured, or small. This thesis aims to bridge the gap between human and computer vision in two main parts.

In the first part of this thesis, we focus on closing the gap between human and computer vision by improving computer vision models’ performance on datasets with real-world data distributions. Because real-world object distributions are often imbalanced, with some categories seen frequently and others seen rarely, models struggle to perform well on underrepresented classes. In contrast, humans are remarkably good at learning new objects even when they are rarely seen. Thus, we aim to improve standard vision tasks on long-tailed datasets, which resemble a real-world distribution. Our first approach starts with the visual classification task, where we aim to increase performance on rarer classes. In this work, we create new, stronger classifiers for rarer classes by leveraging the representations and classifiers learnt for common classes. Our simple method can be applied on top of any existing set of classifiers, showing that learning better classifiers does not require extensive or complicated approaches. Our second approach ventures into visual detection and segmentation, where the additional localization task makes it difficult to train better detectors for rare classes. We take a closer look at the basic resampling approach widely used in detection on long-tailed datasets. Notably, we show that this fundamental resampling strategy can be improved by resampling not only whole images but also individual objects.
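To make the image-level resampling baseline mentioned above concrete, here is a minimal sketch of repeat-factor sampling, a commonly used image-level scheme for long-tailed detection datasets. It is an illustration under stated assumptions, not the thesis's method: the function name `repeat_factors`, the `threshold` parameter, and the toy labels are hypothetical, and the object-level resampling the thesis proposes is not implemented here.

```python
import math
from collections import Counter


def repeat_factors(image_labels, threshold=0.5):
    """Return one repeat factor per image (illustrative sketch only).

    image_labels: list of sets, each holding the category names in one image.
    threshold: hypothetical frequency cutoff; categories appearing in a
        smaller fraction of images than this get oversampled.
    """
    num_images = len(image_labels)
    # Count how many images contain each category at least once.
    image_count = Counter(c for labels in image_labels for c in set(labels))
    # Category-level factor: max(1, sqrt(threshold / image_frequency)).
    category_rf = {
        c: max(1.0, math.sqrt(threshold / (n / num_images)))
        for c, n in image_count.items()
    }
    # Image-level factor: the largest factor among the image's categories.
    return [
        max((category_rf[c] for c in labels), default=1.0)
        for labels in image_labels
    ]


# Toy data: "car" appears in every image, "okapi" in only one.
example = [{"car"}, {"car"}, {"car"}, {"car", "okapi"}]
factors = repeat_factors(example)
print(factors)
```

In this toy example only the image containing the rare category receives a factor above 1, so it would be repeated more often per epoch. Note that the frequent category in that image is repeated along with it, which is a limitation of resampling whole images.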

Successful real-world models depend heavily on the quality of training and testing data. In part two of this thesis, we close the gap between human and computer vision by developing a large-scale neuroimaging dataset and by identifying and exploring a major challenge facing visual dataset curation. First, we build the first large-scale visual fMRI dataset, BOLD5000. In an effort to bridge the gap between computer vision and human vision, we design the dataset around 5,000 images taken from computer vision benchmark datasets. Through this effort, we identified a crucial and time-consuming component of dataset curation: creating labeling instructions for annotators and participants. Labeling instructions for a typical visual dataset include detailed definitions and visual category examples provided to annotators. Through both text descriptions and visual exemplars, these instructions provide thorough, high-quality category definitions. Unfortunately, current datasets typically do not release their labeling instructions (LIs). We introduce a new task, labeling instruction generation, to reverse engineer LIs from existing datasets. Our method leverages existing large vision and language models (VLMs) to generate LIs that provide visually meaningful exemplars, and it significantly outperforms all baselines on image retrieval.

BibTeX

@phdthesis{Chang-2023-135812,
author = {Nadine Chang},
title = {Bridging the Gap Between Human Vision and Computer Vision},
year = {2023},
month = {May},
school = {Carnegie Mellon University},
address = {Pittsburgh, PA},
number = {CMU-RI-TR-23-05},
keywords = {long tail, detection, VLM, dataset curation, human vision, fMRI},
}