PhD Thesis Proposal
Carnegie Mellon University
1:00 pm - 2:00 pm
Computer vision today excels at recognition in narrow slices of the real world. Our systems seem to accurately detect cats, cars, or chairs, but largely ignore the vast diversity of objects in the world that are absent from our training datasets. Perception in the open world, however, requires detecting and tracking any object, regardless of its name. Such an approach can serve as a fundamental building block for downstream applications: from recognizing actions, like picking something up, to navigating around obstacles. Unfortunately, current methods for generic object recognition lag far behind closed-world methods that only recognize a few object categories. This thesis focuses on the challenges of building accurate, reliable models for the open world, often by leveraging recent advances in closed-world methods.
We first present an approach for detecting any moving object. To do this, we learn from synthetic data to group pixels that move together, and learn a generic model of object appearance from large image datasets. Next, we build a single-object tracker for detecting any object specified by a user. We show that existing improvements in models and data for class-specific detection can be repurposed for generic tracking, leading to significant improvements over prior work. Finally, we design a benchmark for measuring a recurring challenge in this work: the brittleness of models under small changes in their inputs. We construct a dataset of 3,000 human-reviewed sets of real images with minor differences, and show that virtually all current models display surprising sensitivity to these differences, presenting an open challenge for building robust perception systems.
Moving forward, we propose to build a new, diverse dataset for measuring progress in detection and tracking. Current video object detection datasets span a limited set of objects and environments, focusing largely on people, vehicles, and animals. We aim to collect and annotate a large-scale evaluation dataset of varied scenes where all moving objects are annotated. Further, we propose three directions for building more powerful approaches for such data: learning to track objects with only image supervision, replacing tracking heuristics with policies learned on real and synthetic data, and incorporating forecasting for improved tracking.
Thesis Committee Members
Deva Ramanan, Chair
Ross Girshick, Facebook AI Research
Cordelia Schmid, INRIA