Amodal Visual Scene Representations With and Without Geometry

Adam W. Harley
PhD Thesis, Tech. Report, CMU-RI-TR-22-13, Robotics Institute, Carnegie Mellon University, May, 2022

Abstract

Most computer vision models in deployment today describe the pixels of images. This does not suffice, because images are only projections of the scene in front of the camera. In this thesis we build representations that attempt to describe the scene itself. We call these representations "amodal" (i.e., without modality), emphasizing the fact that they describe elements of the scene for which we have no sensory input.

We present two methods for amodal visual scene representation. The first focuses on modeling space, and proposes geometry-based methods for lifting images into 3D maps, where the objects are complete, despite partial occlusions in the imagery. We show that this representation allows for self-supervised learning from multi-view data, and yields state-of-the-art results as a perception system for autonomous vehicles, where the goal is to estimate a "bird's eye view" semantic map from multiple sensors. The second method focuses on modeling time, and proposes geometry-free methods for tracking image elements through partial and full occlusions across a video. Using learned temporal priors and within-inference optimization, we show that our model can track points across occlusions, and outperform flow-based and feature-matching methods on fine-grained multi-frame correspondence tasks.
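The "lifting" step described above can be illustrated with a minimal sketch, not taken from the thesis itself: fill a 3D voxel grid with 2D image features by projecting each voxel center into the image with pinhole intrinsics `K` and sampling the feature map (nearest-neighbor here, for brevity; the function name, grid parameterization, and sampling scheme are illustrative assumptions).

```python
import numpy as np

def lift_to_voxels(feats, K, grid_min, grid_max, res):
    """Project each voxel center into the image via pinhole intrinsics K
    and copy the nearest image feature into that voxel.

    feats: (C, H, W) image feature map.
    K: (3, 3) camera intrinsics.
    grid_min, grid_max: (x, y, z) extents of the voxel grid, camera frame.
    res: (rx, ry, rz) voxel counts per axis.
    """
    C, H, W = feats.shape
    # Voxel center coordinates in the camera frame.
    xs = np.linspace(grid_min[0], grid_max[0], res[0])
    ys = np.linspace(grid_min[1], grid_max[1], res[1])
    zs = np.linspace(grid_min[2], grid_max[2], res[2])
    X, Y, Z = np.meshgrid(xs, ys, zs, indexing="ij")
    pts = np.stack([X, Y, Z], axis=-1).reshape(-1, 3)  # (N, 3)
    # Pinhole projection: u = fx * x/z + cx, v = fy * y/z + cy.
    z = np.clip(pts[:, 2], 1e-3, None)
    uv = (K @ (pts.T / z)).T[:, :2]
    u = np.round(uv[:, 0]).astype(int)
    v = np.round(uv[:, 1]).astype(int)
    # Keep only voxels in front of the camera that land inside the image.
    inb = (pts[:, 2] > 1e-3) & (u >= 0) & (u < W) & (v >= 0) & (v < H)
    vox = np.zeros((C,) + tuple(res), dtype=feats.dtype)
    vox.reshape(C, -1)[:, inb] = feats[:, v[inb], u[inb]]
    return vox
```

Because multiple cameras can be unprojected into the same grid, a representation like this supports fusing multi-view observations into one scene-centric map, which is the setting the abstract's self-supervised multi-view learning refers to.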

BibTeX

@phdthesis{Harley-2022-131681,
author = {Adam W. Harley},
title = {Amodal Visual Scene Representations With and Without Geometry},
year = {2022},
month = {May},
school = {Carnegie Mellon University},
address = {Pittsburgh, PA},
number = {CMU-RI-TR-22-13},
keywords = {amodal; completion; representation learning; geometry-based methods; trajectories; tracking},
}