
Data-Driven Visual Forecasting

PhD Thesis, Tech. Report CMU-RI-TR-18-12, Robotics Institute, Carnegie Mellon University, April 2018

Abstract

Understanding the temporal dimension of images is a fundamental part of computer vision.
Humans are able to interpret how the entities in an image will change over time. However,
only relatively recently have researchers focused on visual forecasting: getting machines
to anticipate events in the visual world before they actually happen. This
aspect of vision has many practical implications for tasks ranging from human-computer
interaction to anomaly detection. In addition, temporal prediction can serve as a task for
representation learning, useful for various other recognition problems.

In this thesis, we focus on data-driven, self-supervised visual forecasting that relies on
little or no explicit semantic information. Towards this goal, we explore prediction at
different time frames. We first consider predicting instantaneous pixel motion, namely
optical flow. We apply convolutional neural networks to predict optical flow in static
images. We then extend this idea to a longer time frame, generalizing to pixel trajectory
prediction in space-time. We incorporate models such as variational autoencoders to
generate possible future motions in the scene. After this, we consider a mid-level element
approach to forecasting.
By combining a Markovian reasoning framework with an intermediate representation, we
are able to forecast events over longer timescales.
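As a rough illustration of the conditional variational autoencoder idea, the sketch below
pairs an image encoder with a recognition network and a flow decoder in PyTorch. It is a
minimal sketch only: the MotionCVAE name, the layer sizes, and the dense two-channel flow
output are illustrative assumptions, not the architecture described in the thesis.

import torch
import torch.nn as nn
import torch.nn.functional as F

class MotionCVAE(nn.Module):
    """Conditional VAE: given a static image, sample a plausible future motion.

    Shapes and layer sizes are illustrative, not the thesis architecture.
    """
    def __init__(self, z_dim=16):
        super().__init__()
        # Image encoder: static RGB frame -> conditioning features.
        self.img_enc = nn.Sequential(
            nn.Conv2d(3, 32, 4, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(32, 64, 4, stride=2, padding=1), nn.ReLU(),
        )
        # Recognition network q(z | image, motion): used only at train time.
        self.post_enc = nn.Sequential(
            nn.Conv2d(64 + 2, 64, 3, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
        )
        self.to_mu = nn.Linear(64, z_dim)
        self.to_logvar = nn.Linear(64, z_dim)
        # Decoder p(motion | image, z): predicts a dense 2-channel flow field.
        self.dec = nn.Sequential(
            nn.Conv2d(64 + z_dim, 64, 3, padding=1), nn.ReLU(),
            nn.ConvTranspose2d(64, 32, 4, stride=2, padding=1), nn.ReLU(),
            nn.ConvTranspose2d(32, 2, 4, stride=2, padding=1),
        )

    def forward(self, image, flow):
        feats = self.img_enc(image)                       # (B, 64, H/4, W/4)
        flow_small = F.interpolate(flow, size=feats.shape[-2:])
        h = self.post_enc(torch.cat([feats, flow_small], dim=1))
        mu, logvar = self.to_mu(h), self.to_logvar(h)
        # Reparameterization trick: z = mu + sigma * eps.
        z = mu + torch.randn_like(mu) * torch.exp(0.5 * logvar)
        z_map = z[:, :, None, None].expand(-1, -1, *feats.shape[-2:])
        pred = self.dec(torch.cat([feats, z_map], dim=1))
        # Evidence lower bound: reconstruction term plus KL regularizer.
        recon = F.mse_loss(pred, flow)
        kl = -0.5 * torch.mean(1 + logvar - mu.pow(2) - logvar.exp())
        return pred, recon + kl

At test time one would drop the recognition network and decode samples z ~ N(0, I)
against the image features, so the same static scene yields multiple plausible motions.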

This dissertation then builds upon these ideas towards structured representations for
visual forecasting. Specifically, we aim to reason about the future of images in a structured
state space. Instead of directly predicting events in a low-level feature space such as pixels or
motion, we forecast events in a higher level representation that is still visually meaningful.
This approach confers a number of advantages. It is not restricted by explicit timescales like
motion-based approaches, and, unlike direct pixel-based approaches, predictions are less
likely to “fall off” the manifold of the true visual world.
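To make the structured state-space idea concrete, here is a minimal encode/predict/render
loop, again in PyTorch. The StateSpaceForecaster name, the GRU transition, and the 32x32
renderer are hypothetical stand-ins for whatever structured representation and transition
model are actually used; the point is only that prediction happens on states, not pixels.

import torch
import torch.nn as nn

class StateSpaceForecaster(nn.Module):
    """Forecast in a learned state space rather than in raw pixels.

    The encode/predict/render split is a generic illustration of the idea,
    not the specific networks used in the thesis.
    """
    def __init__(self, state_dim=128):
        super().__init__()
        # Encoder: image -> compact, visually meaningful state.
        self.encode = nn.Sequential(
            nn.Conv2d(3, 32, 4, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(32, 64, 4, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
            nn.Linear(64, state_dim),
        )
        # Transition model: current state -> next state, applied recurrently,
        # so the forecast horizon is not tied to a fixed motion timescale.
        self.predict = nn.GRUCell(state_dim, state_dim)
        # Renderer: state -> image, keeping forecasts visually grounded.
        self.render = nn.Sequential(
            nn.Linear(state_dim, 64 * 8 * 8), nn.Unflatten(1, (64, 8, 8)),
            nn.ConvTranspose2d(64, 32, 4, stride=2, padding=1), nn.ReLU(),
            nn.ConvTranspose2d(32, 3, 4, stride=2, padding=1), nn.Sigmoid(),
        )

    def forward(self, image, horizon=5):
        s = self.encode(image)
        h = torch.zeros_like(s)
        frames = []
        for _ in range(horizon):
            h = self.predict(s, h)   # roll the state forward one step
            s = h
            frames.append(self.render(s))
        return torch.stack(frames, dim=1)  # (B, horizon, 3, 32, 32)

Because the rollout happens on states, each step stays on the learned representation
manifold by construction; only the final render step has to produce pixels.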

BibTeX

@phdthesis{Walker-2018-105856,
author = {Jacob Walker},
title = {Data-Driven Visual Forecasting},
year = {2018},
month = {April},
school = {Carnegie Mellon University},
address = {Pittsburgh, PA},
number = {CMU-RI-TR-18-12},
keywords = {Video Forecasting, Variational Autoencoders, Forecasting, Vision, Generative Adversarial Networks},
}