Compositional Video Prediction

Yufei Ye, Maneesh Singh, Abhinav Gupta, and Shubham Tulsiani

Conference Paper, Proceedings of (ICCV) International Conference on Computer Vision, pp. 10352 - 10361, October, 2019

Abstract

We present an approach for pixel-level future prediction given an input image of a scene. We observe that a scene is composed of distinct entities that undergo motion, and we present an approach that operationalizes this insight. We implicitly predict future states of independent entities while reasoning about their interactions, and compose future video frames using these predicted states. We overcome the inherent multi-modality of the task using a global trajectory-level latent random variable, and show that this allows us to sample diverse and plausible futures. We empirically validate our approach against alternative representations and alternative ways of incorporating multi-modality. We examine two datasets, one comprising stacked objects that may fall and the other containing videos of humans performing activities in a gym, and show that our approach allows realistic stochastic video prediction across these diverse settings. See the project website for video predictions.
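The idea of a trajectory-level latent variable can be illustrated with a toy sketch. The code below is not the paper's model: the hand-written dynamics, the constants, and the names `rollout` and `compose` are all hypothetical stand-ins for learned components. It only shows the structure the abstract describes, assuming 2D point entities: one latent is sampled per trajectory and reused at every step, each entity is updated with an interaction term, and frames are composed from the predicted entity states.

```python
import random


def rollout(entities, num_steps, seed=None):
    """Toy rollout over (x, y) entity positions (hypothetical dynamics)."""
    rng = random.Random(seed)
    # Trajectory-level latent: sampled ONCE and shared by every time step,
    # rather than being resampled per step.
    u = (rng.gauss(0.0, 1.0), rng.gauss(0.0, 1.0))
    states = [tuple(e) for e in entities]
    trajectory = [states]
    for _ in range(num_steps):
        # Crude interaction term: every entity sees the centroid of all entities.
        cx = sum(x for x, _ in states) / len(states)
        cy = sum(y for _, y in states) / len(states)
        states = [
            (x + 0.1 * (cx - x) + 0.05 * u[0],
             y + 0.1 * (cy - y) + 0.05 * u[1])
            for x, y in states
        ]
        trajectory.append(states)
    return u, trajectory


def compose(states, width=8, height=8):
    """Render entity states into a binary frame (stand-in for a learned decoder)."""
    frame = [[0] * width for _ in range(height)]
    for x, y in states:
        col = min(max(int(round(x)), 0), width - 1)
        row = min(max(int(round(y)), 0), height - 1)
        frame[row][col] = 1
    return frame


if __name__ == "__main__":
    start = [(1.0, 1.0), (5.0, 5.0)]
    # Different seeds give different latents, hence different sampled futures.
    for seed in (0, 1):
        u, trajectory = rollout(start, num_steps=3, seed=seed)
        frames = [compose(states) for states in trajectory]
        print(seed, u, len(frames))
```

Because the latent is drawn once per rollout, each sampled future in this sketch stays consistent over time, while different draws produce different futures from the same starting state.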

BibTeX

@conference{Ye-2019-117505,
author = {Yufei Ye and Maneesh Singh and Abhinav Gupta and Shubham Tulsiani},
title = {Compositional Video Prediction},
booktitle = {Proceedings of (ICCV) International Conference on Computer Vision},
year = {2019},
month = {October},
pages = {10352--10361},
}

Copyright notice: This material is presented to ensure timely dissemination of scholarly and technical work. Copyright and all rights therein are retained by authors or by other copyright holders. All persons copying this information are expected to adhere to the terms and constraints invoked by each author's copyright. These works may not be reposted without the explicit permission of the copyright holder.