
Synthesizing a Scene-Specific Pedestrian Detector and Pose Estimator for Static Video Surveillance

Hironori Hattori, Namhoon Lee, Vishnu Naresh Boddeti, Fares Beainy, Kris M. Kitani and Takeo Kanade
Journal Article, International Journal of Computer Vision, Vol. 126, No. 9, pp. 1027–1044, September 2018

Copyright notice: This material is presented to ensure timely dissemination of scholarly and technical work. Copyright and all rights therein are retained by authors or by other copyright holders. All persons copying this information are expected to adhere to the terms and constraints invoked by each author’s copyright. These works may not be reposted without the explicit permission of the copyright holder.

Abstract

We consider scenarios where we have zero instances of real pedestrian data (e.g., a newly installed surveillance system in a novel location for which no labeled or unlabeled real data yet exists) and a pedestrian detector must be developed prior to any observations of pedestrians. Given a single image and auxiliary scene information in the form of camera parameters and the geometric layout of the scene, our approach infers and generates a large variety of geometrically and photometrically accurate potential images of synthetic pedestrians, along with precise ground-truth labels, through the use of a computer graphics rendering engine. We first present an efficient discriminative learning method that takes these synthetic renders and generates a unique spatially-varying, geometry-preserving pedestrian appearance classifier customized for every possible location in the scene. To extend our approach to multi-task learning for further analysis (i.e., estimating the pose and segmentation of pedestrians in addition to detecting them), we build a more generalized model that employs a fully convolutional neural network architecture for multi-task learning, leveraging the “free” ground-truth annotations obtained from our pedestrian synthesizer. We demonstrate that when real human-annotated data is scarce or non-existent, our data generation strategy provides an excellent solution for an array of human activity analysis tasks, including detection, pose estimation and segmentation. Experimental results show that our approach (1) outperforms classical models and hybrid synthetic-real models, (2) outperforms various combinations of off-the-shelf state-of-the-art pedestrian detectors and pose estimators trained on real data, and (3) surprisingly, using purely synthetic data, is able to outperform models trained on real scene-specific data when that data is limited.
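The core geometric idea in the abstract — using known camera parameters and the scene's ground plane to place synthetic pedestrians at any location with consistent scale, yielding labels for free — can be illustrated with a standard pinhole-camera projection. The sketch below is not the authors' implementation; the function name, the 1.7 m pedestrian height, and the 0.41 width-to-height ratio are assumptions chosen for the example.

```python
import numpy as np

def pedestrian_bbox(K, R, t, ground_pt, height=1.7, aspect=0.41):
    """Predict the image bounding box of a pedestrian standing at a
    ground-plane point, given camera intrinsics K and extrinsics (R, t).

    World coordinates: Z is up, the ground plane is Z = 0.
    Returns (x1, y1, x2, y2) in pixels.
    """
    foot = np.asarray(ground_pt, dtype=float)       # feet on the ground plane
    head = foot + np.array([0.0, 0.0, height])      # head directly above feet

    def project(p_world):
        p_cam = R @ p_world + t                     # world -> camera frame
        p_img = K @ p_cam                           # camera -> image (homogeneous)
        return p_img[:2] / p_img[2]                 # perspective divide

    u_f, v_f = project(foot)
    u_h, v_h = project(head)
    h_px = abs(v_f - v_h)                           # pedestrian height in pixels
    w_px = aspect * h_px                            # assumed width/height ratio
    x_c = (u_f + u_h) / 2.0
    return (x_c - w_px / 2, min(v_f, v_h), x_c + w_px / 2, max(v_f, v_h))
```

Because the box size falls off with distance automatically, rendering (or classifying) at every ground-plane cell produces the spatially-varying, geometry-consistent training signal the abstract describes: nearby locations yield large pedestrians, distant ones small, with no manual annotation.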

BibTeX Reference
@article{Hattori-2018-109764,
author = {Hironori Hattori and Namhoon Lee and Vishnu Naresh Boddeti and Fares Beainy and Kris M. Kitani and Takeo Kanade},
title = {Synthesizing a Scene-Specific Pedestrian Detector and Pose Estimator for Static Video Surveillance},
journal = {International Journal of Computer Vision},
year = {2018},
month = {September},
volume = {126},
number = {9},
pages = {1027--1044},
}