Learning to Understand People via Local, Global and Temporal Reasoning

Rohit Girdhar
PhD Thesis, Tech. Report CMU-RI-TR-19-52, August 2019


Copyright notice: This material is presented to ensure timely dissemination of scholarly and technical work. Copyright and all rights therein are retained by authors or by other copyright holders. All persons copying this information are expected to adhere to the terms and constraints invoked by each author's copyright. These works may not be reposted without the explicit permission of the copyright holder.


People are among the most important entities that AI systems must understand to be useful and ubiquitous. From autonomous cars observing pedestrians to assistive robots helping the elderly, a large part of this understanding is focused on recognizing human actions and, potentially, their intentions. Humans themselves are quite good at this task: we can look at a person and explain in great detail every action they are performing. Moreover, we can reason about those actions over time, and even predict what actions they may intend to take in the future. Computer vision algorithms, on the other hand, have lagged far behind on this task.

In this thesis, we explore techniques to improve human action understanding from visual input. Our key insight is that actions depend not only on an actor's local state (parameterized by their pose), but also on the global state of their environment (parameterized by the scene, the objects, and the other people in it). Additionally, modeling the temporal state of the actors and their environment (the motion of people and objects in the scene) can further help in recognizing human actions. We exploit these dependencies in three key ways: (1) detecting, tracking, and using the actors' pose to attend to the actors and their context; (2) using this context to place a prior over possible actions in the scene; and (3) building systems capable of learning from and aggregating this local and contextual information over time to recognize human actions.

However, these methods still operate mostly at short time scales, whereas recognizing human intentions requires reasoning over long temporal horizons. One reason for the limited progress in this direction is the lack of computer vision benchmarks that actually require such reasoning: most video action classification problems are solved fairly well by the methods above from just a few frames. To remedy this, we propose a new benchmark dataset and tasks that, by design, require reasoning over time to be solved. We believe this is a first step towards building truly intelligent video understanding systems.

@phdthesis{Girdhar-2019-thesis,
  author = {Rohit Girdhar},
  title = {Learning to Understand People via Local, Global and Temporal Reasoning},
  year = {2019},
  month = {August},
  school = {Carnegie Mellon University},
  address = {Pittsburgh, PA},
  number = {CMU-RI-TR-19-52},
  keywords = {action recognition; computer vision; human understanding; video understanding; machine learning; deep learning for videos},
}