/Automatic Analysis of Facial Actions: Learning from Transductive, Supervised and Unsupervised Frameworks

Automatic Analysis of Facial Actions: Learning from Transductive, Supervised and Unsupervised Frameworks

Wen-Sheng Chu
PhD Thesis, Tech. Report, CMU-RI-TR-17-01, Robotics Institute, Carnegie Mellon University, February, 2017

Copyright notice: This material is presented to ensure timely dissemination of scholarly and technical work. Copyright and all rights therein are retained by authors or by other copyright holders. All persons copying this information are expected to adhere to the terms and constraints invoked by each author’s copyright. These works may not be reposted without the explicit permission of the copyright holder.


Automatic analysis of facial actions (AFA) can reveal a person’s emotion, intention, and physical state, and make possible a wide range of applications. To enable reliable, valid, and efficient AFA, this thesis investigates automatic analysis of facial actions through transductive, supervised and unsupervised learning.

Supervised learning for AFA is challenging, in part, because of individual differences among persons in face shape and appearance and variation in video acquisition and context. To improve generalizability across persons, we propose a transductive framework, Selective Transfer Machine (STM), which personalizes generic classifiers through joint sample reweighting and classifier learning. By personalizing classifiers, STM offers improved generalization to unknown persons. As an extension, we develop a variant of STM for use when partially labeled data are available.

Additional challenges for supervised learning include learning an optimal representation for classification, variation in base rates of action units (AUs), correlation between AUs and temporal consistency. While these challenges could be partly accommodated with an SVM or STM, a more powerful alternative is afforded by an end-to-end supervised framework (\ie, deep learning). We propose a convolutional network with long short-term memory (LSTM) and multi-label sampling strategies. We compared SVM, STM and deep learning approaches with respect to AU occurrence and intensity in and between BP4D+ and GFT databases, which consist of around 0.6 million annotated frames.

Annotated video is not always possible or desirable. We introduce an unsupervised Branch-and-Bound framework to discover correlated facial actions in unannotated video. We term this approach Common Event Discovery (CED). We evaluate CED in video and motion capture data. CED achieved moderate convergence with supervised approaches and enabled discovery of novel patterns occult to supervised approaches.

BibTeX Reference
author = {Wen-Sheng Chu},
title = {Automatic Analysis of Facial Actions: Learning from Transductive, Supervised and Unsupervised Frameworks},
year = {2017},
month = {February},
school = {Carnegie Mellon University},
address = {Pittsburgh, PA},
number = {CMU-RI-TR-17-01},