Annotation of Utterances for Conversational Nonverbal Behaviors

Allison Funkhouser
Master's Thesis, Tech. Report, CMU-RI-TR-16-25, Robotics Institute, Carnegie Mellon University, May, 2016


Copyright notice: This material is presented to ensure timely dissemination of scholarly and technical work. Copyright and all rights therein are retained by authors or by other copyright holders. All persons copying this information are expected to adhere to the terms and constraints invoked by each author's copyright. These works may not be reposted without the explicit permission of the copyright holder.


Nonverbal behaviors play an important role in communication for both humans and social robots. However, hiring trained roboticists and animators to individually animate every possible piece of dialogue is time-consuming and does not scale well. This has motivated previous researchers to develop automated systems that insert appropriate nonverbal behaviors into utterances based only on the text of the dialogue. Yet this automated strategy also has drawbacks, because there is basic semantic information that humans can easily identify but that a purely automated system does not yet capture accurately. Identifying the dominant emotion of a sentence, locating words that should be emphasized by beat gestures, and inferring the next speaker in a turn-taking scenario are all examples of data that would be useful when animating an utterance but that are difficult to determine automatically.

This work proposes a middle ground between hand-tuned animation and a purely text-based system: untrained human workers label relevant semantic information for an utterance, and these labeled sentences are then used by an automated system to produce fully animated dialogue. In this way, the relevant human-identifiable context of a scenario is preserved without requiring workers to have deep expertise in the intricacies of nonverbal behavior. Because the semantic information is independent of the robotic platform, workers are also not required to have access to a simulation or physical robot. This makes parallelizing the task much more straightforward, and overall the amount of human work required is reduced.

To test this labeling strategy, untrained workers from the Amazon Mechanical Turk website were presented with small segments of conversations and asked to answer several questions about the semantic context of the last line of dialogue. Specifically, they selected which emotion best matched the sentence and which word should receive the most emphasis. This semantic information was input to an automated system that added animations to the utterance. Videos of a social robot performing the dialogue with animations were then presented to a second set of participants, who rated them on scales adapted from the Godspeed Questionnaire Series.

Results showed that untrained workers were capable of providing reasonable labels for the semantic information in a presented utterance. When these labels were used to select animations for a social robot, the selected emotive expressions were rated as more natural and anthropomorphic than those of the control groups. More study is needed to determine the effect of the labeled emphasis gestures on perception of robot performance.

author = {Allison Funkhouser},
title = {Annotation of Utterances for Conversational Nonverbal Behaviors},
year = {2016},
month = {May},
school = {Carnegie Mellon University},
address = {Pittsburgh, PA},
number = {CMU-RI-TR-16-25},
}