Detailed Image Captioning

Junjiao Tian
Master's Thesis, Tech. Report, CMU-RI-TR-19-34, Carnegie Mellon University, July, 2019


Copyright notice: This material is presented to ensure timely dissemination of scholarly and technical work. Copyright and all rights therein are retained by authors or by other copyright holders. All persons copying this information are expected to adhere to the terms and constraints invoked by each author's copyright. These works may not be reposted without the explicit permission of the copyright holder.


While researchers have made great progress in generating syntactically correct sentences by learning from large image-sentence paired datasets, generating semantically rich and controllable content remains a major challenge. In image captioning, sequential models are preferred when fluency is an important factor in evaluation, e.g., under n-gram metrics; however, sequential models generally produce
over-generalized expressions that lack the details present in an input image and offer no controllability. In this thesis, we propose two models to tackle this challenge from different perspectives. In the first experiment, we aim to generate more detailed captions by incorporating compositional components into a sequential model. In the second experiment, we explore an attribute-based model that can include selected tag words in a target sentence.
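To make the second idea concrete, the following is a minimal sketch of attribute-conditioned decoding: a greedy decoder over a hand-written toy bigram table that boosts the score of user-selected tag words until each has been emitted, so the selected attributes appear in the caption. The tiny language model, the `boost` heuristic, and all names here are illustrative assumptions, not the thesis's actual model.

```python
# Toy bigram "language model": P(next word | current word).
# Purely illustrative; a real captioner would condition on image features.
BIGRAMS = {
    "<s>":  {"a": 0.6, "the": 0.4},
    "a":    {"dog": 0.5, "cat": 0.3, "red": 0.2},
    "the":  {"dog": 0.6, "cat": 0.4},
    "red":  {"dog": 0.6, "cat": 0.4},
    "dog":  {"runs": 0.7, "sits": 0.3},
    "cat":  {"sits": 0.8, "runs": 0.2},
    "runs": {"</s>": 1.0},
    "sits": {"</s>": 1.0},
}

def decode(tags, boost=3.0, max_len=10):
    """Greedy decoding that multiplies the probability of any
    not-yet-emitted tag word by `boost`, nudging the decoder to
    include the selected attributes in the output sentence."""
    remaining = set(tags)
    word, out = "<s>", []
    for _ in range(max_len):
        candidates = BIGRAMS[word]
        # Boost candidates that are still-unused tag words.
        scored = {w: p * (boost if w in remaining else 1.0)
                  for w, p in candidates.items()}
        word = max(scored, key=scored.get)
        if word == "</s>":
            break
        out.append(word)
        remaining.discard(word)
    return " ".join(out)
```

Without tags, the decoder follows the highest-probability path (`decode([])` gives "a dog runs"); passing a tag such as `["red"]` steers the same model to "a red dog runs". Score-boosting is only one of several constrained-decoding strategies; hard lexical constraints or copy mechanisms are common alternatives.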

@mastersthesis{tian2019detailed,
  author   = {Junjiao Tian},
  title    = {Detailed Image Captioning},
  year     = {2019},
  month    = {July},
  school   = {Carnegie Mellon University},
  address  = {Pittsburgh, PA},
  number   = {CMU-RI-TR-19-34},
  keywords = {Image Captioning, Natural Language Processing},
}