Learning Hidden Markov Model Structure for Information Extraction

Kristie Seymore, Andrew McCallum, and Ronald Rosenfeld

Workshop Paper, AAAI '99 Workshop on Machine Learning for Information Extraction, July, 1999

Abstract

Statistical machine learning techniques, while well proven in fields such as speech recognition, are just beginning to be applied to the information extraction domain. We explore the use of hidden Markov models for information extraction tasks, specifically focusing on how to learn model structure from data and how to make the best use of labeled and unlabeled data. We show that a manually-constructed model that contains multiple states per extraction field outperforms a model with one state per field, and discuss strategies for learning the model structure automatically from data. We also demonstrate that the use of distantly-labeled data to set model parameters provides a significant improvement in extraction accuracy. Our models are applied to the task of extracting important fields from the headers of computer science research papers, and achieve an extraction accuracy of 92.9%.
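As a rough illustration of the multiple-states-per-field idea mentioned in the abstract, the following minimal sketch decodes a tokenized header with Viterbi and maps each state back to its extraction field. It is not the authors' model: the state names, the `STATE_FIELD` mapping, the helper functions, and every probability are hypothetical values chosen only so that the example runs.

```python
import math

# Hypothetical toy model: each HMM state is tagged with one extraction field.
# The "title" field is covered by two states to illustrate multiple states
# per field. All probabilities below are invented for the example.
STATE_FIELD = {"title_start": "title", "title_rest": "title", "author": "author"}

START = {"title_start": 1.0}
TRANS = {
    "title_start": {"title_rest": 0.9, "author": 0.1},
    "title_rest": {"title_rest": 0.6, "author": 0.4},
    "author": {"author": 1.0},
}
EMIT = {
    "title_start": {"learning": 0.5, "hidden": 0.3, "markov": 0.2},
    "title_rest": {"hidden": 0.3, "markov": 0.3, "models": 0.4},
    "author": {"kristie": 0.5, "seymore": 0.5},
}


def log(p):
    """Log-probability, with zero probability mapped to negative infinity."""
    return math.log(p) if p > 0 else float("-inf")


def viterbi(tokens):
    """Return the most likely state sequence for the given tokens."""
    if not tokens:
        return []
    states = list(STATE_FIELD)
    # best[s] = (log-probability, path) of the best path ending in state s
    best = {
        s: (log(START.get(s, 0.0)) + log(EMIT[s].get(tokens[0], 0.0)), [s])
        for s in states
    }
    for tok in tokens[1:]:
        new_best = {}
        for s in states:
            emit = log(EMIT[s].get(tok, 0.0))
            # Pick the predecessor state that maximizes the path score.
            score, path = max(
                (
                    (best[p][0] + log(TRANS[p].get(s, 0.0)) + emit, best[p][1])
                    for p in states
                ),
                key=lambda item: item[0],
            )
            new_best[s] = (score, path + [s])
        best = new_best
    _, path = max(best.values(), key=lambda item: item[0])
    return path


def extract_fields(tokens):
    """Label each token with the field of its decoded state."""
    return [(tok, STATE_FIELD[s]) for tok, s in zip(tokens, viterbi(tokens))]


header = ["learning", "hidden", "markov", "models", "kristie", "seymore"]
for token, field in extract_fields(header):
    print(token, field)
```

In this sketch, splitting the title field into `title_start` and `title_rest` lets the toy model give the first word of a title its own emission distribution, which a single title state could not do.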

BibTeX

@workshop{Seymore-1999-16660,
author = {Kristie Seymore and Andrew McCallum and Ronald Rosenfeld},
title = {Learning Hidden Markov Model Structure for Information Extraction},
booktitle = {Proceedings of AAAI '99 Workshop on Machine Learning for Information Extraction},
year = {1999},
month = {July},
}
