Applying Machine Learning for High Performance Named-Entity Extraction

Shumeet Baluja, Vibhu Mittal, and Rahul Sukthankar

Conference Paper, Proceedings of Pacific Association for Computational Linguistics Conference (PACLING '99), August, 1999

Abstract

This paper describes a machine learning approach to building an efficient, accurate and fast name spotting system. Finding names in free text is an important task in addressing real-world text-based applications. Most previous approaches have been based on carefully hand-crafted modules encoding linguistic knowledge specific to the language and document genre. Such approaches have two drawbacks: they require large amounts of time and linguistic expertise to develop, and they are not easily portable to new languages and genres. This paper describes an extensible system which automatically combines weak evidence for name extraction. This evidence is gathered from easily available sources: part-of-speech tagging, dictionary lookups, and textual information such as capitalization and punctuation. Individually, each piece of evidence is insufficient for robust name detection. However, the combination of evidence, through standard machine learning techniques, yields a system that achieves performance equivalent to the best existing hand-crafted approaches.
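
The idea of combining individually weak cues can be illustrated with a minimal sketch. It is not the paper's implementation: the abstract does not name the learner, the feature set, or the dictionaries, so the perceptron, the four features, the tiny word lists, and all function names below are assumptions chosen for illustration. Part-of-speech tags are left out to keep the example self-contained.

```python
# Illustrative sketch only: a perceptron stands in for the unspecified
# "standard machine learning techniques". All names and data are hypothetical.

NAME_DICT = {"mary", "john", "rahul"}                       # toy name dictionary
COMMON_DICT = {"the", "analyst", "met", "wrote", "report"}  # toy common-word dictionary
SENTENCE_END = {".", "!", "?"}


def token_features(tokens, i):
    """Return four weak binary cues for the token at position i."""
    tok = tokens[i]
    return [
        1.0 if tok[:1].isupper() else 0.0,                        # capitalized
        1.0 if i == 0 or tokens[i - 1] in SENTENCE_END else 0.0,  # sentence-initial
        1.0 if tok.lower() in NAME_DICT else 0.0,                 # name dictionary hit
        1.0 if tok.lower() in COMMON_DICT else 0.0,               # common-word hit
    ]


def score(w, b, x):
    """Weighted sum of the cues plus a bias term."""
    return sum(wi * xi for wi, xi in zip(w, x)) + b


def train_perceptron(sentences, epochs=20):
    """Learn cue weights from (tokens, labels) pairs; label 1 marks a name token."""
    w, b = [0.0] * 4, 0.0
    for _ in range(epochs):
        for tokens, labels in sentences:
            for i, y in enumerate(labels):
                x = token_features(tokens, i)
                err = y - (1 if score(w, b, x) > 0 else 0)
                if err:  # update only on a misclassified token
                    w = [wi + err * xi for wi, xi in zip(w, x)]
                    b += err
    return w, b


def spot_names(tokens, w, b):
    """Return the tokens whose combined evidence scores above zero."""
    return [t for i, t in enumerate(tokens)
            if score(w, b, token_features(tokens, i)) > 0]


# Hand-labeled toy training data (hypothetical).
TRAIN = [
    (["The", "analyst", "met", "Mary", "Smith", "."], [0, 0, 0, 1, 1, 0]),
    (["Rahul", "wrote", "the", "report", "."], [1, 0, 0, 0, 0]),
]
```

In this toy setting no single cue is enough: capitalization alone would flag the sentence-initial "The", and the name dictionary alone would miss "Smith". Calling `train_perceptron(TRAIN)` and passing the resulting weights to `spot_names` lets the learned weights trade the cues off against each other, which is the kind of combination the abstract describes.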

BibTeX

@conference{Baluja-1999-16692,
author = {Shumeet Baluja and Vibhu Mittal and Rahul Sukthankar},
title = {Applying Machine Learning for High Performance Named-Entity Extraction},
booktitle = {Proceedings of Pacific Association for Computational Linguistics Conference (PACLING '99)},
year = {1999},
month = {August},
}

Copyright notice: This material is presented to ensure timely dissemination of scholarly and technical work. Copyright and all rights therein are retained by authors or by other copyright holders. All persons copying this information are expected to adhere to the terms and constraints invoked by each author's copyright. These works may not be reposted without the explicit permission of the copyright holder.