Alias Detection in Link Data Sets

Paul Hsiung, Andrew Moore, Daniel Neill, and Jeff Schneider

Conference Paper, Proceedings of International Conference on Intelligence Analysis (IA '05), May, 2005

Abstract

The problem of detecting aliases - multiple text string identifiers corresponding to the same entity - is increasingly important in the domains of biology, intelligence, marketing, and geoinformatics. Aliases arise from entities who are trying to hide their identities, from a person with multiple names, or from words which are unintentionally or even intentionally misspelled. While purely orthographic methods (e.g. string similarity) can help solve unintentional spelling cases, many types of alias (including those adopted with malicious intent) can fool these methods. However, if an entity has a changed name in some context, several or all of the set of other entities with which it has relationships can remain stable. Thus, the local social network can be exploited by using the relationships as semantic information. The proposed combined algorithm takes advantage of both orthographic and semantic information to detect aliases. By applying the best combination of both types of information, the combined algorithm outperforms the ones built solely on one type of information or the other. Empirical results on three real world data sets support this claim.
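
The idea of combining the two kinds of evidence can be illustrated with a minimal Python sketch. It is not the paper's algorithm: the abstract does not specify the features or how they are combined, so everything here is an assumption made for illustration. The orthographic score uses the standard library's `difflib.SequenceMatcher`, the semantic score is a Jaccard overlap of neighbor sets, the two are mixed by a hypothetical `weight` parameter, and the names in `toy_links` are invented.

```python
from __future__ import annotations

from difflib import SequenceMatcher

# Hypothetical toy link data: each name maps to the entities it is linked to.
toy_links: dict[str, set[str]] = {
    "J. Smith": {"Acme", "Bob", "Carol"},
    "John Smyth": {"Acme", "Bob", "Carol"},  # similar spelling, same neighbors
    "J. Smits": {"Zed"},                     # similar spelling, different neighbors
    "Red Fox": {"Acme", "Bob", "Carol"},     # different spelling, same neighbors
}


def orthographic_score(a: str, b: str) -> float:
    """String similarity in [0, 1]; difflib's ratio is an illustrative choice."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def semantic_score(a: str, b: str, links: dict[str, set[str]]) -> float:
    """Jaccard overlap of the two names' neighbor sets in the link data."""
    neighbors_a = links.get(a, set()) - {b}
    neighbors_b = links.get(b, set()) - {a}
    union = neighbors_a | neighbors_b
    return len(neighbors_a & neighbors_b) / len(union) if union else 0.0


def combined_score(a: str, b: str, links: dict[str, set[str]],
                   weight: float = 0.5) -> float:
    """Weighted mix: weight=1.0 is purely orthographic, 0.0 purely semantic."""
    return (weight * orthographic_score(a, b)
            + (1.0 - weight) * semantic_score(a, b, links))


def rank_aliases(name: str, links: dict[str, set[str]],
                 weight: float = 0.5) -> list[str]:
    """Return all other names, most likely alias first."""
    candidates = [c for c in links if c != name]
    return sorted(candidates,
                  key=lambda c: combined_score(name, c, links, weight),
                  reverse=True)


if __name__ == "__main__":
    for w in (1.0, 0.5):
        print(w, rank_aliases("J. Smith", toy_links, weight=w))
```

On this toy data, the purely orthographic ranking (`weight=1.0`) puts "J. Smits" first even though it shares no neighbors with "J. Smith", while the mixed ranking (`weight=0.5`) puts "John Smyth" first. A fixed weight is the simplest possible combination; choosing the combination from labeled examples would be a natural next step, but how the paper does this is not stated in the abstract.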

BibTeX

@conference{Hsiung-2005-9172,
author = {Paul Hsiung and Andrew Moore and Daniel Neill and Jeff Schneider},
title = {Alias Detection in Link Data Sets},
booktitle = {Proceedings of International Conference on Intelligence Analysis (IA '05)},
year = {2005},
month = {May},
publisher = {International Conference on Intelligence Analysis},
keywords = {Link Analysis, Applications, Active Learning},
}

Copyright notice: This material is presented to ensure timely dissemination of scholarly and technical work. Copyright and all rights therein are retained by authors or by other copyright holders. All persons copying this information are expected to adhere to the terms and constraints invoked by each author's copyright. These works may not be reposted without the explicit permission of the copyright holder.