master's thesis, tech. report CMU-RI-TR-04-22, Robotics Institute, Carnegie Mellon University, March, 2004
The problem of detecting aliases (multiple text-string identifiers corresponding to the same entity) is increasingly important in the domains of biology, intelligence, marketing, and geoinformatics. This report investigates the extent to which probabilistic methods can help.
Aliases arise from entities that are trying to hide their identities, from people with multiple names, and from words that are misspelled, whether unintentionally or intentionally. While purely orthographic methods (e.g., string similarity) can help resolve unintentional misspellings, many types of alias (including those adopted with malicious intent) can fool these methods.
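The limitation of orthographic methods can be illustrated with a minimal sketch. It assumes Python's standard `difflib.SequenceMatcher` as the string-similarity measure, which is a stand-in chosen for illustration and not necessarily the measure used in the report; the function name `string_similarity` and the example names are hypothetical.

```python
from difflib import SequenceMatcher


def string_similarity(a: str, b: str) -> float:
    """Return a similarity ratio in [0, 1] between two names, ignoring case."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


# Hypothetical names: a misspelling stays close to the original string.
typo_score = string_similarity("Jonathan Smith", "Jonathon Smith")

# Hypothetical names: a deliberately chosen alias shares few characters with it.
alias_score = string_similarity("Jonathan Smith", "Mr. Green")

print(f"misspelling: {typo_score:.2f}, deliberate alias: {alias_score:.2f}")
```

The misspelled pair scores high and the deliberate alias scores low, so a threshold on string similarity alone would catch the first case and miss the second.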
However, even when an entity changes its name in some context, some or all of the other entities with which it has relationships can remain stable. Thus, the entity's local social network can be exploited by treating those relationships as semantic information.
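One way to picture this semantic signal is to compare the sets of entities that two names are linked to. The sketch below assumes link data given as sets of co-occurring names and uses Jaccard overlap of neighbor sets; both the data layout and the overlap measure are assumptions made for illustration, not the report's stated method, and all names (`build_neighbors`, `neighbor_overlap`, the people) are hypothetical.

```python
from collections import defaultdict


def build_neighbors(links):
    """Map each name to the set of other names it co-occurs with in any link."""
    neighbors = defaultdict(set)
    for link in links:
        for name in link:
            neighbors[name].update(n for n in link if n != name)
    return neighbors


def neighbor_overlap(neighbors, a, b):
    """Jaccard overlap of two names' neighbor sets, excluding the pair itself."""
    na = neighbors.get(a, set()) - {b}
    nb = neighbors.get(b, set()) - {a}
    union = na | nb
    return len(na & nb) / len(union) if union else 0.0


# Hypothetical link data: each set is one observed group of related names.
links = [
    {"Jonathan Smith", "Alice", "Bob"},
    {"Mr. Green", "Alice", "Bob"},
    {"Carol", "Dave"},
]
neighbors = build_neighbors(links)

same_circle = neighbor_overlap(neighbors, "Jonathan Smith", "Mr. Green")
different_circle = neighbor_overlap(neighbors, "Jonathan Smith", "Carol")
print(same_circle, different_circle)
```

Here the two names that look nothing alike share the same neighbors, which is the kind of evidence a purely orthographic comparison cannot see.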
The proposed combined algorithm uses both orthographic and semantic information to detect aliases. By applying the best combination of the two types of information, it outperforms algorithms built on only one type. Empirical results on three real-world data sets support this claim.
Thesis Supervisor: Dr. Andrew W. Moore, A. Nico Habermann Professor
Associated Project(s): Auton Project
Paul Hsiung, "Alias Detection in Link Data Sets," master's thesis, tech. report CMU-RI-TR-04-22, Robotics Institute, Carnegie Mellon University, March, 2004
author = "Paul Hsiung",
title = "Alias Detection in Link Data Sets",
booktitle = "",
school = "Robotics Institute, Carnegie Mellon University",
month = "March",
year = "2004",
address = "Pittsburgh, PA",
The Robotics Institute is part of the School of Computer Science, Carnegie Mellon University.