Carnegie Mellon Robotics Institute
Jeremy Martin Kubica and Andrew Moore
tech. report CMU-RI-TR-02-26, Robotics Institute, Carnegie Mellon University, October, 2002
| Abstract |
| Real-world data is never as perfect as we would like it to be and often suffers from corruptions that can affect interpretations of the data, models created from the data, and decisions based on the data. One approach to this problem is outlier detection or anomaly detection, in which an algorithm identifies and removes entire suspect records. But if only certain fields in a record have been corrupted, then useful uncorrupted data will also be thrown out. In this paper we present an approach for identifying corrupted fields and using the remaining uncorrupted fields for subsequent modeling and analysis. Our approach learns a probabilistic model from the data that contains three components: a generative model of the clean data points, a generative model of the noise values, and a probabilistic model of the corruption process. We provide an algorithm for the unsupervised discovery of such models and empirically evaluate both its performance at detecting corrupted fields and the resulting improvement this gives to a classifier. |
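To make the three-component model concrete, here is a minimal sketch of how such a model could score individual fields. It is not the algorithm from the report: every modeling choice is an assumption made for illustration. Each field is treated independently, clean values follow a Gaussian, noise values follow a uniform distribution, and each field is corrupted with a fixed prior probability `p_corrupt`. The parameters are supplied by hand rather than learned from data, and all function and parameter names are hypothetical.

```python
import math


def gaussian_pdf(x, mean, std):
    """Density of the assumed clean-data model for one field."""
    z = (x - mean) / std
    return math.exp(-0.5 * z * z) / (std * math.sqrt(2.0 * math.pi))


def uniform_pdf(x, low, high):
    """Density of the assumed noise model for one field."""
    return 1.0 / (high - low) if low <= x <= high else 0.0


def corruption_posterior(x, mean, std, low, high, p_corrupt):
    """Posterior probability that a single field value is corrupted.

    Bayes' rule over two hypotheses: the value came from the noise
    model (prior p_corrupt) or from the clean model (prior 1 - p_corrupt).
    """
    noise = p_corrupt * uniform_pdf(x, low, high)
    clean = (1.0 - p_corrupt) * gaussian_pdf(x, mean, std)
    total = noise + clean
    # If both likelihoods are zero, fall back to the prior.
    return noise / total if total > 0.0 else p_corrupt


def flag_corrupted_fields(record, params, p_corrupt=0.05, threshold=0.5):
    """Return one boolean per field: True if the field looks corrupted.

    params holds one (mean, std, low, high) tuple per field
    (a hypothetical layout chosen for this sketch).
    """
    return [
        corruption_posterior(x, mean, std, low, high, p_corrupt) > threshold
        for x, (mean, std, low, high) in zip(record, params)
    ]


# Two fields, both with clean model N(0, 1) and noise model U(-10, 10).
params = [(0.0, 1.0, -10.0, 10.0), (0.0, 1.0, -10.0, 10.0)]
flags = flag_corrupted_fields([0.1, 9.0], params)
```

In this example only the second field is flagged, so the first field of the record remains usable for downstream modeling instead of the whole record being discarded.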
| Keywords |
| data mining |
| Text Reference |
| Jeremy Martin Kubica and Andrew Moore, "Probabilistic Noise Identification and Data Cleaning," tech. report CMU-RI-TR-02-26, Robotics Institute, Carnegie Mellon University, October, 2002 |
| BibTeX Reference |
@techreport{Kubica_2002_4113,
  author      = "Jeremy Martin Kubica and Andrew Moore",
  title       = "Probabilistic Noise Identification and Data Cleaning",
  institution = "Robotics Institute",
  month       = "October",
  year        = "2002",
  number      = "CMU-RI-TR-02-26",
  address     = "Pittsburgh, PA",
}
| The Robotics Institute is part of the School of Computer Science, Carnegie Mellon University. |