The Robotics Institute

Probabilistic Noise Identification and Data Cleaning
J.M. Kubica and A. Moore
The Third IEEE International Conference on Data Mining, IEEE Computer Society, November, 2003, pp. 131-138.

Download

Adobe portable document format (pdf) [69 KB]
Compressed postscript (ps.gz) [56 KB]

Copyright notice: This material is presented to ensure timely dissemination of scholarly and technical work. Copyright and all rights therein are retained by authors or by other copyright holders. All persons copying this information are expected to adhere to the terms and constraints invoked by each author's copyright. These works may not be reposted without the explicit permission of the copyright holder.

Abstract

Real-world data is never as perfect as we would like it to be and can often suffer from corruptions that may impact interpretations of the data, models created from the data, and decisions made based on the data. One approach to this problem is to identify and remove records that contain corruptions. Unfortunately, if only certain fields in a record have been corrupted, then usable, uncorrupted data will be lost. In this paper we present LENS, an approach for identifying corrupted fields and using the remaining non-corrupted fields for subsequent modeling and analysis. Our approach uses the data to learn a probabilistic model containing three components: a generative model of the clean records, a generative model of the noise values, and a probabilistic model of the corruption process. We provide an algorithm for the unsupervised discovery of such models and empirically evaluate both its performance at detecting corrupted fields and, as one example application, the resulting improvement this gives to a classifier.
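
To make the three-component structure concrete, the sketch below shows one simple way such components could combine through Bayes' rule to score a single field. It is not the LENS algorithm from the paper. It assumes an independent Gaussian per field as the clean model, a uniform distribution as the noise model, and a fixed per-field corruption prior, and it takes all parameters as given rather than learning them from the data. The function names, parameters, and numbers are hypothetical.

```python
import math


def gaussian_pdf(x, mean, std):
    """Density of a normal distribution (assumed clean-field model)."""
    z = (x - mean) / std
    return math.exp(-0.5 * z * z) / (std * math.sqrt(2.0 * math.pi))


def corruption_posterior(x, mean, std, noise_low, noise_high, prior):
    """P(field is corrupted | observed value x) via Bayes' rule.

    Assumptions (illustrative only): clean values are Gaussian, noise values
    are uniform on [noise_low, noise_high], and each field is corrupted
    independently with probability `prior`.
    """
    clean = (1.0 - prior) * gaussian_pdf(x, mean, std)
    inside = noise_low <= x <= noise_high
    noise = prior / (noise_high - noise_low) if inside else 0.0
    total = clean + noise
    # If both terms are zero the model gives no evidence; report 0.0.
    return noise / total if total > 0.0 else 0.0


def flag_fields(record, means, stds, noise_low, noise_high, prior,
                threshold=0.5):
    """Return one boolean per field: True where corruption is more likely."""
    return [
        corruption_posterior(x, m, s, noise_low, noise_high, prior) > threshold
        for x, m, s in zip(record, means, stds)
    ]


# Hypothetical record with three fields; the second value is far from
# the assumed clean distribution.
record = [0.1, 9.0, -0.3]
flags = flag_fields(
    record,
    means=[0.0, 0.0, 0.0],
    stds=[1.0, 1.0, 1.0],
    noise_low=-10.0,
    noise_high=10.0,
    prior=0.1,
)
print(flags)

# Keep only the fields judged clean, instead of discarding the whole record.
usable = [x for x, bad in zip(record, flags) if not bad]
print(usable)
```

Flagging individual fields rather than whole records is what lets the remaining values in a partly corrupted record stay available for later modeling, which is the motivation given in the abstract.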

Notes

Number of pages: 8

Text Reference

J.M. Kubica and A. Moore, "Probabilistic Noise Identification and Data Cleaning," The Third IEEE International Conference on Data Mining, IEEE Computer Society, November, 2003, pp. 131-138.

BibTeX Reference

@inproceedings{Kubica_2003_4547,
   author = "Jeremy Martin Kubica and Andrew Moore",
   title = "Probabilistic Noise Identification and Data Cleaning",
   booktitle = "The Third IEEE International Conference on Data Mining",
   month = "November",
   year = "2003",
   pages = "131-138",
   publisher = "IEEE Computer Society"
}


The Robotics Institute is part of the School of Computer Science, Carnegie Mellon University.