Making Logistic Regression A Core Data Mining Tool: A Practical Investigation of Accuracy, Speed, and Simplicity

Paul Komarek and Andrew Moore
tech. report CMU-RI-TR-05-27, Robotics Institute, Carnegie Mellon University, May, 2005



Abstract
Binary classification is a core data mining task. For large datasets or real-time applications, desirable classifiers are accurate, fast, and automatic (i.e., they require no parameter tuning). Naive Bayes and decision trees are fast and parameter-free, but their accuracy is often below the state of the art. Linear support vector machines (SVMs) are fast and have good accuracy, but current implementations are sensitive to the capacity parameter. SVMs with radial basis function (RBF) kernels are accurate but slow, and have multiple parameters that require tuning.

In this paper we demonstrate that a very simple parameter-free implementation of logistic regression (LR) is sufficiently accurate and fast to compete with state-of-the-art binary classifiers on large real-world datasets. The accuracy is comparable to per-dataset tuned linear SVMs and, in higher dimensions, to tuned RBF SVMs. A combination of regularization, truncated-Newton methods, and iteratively reweighted least squares (IRLS) makes this implementation faster than SVMs and relatively insensitive to parameters. Our fitting procedure, TR-IRLS, appears to outperform several common LR fitting procedures in our experiments. TR-IRLS is robust to linear dependencies and scaling problems in the data, and no data preprocessing is necessary. TR-IRLS is easy to implement and can be used anywhere that IRLS is used. Convergence guarantees can be stated for generalized linear models with canonical links.
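
The combination named in the abstract (regularization, a truncated inner solve, and IRLS) can be illustrated with a minimal sketch. This is not the authors' TR-IRLS code: the function names, the ridge value, the iteration caps, and the tolerances below are all hypothetical choices made for illustration, and the ridge penalty is applied to every coefficient, including the intercept, purely for simplicity. Each outer IRLS step forms the weighted least-squares system (X^T W X + ridge*I) beta = X^T (W*eta + y - mu) and solves it only approximately with a capped number of conjugate-gradient iterations, warm-started from the current beta. Plain Python lists are used so the sketch stays self-contained.

```python
import math


def sigmoid(t):
    """Numerically stable logistic function."""
    if t >= 0:
        return 1.0 / (1.0 + math.exp(-t))
    e = math.exp(t)
    return e / (1.0 + e)


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def matvec(X, v):
    """Compute X v for a list-of-rows matrix X."""
    return [dot(row, v) for row in X]


def rmatvec(X, u):
    """Compute X^T u without forming the transpose."""
    out = [0.0] * len(X[0])
    for row, ui in zip(X, u):
        for j, xij in enumerate(row):
            out[j] += xij * ui
    return out


def cg_solve(apply_A, b, x0, max_iter, tol):
    """Conjugate gradient for A x = b, stopped early (truncated) at max_iter."""
    x = list(x0)
    r = [bi - ai for bi, ai in zip(b, apply_A(x))]
    p = list(r)
    rs = dot(r, r)
    for _ in range(max_iter):
        if math.sqrt(rs) <= tol:
            break
        Ap = apply_A(p)
        alpha = rs / dot(p, Ap)
        x = [xi + alpha * pi for xi, pi in zip(x, p)]
        r = [ri - alpha * api for ri, api in zip(r, Ap)]
        rs_new = dot(r, r)
        p = [ri + (rs_new / rs) * pi for ri, pi in zip(r, p)]
        rs = rs_new
    return x


def fit_lr(X, y, ridge=1.0, irls_iters=30, cg_iters=30, cg_tol=1e-8, tol=1e-8):
    """Ridge-regularized logistic regression via IRLS with a CG inner solve.

    All default values are assumptions for this sketch, not values from the report.
    """
    beta = [0.0] * len(X[0])
    for _ in range(irls_iters):
        eta = matvec(X, beta)
        mu = [sigmoid(t) for t in eta]
        w = [m * (1.0 - m) for m in mu]  # IRLS weights

        def apply_A(v):
            # (X^T W X + ridge * I) v, computed with matrix-vector products only.
            Xv = matvec(X, v)
            out = rmatvec(X, [wi * t for wi, t in zip(w, Xv)])
            return [o + ridge * vi for o, vi in zip(out, v)]

        # Right-hand side X^T W z with z = eta + (y - mu) / w, written so that
        # no division by a possibly tiny weight is needed.
        rhs = rmatvec(X, [wi * e + (yi - m) for wi, e, yi, m in zip(w, eta, y, mu)])
        new_beta = cg_solve(apply_A, rhs, beta, cg_iters, cg_tol)
        step = max(abs(a - c) for a, c in zip(new_beta, beta))
        beta = new_beta
        if step < tol:
            break
    return beta


def penalized_gradient(X, y, beta, ridge):
    """Gradient of the ridge-penalized log-likelihood: X^T (y - mu) - ridge * beta."""
    mu = [sigmoid(t) for t in matvec(X, beta)]
    g = rmatvec(X, [yi - m for yi, m in zip(y, mu)])
    return [gi - ridge * bi for gi, bi in zip(g, beta)]


# Toy, non-separable data: an intercept column plus one feature.
xs = [-2.0, -1.0, -0.5, 0.5, 1.0, 2.0, 0.3, -0.3]
X = [[1.0, x] for x in xs]
y = [0, 0, 0, 1, 1, 1, 0, 1]
beta = fit_lr(X, y, ridge=1.0)
print("coefficients:", beta)
print("gradient at solution:", penalized_gradient(X, y, beta, 1.0))
```

Because the inner solver only needs products with X and X^T, the same structure carries over to sparse data without forming X^T W X explicitly. The penalized gradient printed at the end should be close to zero when the outer loop has converged, which gives a simple sanity check for the sketch.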


Keywords
logistic regression, scalable algorithms, iteratively reweighted least squares, binary classification

Notes
Associated Lab(s) / Group(s): Auton Lab
Associated Project(s): Auton Project
Number of pages: 13

Text Reference
Paul Komarek and Andrew Moore, "Making Logistic Regression A Core Data Mining Tool: A Practical Investigation of Accuracy, Speed, and Simplicity," tech. report CMU-RI-TR-05-27, Robotics Institute, Carnegie Mellon University, May, 2005

BibTeX Reference
@techreport{Komarek_2005_5029,
   author = "Paul Komarek and Andrew Moore",
   title = "Making Logistic Regression A Core Data Mining Tool: A Practical Investigation of Accuracy, Speed, and Simplicity",
   institution = "Robotics Institute",
   month = "May",
   year = "2005",
   number = "CMU-RI-TR-05-27",
   address = "Pittsburgh, PA"
}