Learning with Clusters

PhD Thesis, Tech. Report, CMU-RI-TR-19-02, Robotics Institute, Carnegie Mellon University, January, 2019

View Publication

Abstract

Clustering, the problem of grouping similar data, has been extensively studied since at least the 1950's. As machine learning becomes more prominent, clustering has evolved from primarily a data analysis tool into an integrated component of complex robotic and machine learning systems, including those involving dimensionality reduction, anomaly detection, network analysis, image segmentation and classifying groups of data.

With this integration into multi-stage systems comes a need to better understand interactions between pipeline components. Changing parameters of the clustering algorithm will impact downstream components and, quite unfortunately, it is usually not possible to simply backpropagate through the entire system. Instead, it is common practice to take the output of the clustering algorithm as ground truth at the next module of the pipeline. We show this false assumption causes subtle and dangerous behavior for even the simplest systems -- empirically biasing results by upwards of 25%.

We address this gap by developing scalable estimators and methods to both quantify and compensate the impact of clustering errors on downstream learners. Our work is agnostic to the choice of other components of the machine learning systems, and requires few assumptions on the clustering algorithm. Theoretical and empirical results demonstrate our methods and estimators are superior to the current naive approaches, which do not account for clustering errors.

We also develop several new clustering algorithms and prove theoretical bounds for existing algorithms, to be used as inputs to our error-correction methods. Not surprisingly, we find that learning on clusters of data is both theoretically and empirically easier as the number of clustering errors decreases. Thus, our work is two-fold. We attempt to provide the best clustering possible as well as establish how to effectively learn on inevitably noisy clusters.

BibTeX

@phdthesis{Barnes-2019-110838,
author = {Matthew Barnes},
title = {Learning with Clusters},
year = {2019},
month = {January},
school = {Carnegie Mellon University},
address = {Pittsburgh, PA},
number = {CMU-RI-TR-19-02},
keywords = {Clustering; Record linkage; Statistical estimators; Interaction effects; Machine learning; Out-of-cluster error; Graph clustering},
}

Copyright notice: This material is presented to ensure timely dissemination of scholarly and technical work. Copyright and all rights therein are retained by authors or by other copyright holders. All persons copying this information are expected to adhere to the terms and constraints invoked by each author's copyright. These works may not be reposted without the explicit permission of the copyright holder.