PhD Thesis Proposal
Carnegie Mellon University
4:00 pm - 5:00 pm
Gates Hillman Center 4405
As machine learning becomes more ubiquitous, clustering has evolved from primarily a data analysis tool into an integrated component of complex machine learning systems, including those involving dimensionality reduction, anomaly detection, network analysis, image segmentation, and group-level classification. With this integration into multi-stage systems comes a need to better understand interactions between pipeline components. Changing the parameters of the clustering algorithm affects downstream components, and unfortunately it is usually not possible to simply back-propagate through the entire system. Currently, as with many machine learning systems, the output of the clustering algorithm is taken as ground truth at the next pipeline step. Our empirical results show that this false assumption can have dramatic consequences, sometimes biasing results by upwards of 25%.
We address this gap by developing estimators and methods to both quantify and correct for the impact of clustering errors on downstream learners. Our work is agnostic to the downstream learners and requires few assumptions about the clustering algorithm. Theoretical and empirical results demonstrate that our methods and estimators are superior to the current naive approaches, which do not account for clustering errors.
Along these lines, we also develop several new clustering algorithms and prove theoretical bounds for existing algorithms, to be used as inputs to our error-correction methods. Not surprisingly, we find that learning on clusters of data is both theoretically and empirically easier as the number of clustering errors decreases. Our work is therefore two-fold: we aim both to provide the best clustering possible and to learn on inevitably noisy clusters.
A major limiting factor in our error-correction methods is scalability. Currently, their computational complexity is O(n^3), where n is the size of the training dataset; this restricts their applicability to very small machine learning problems. We propose addressing this scalability issue through approximation. It should be possible to reduce the computational complexity to O(p^3), where p, the number of parameters in the approximation, is a small fixed constant independent of n.
Thesis Committee Members:
Artur Dubrawski, Chair
Beka Steorts, Duke University