The Anchors Hierarchy: Using the triangle inequality to survive high dimensional data

Andrew Moore

Tech. Report, CMU-RI-TR-00-05, Robotics Institute, Carnegie Mellon University, February, 2000

View Publication

Abstract

This paper is about the use of metric data structures in high-dimensional or non-Euclidean space to permit cached sufficient statistics accelerations of learning algorithms. It has recently been shown that for less than about 10 dimensions, decorating kd-trees with additional "cached sufficient statistics" such as first and second moments and contingency tables can provide satisfying acceleration for a very wide range of statistical learning tasks such as kernel regression, locally weighted regression, k-means clustering, mixture modeling and Bayes Net learning. In this paper, we begin by defining the anchors hierarchy---a fast data structure and algorithm for localizing data based only on a triangle-inequality-obeying distance metric. We show how this, in its own right, gives a fast and effective clustering of data. But more importantly we show how it can produce a well-balanced structure similar to a Ball-Tree (Omohundro 1990) or a kind of metric tree (Uhlmann 1991)in a way that is neither "top-down'' nor "bottom-up'' but instead "middle-out". We then show how this structure, decorated with cached sufficient statistics, allows a wide variety of statistical learning algorithms to be accelerated even in thousands of dimensions.

BibTeX

@techreport{Moore-2000-7978,
author = {Andrew Moore},
title = {The Anchors Hierarchy: Using the triangle inequality to survive high dimensional data},
year = {2000},
month = {February},
institute = {Carnegie Mellon University},
address = {Pittsburgh, PA},
number = {CMU-RI-TR-00-05},
}

Copyright notice: This material is presented to ensure timely dissemination of scholarly and technical work. Copyright and all rights therein are retained by authors or by other copyright holders. All persons copying this information are expected to adhere to the terms and constraints invoked by each author's copyright. These works may not be reposted without the explicit permission of the copyright holder.