
Consistent procedures for cluster tree estimation and pruning

Abstract

For a density $f$ on $\mathbb{R}^d$, a {\it high-density cluster} is any connected component of $\{x : f(x) \geq \lambda\}$, for some $\lambda > 0$. The set of all high-density clusters forms a hierarchy called the {\it cluster tree} of $f$. We present two procedures for estimating the cluster tree given samples from $f$. The first is a robust variant of the single linkage algorithm for hierarchical clustering. The second is based on the $k$-nearest neighbor graph of the samples. We give finite-sample convergence rates for these algorithms which also imply consistency, and we derive lower bounds on the sample complexity of cluster tree estimation. Finally, we study a tree pruning procedure that is guaranteed, under milder conditions than usual, to remove clusters that are spurious while recovering those that are salient.
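
To make the notion of a high-density cluster concrete, the following is a minimal illustrative sketch in Python, not the procedures analyzed in the paper. It assumes a simple plug-in scheme that the abstract does not spell out: a sample point is kept at scale $r$ if its distance to its $k$-th nearest neighbor is at most $r$, kept points are joined when they lie within $\alpha r$ of each other, and the connected components of the resulting graph stand in for the clusters at that level. The function name `level_clusters` and the parameters `k`, `r`, and `alpha` are hypothetical choices made for this example.

```python
import numpy as np


def level_clusters(points, k, r, alpha=1.0):
    """Label the connected components of a graph on 'dense' sample points.

    Assumed scheme (illustrative only): a point is kept if its k-th nearest
    neighbor lies within distance r; kept points closer than alpha * r are
    linked. Returns one label per point, with -1 for points that are dropped.
    Requires k < len(points).
    """
    points = np.asarray(points, dtype=float)
    n = len(points)

    # Pairwise Euclidean distances (O(n^2) memory, fine for small samples).
    diff = points[:, None, :] - points[None, :, :]
    dist = np.sqrt((diff ** 2).sum(axis=-1))

    # Column 0 of each sorted row is the point itself (distance 0),
    # so column k holds the distance to the k-th nearest neighbor.
    r_k = np.sort(dist, axis=1)[:, k]
    active = r_k <= r

    # Union-find over the kept points.
    parent = list(range(n))

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path halving
            i = parent[i]
        return i

    for i in range(n):
        if not active[i]:
            continue
        for j in range(i + 1, n):
            if active[j] and dist[i, j] <= alpha * r:
                parent[find(i)] = find(j)

    # Assign consecutive labels in order of first appearance.
    labels = np.full(n, -1)
    roots = {}
    for i in range(n):
        if active[i]:
            labels[i] = roots.setdefault(find(i), len(roots))
    return labels
```

Running this function over an increasing sequence of values of `r` yields a nested family of components, which is one informal way to picture how a hierarchy like the cluster tree arises from samples. The quadratic distance matrix is chosen here for brevity rather than efficiency.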
