Revisiting k-means: New Algorithms via Bayesian Nonparametrics

International Conference on Machine Learning (ICML), 2011

2 November 2011

Abstract

One of the many benefits of Bayesian nonparametric processes such as the Dirichlet process is that they can be used for modeling infinite mixture models, thus providing a flexible answer to the question of how many clusters exist in a data set. For the most part, such flexibility is currently lacking in techniques based on hard clustering, such as k-means, graph cuts, and Bregman hard clustering. For finite mixture models, there is a precise connection between k-means and mixtures of Gaussians, obtained by an appropriate limiting argument. In this paper, we apply a similar technique to an infinite mixture arising from the Dirichlet process (DP). We show that a Gibbs sampling algorithm for DP mixtures approaches a hard clustering algorithm in the limit, and further that the resulting algorithm monotonically minimizes an elegant underlying k-means-like objective that includes a penalty term based on the number of clusters. We generalize our analysis to the case of clustering multiple related data sets through a similar asymptotic argument with the hierarchical Dirichlet process. We discuss additional extensions that further highlight the benefits of our analysis: i) a spectral relaxation involving thresholded eigenvectors, and ii) a normalized cut graph clustering algorithm that requires O(|E|) time per iteration and automatically determines the number of clusters in a graph.

View on arXiv

Comments on this paper