Revisiting k-means: New Algorithms via Bayesian Nonparametrics
One of the many benefits of Bayesian nonparametric processes such as the Dirichlet process is that they can be used for modeling infinite mixture models, thus providing a flexible answer to the question of how many clusters exist in a data set. For the most part, such flexibility is currently lacking in techniques based on hard clustering, such as k-means, graph cuts, and Bregman hard clustering. For finite mixture models, there is a precise connection between k-means and mixtures of Gaussians, obtained by an appropriate limiting argument. In this paper, we apply a similar technique to an infinite mixture arising from the Dirichlet process (DP). We show that a Gibbs sampling algorithm for DP mixtures approaches a hard clustering algorithm in the limit, and further that the resulting algorithm monotonically minimizes an elegant underlying k-means-like objective that includes a penalty term based on the number of clusters. We generalize our analysis to the case of clustering multiple related data sets through a similar asymptotic argument with the hierarchical Dirichlet process. We discuss additional extensions that further highlight the benefits of our analysis: i) a spectral relaxation involving thresholded eigenvectors, and ii) a normalized cut graph clustering algorithm that requires O(|E|) time per iteration and automatically determines the number of clusters in a graph.
View on arXiv