Degrees of Freedom and Model Selection for k-means Clustering

This paper investigates the model degrees of freedom in k-means clustering. An extension of Stein's lemma provides an expression for the effective degrees of freedom in the k-means model. Approximating the degrees of freedom in practice requires simplifications of this expression, however empirical studies evince the appropriateness of our proposed approach. The practical importance of this new degrees of freedom formulation for k-means is demonstrated through model selection using the Bayesian Information Criterion. The reliability of this method is validated through experiments on a large collection of benchmark data sets from diverse application areas. Comparisons with popular existing techniques indicate that this approach is extremely competitive for selecting high quality clustering solutions. An R package implementing the proposed approach is available from https://github.com/DavidHofmeyr/edfkmeans.
View on arXiv