Simple, Scalable and Effective Clustering via One-Dimensional Projections

Abstract

Clustering is a fundamental problem in unsupervised machine learning with many applications in data analysis. Popular clustering algorithms such as Lloyd's algorithm and $k$-means++ can take $\Omega(ndk)$ time when clustering $n$ points in a $d$-dimensional space (represented by an $n \times d$ matrix $X$) into $k$ clusters. In applications with moderate to large $k$, the multiplicative $k$ factor can become very expensive. We introduce a simple randomized clustering algorithm that provably runs in expected time $O(\mathrm{nnz}(X) + n\log n)$ for arbitrary $k$. Here $\mathrm{nnz}(X)$ is the total number of non-zero entries in the input dataset $X$, which is upper bounded by $nd$ and can be significantly smaller for sparse datasets. We prove that our algorithm achieves approximation ratio $\widetilde{O}(k^4)$ on any input dataset for the $k$-means objective. We also believe that our theoretical analysis is of independent interest, as we show that the approximation ratio of a $k$-means algorithm is approximately preserved under a class of projections and that $k$-means++ seeding can be implemented in expected $O(n \log n)$ time in one dimension. Finally, we show experimentally that our clustering algorithm gives a new tradeoff between running time and cluster quality compared to previous state-of-the-art methods for these tasks.
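
The sketch below is a hypothetical illustration of the high-level idea in the abstract, not the authors' implementation: project the data onto a single random direction and run $k$-means++ ($D^2$) seeding on the resulting one-dimensional values. For simplicity it uses the standard $O(nk)$ seeding loop rather than the paper's $O(n \log n)$ one-dimensional procedure; the function name and parameters are assumptions made for this example.

```python
import numpy as np

def one_dimensional_projection_seeding(X, k, seed=None):
    """Illustrative sketch: pick k seed points by projecting X (n x d)
    onto one random direction and running D^2-sampling in 1D.

    Note: this reference loop is O(nk); the paper describes a faster
    O(n log n) one-dimensional seeding procedure.
    """
    rng = np.random.default_rng(seed)
    n, d = X.shape

    # Project every point onto a single random Gaussian direction.
    g = rng.standard_normal(d)
    proj = X @ g                          # shape (n,)

    # Standard k-means++ (D^2) seeding, with distances measured in 1D.
    centers = [int(rng.integers(n))]      # first center chosen uniformly
    dist2 = (proj - proj[centers[0]]) ** 2
    for _ in range(k - 1):
        probs = dist2 / dist2.sum()
        c = int(rng.choice(n, p=probs))   # sample proportional to squared distance
        centers.append(c)
        dist2 = np.minimum(dist2, (proj - proj[c]) ** 2)

    return X[centers]                     # seeds returned in the original d-dim space

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    X = rng.standard_normal((1000, 20))
    seeds = one_dimensional_projection_seeding(X, k=10, seed=1)
    print(seeds.shape)                    # (10, 20)
```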
