Simple, Scalable and Effective Clustering via One-Dimensional Projections

25 October 2023

Abstract

Clustering is a fundamental problem in unsupervised machine learning with many applications in data analysis. Popular clustering algorithms such as Lloyd's algorithm and $k$-means++ can take $\Omega(ndk)$ time when clustering $n$ points in a $d$-dimensional space (represented by an $n\times d$ matrix $X$) into $k$ clusters. In applications with moderate to large $k$, the multiplicative $k$ factor can become very expensive. We introduce a simple randomized clustering algorithm that provably runs in expected time $O(\mathrm{nnz}(X) + n\log n)$ for arbitrary $k$. Here $\mathrm{nnz}(X)$ is the total number of non-zero entries in the input dataset $X$, which is upper bounded by $nd$ and can be significantly smaller for sparse datasets. We prove that our algorithm achieves approximation ratio $\widetilde{O}(k^4)$ on any input dataset for the $k$-means objective. We also believe that our theoretical analysis is of independent interest, as we show that the approximation ratio of a $k$-means algorithm is approximately preserved under a class of projections and that $k$-means++ seeding can be implemented in expected $O(n \log n)$ time in one dimension. Finally, we show experimentally that our clustering algorithm gives a new tradeoff between running time and cluster quality compared to previous state-of-the-art methods for these tasks.
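
As a rough illustration of the idea of seeding through a one-dimensional projection, the sketch below projects the rows of $X$ onto a single random Gaussian direction and then runs textbook $k$-means++ ($D^2$) seeding on the projected values. It is a minimal sketch under stated assumptions, not the paper's algorithm: the choice of a Gaussian direction, the helper names `project_1d`, `kmeanspp_seed_1d` and `kmeans_cost`, and the naive $O(nk)$ seeding loop are all illustrative, and the loop does not reproduce the expected $O(n \log n)$ one-dimensional seeding procedure that the abstract refers to.

```python
import numpy as np


def project_1d(X, rng):
    """Project each row of X onto one random Gaussian direction.

    Assumption: a Gaussian direction is used purely for illustration.
    """
    g = rng.standard_normal(X.shape[1])
    return X @ g


def kmeanspp_seed_1d(y, k, rng):
    """Naive k-means++ (D^2) seeding on 1D values y.

    Runs in O(n k) time; it is NOT the O(n log n) method from the abstract.
    Returns the indices of the chosen seed points.
    """
    n = y.shape[0]
    first = int(rng.integers(n))
    chosen = [first]
    # Squared distance from every point to its nearest chosen seed.
    d2 = (y - y[first]) ** 2
    for _ in range(1, k):
        total = d2.sum()
        if total == 0.0:
            # Every point coincides with a seed: fall back to uniform sampling.
            idx = int(rng.integers(n))
        else:
            # Sample proportionally to the squared distance (D^2 sampling).
            idx = int(rng.choice(n, p=d2 / total))
        chosen.append(idx)
        d2 = np.minimum(d2, (y - y[idx]) ** 2)
    return chosen


def kmeans_cost(X, centers):
    """k-means objective: sum of squared distances to the nearest center."""
    d2 = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
    return float(d2.min(axis=1).sum())
```

A typical use would be `idx = kmeanspp_seed_1d(project_1d(X, rng), k, rng)` followed by `kmeans_cost(X, X[idx])`, which evaluates the seeds chosen in one dimension against the $k$-means objective in the original $d$-dimensional space.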
