Greedy bi-criteria approximations for $k$-medians and $k$-means

Abstract

This paper investigates the following natural greedy procedure for clustering in the bi-criterion setting: iteratively grow a set of centers, in each round adding the center from a candidate set that maximally decreases the clustering cost. For $k$-medians and $k$-means, the key results are as follows.

- When the method considers all data points as candidate centers, selecting $\mathcal{O}(k\log(1/\varepsilon))$ centers achieves cost at most $2+\varepsilon$ times the optimal cost with $k$ centers.
- Alternatively, the same guarantees hold if each round samples $\mathcal{O}(k/\varepsilon^5)$ candidate centers proportionally to their cluster cost (as with \texttt{kmeans++}, but holding the current centers fixed).
- For $k$-means, considering an augmented set of $n^{\lceil 1/\varepsilon\rceil}$ candidate centers gives a $1+\varepsilon$ approximation with $\mathcal{O}(k\log(1/\varepsilon))$ centers, the entire algorithm taking $\mathcal{O}(dk\log(1/\varepsilon)\,n^{1+\lceil 1/\varepsilon\rceil})$ time, where $n$ is the number of data points in $\mathbb{R}^d$.
- For Euclidean $k$-medians, generating a candidate set via $n^{\mathcal{O}(1/\varepsilon^2)}$ executions of stochastic gradient descent with adaptively determined constraint sets once again gives a $1+\varepsilon$ approximation with $\mathcal{O}(k\log(1/\varepsilon))$ centers, in $dk\log(1/\varepsilon)\,n^{\mathcal{O}(1/\varepsilon^2)}$ time.

Ancillary results include: guarantees for cluster costs based on powers of metrics; a brief, favorable empirical evaluation against \texttt{kmeans++}; and data-dependent bounds allowing $1+\varepsilon$ in the first two bullets above, for example with $k$-medians over finite metric spaces.
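The greedy procedure described in the first bullet can be sketched as follows: a minimal $k$-means illustration in which every data point is a candidate center and each round keeps the candidate that most reduces the cost. This is an illustrative sketch, not the paper's implementation; the function names and the brute-force candidate scan are ours.

```python
import numpy as np

def kmeans_cost(X, centers):
    # k-means cost: sum of squared distances from each point to its nearest center
    d = ((X[:, None, :] - np.asarray(centers)[None, :, :]) ** 2).sum(axis=-1)
    return d.min(axis=1).sum()

def greedy_bicriteria(X, num_centers):
    """Greedily grow a center set: in each round, add the data point
    (candidate center) that maximally decreases the clustering cost."""
    centers = []
    current_cost = np.inf
    for _ in range(num_centers):
        best_idx, best_cost = None, np.inf
        for i in range(len(X)):
            c = kmeans_cost(X, centers + [X[i]])
            if c < best_cost:
                best_idx, best_cost = i, c
        centers.append(X[best_idx])
        current_cost = best_cost
    return np.array(centers), current_cost
```

With the bi-criteria guarantee above, running this for $\mathcal{O}(k\log(1/\varepsilon))$ rounds instead of $k$ yields cost at most $2+\varepsilon$ times the optimal cost with $k$ centers.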
