Towards Optimal Lower Bounds for k-median and k-means Coresets

Abstract

Given a set of points in a metric space, the $(k,z)$-clustering problem consists of finding a set of $k$ points, called centers, such that the sum over all data points of the distance to the closest center, raised to the power $z$, is minimized. Special cases include the famous $k$-median problem ($z = 1$) and $k$-means problem ($z = 2$). The $k$-median and $k$-means problems are at the heart of modern data analysis, and massive data applications have given rise to the notion of a coreset: a small (weighted) subset of the input point set that preserves the cost of any solution to the problem up to a multiplicative $(1 \pm \varepsilon)$ factor, hence reducing the input to the problem from large to small scale. In this paper, we present improved lower bounds for coresets in various metric spaces. In finite metrics consisting of $n$ points and in doubling metrics with doubling constant $D$, we show that any coreset for $(k,z)$-clustering must consist of at least $\Omega(k \varepsilon^{-2} \log n)$ and $\Omega(k \varepsilon^{-2} D)$ points, respectively. Both bounds match previous upper bounds up to polylogarithmic factors. In Euclidean spaces, we show that any coreset for $(k,z)$-clustering must consist of at least $\Omega(k\varepsilon^{-2})$ points. We complement these lower bounds with a coreset construction consisting of at most $\tilde{O}(k\varepsilon^{-2}\cdot \min(\varepsilon^{-z},k))$ points.
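To make the definitions concrete, the following is a minimal Python sketch of the $(k,z)$-clustering cost and of a weighted subset used as a coreset candidate. This is purely illustrative: the `uniform_coreset` helper below is a naive uniform-sampling baseline (with weights $n/m$), not the importance-sampling construction analyzed in the paper, and the function names are invented for this example.

```python
import math
import random

def cost(points, centers, z, weights=None):
    """(k,z)-clustering cost: sum of w(p) * dist(p, nearest center)^z.
    With unit weights this is the cost of the original point set;
    with coreset weights it is the cost estimated from the subset."""
    if weights is None:
        weights = [1.0] * len(points)
    total = 0.0
    for p, w in zip(points, weights):
        d = min(math.dist(p, c) for c in centers)
        total += w * d ** z
    return total

def uniform_coreset(points, m, seed=0):
    """Naive baseline: sample m points uniformly, weight each by n/m.
    (Real coresets, as in the paper, use importance sampling to get
    a (1 +/- eps) guarantee for *every* set of k centers.)"""
    rng = random.Random(seed)
    sample = rng.sample(points, m)
    n = len(points)
    return sample, [n / m] * m
```

A coreset is judged by the relative error `|cost(coreset) - cost(full set)| / cost(full set)` holding below $\varepsilon$ simultaneously for all candidate center sets, which is what makes small coreset sizes hard to achieve and lower bounds meaningful.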
