Coresets for Clustering in Euclidean Spaces: Importance Sampling is Nearly Optimal

Abstract

Given a collection of $n$ points in $\mathbb{R}^d$, the goal of the $(k,z)$-clustering problem is to find a subset of $k$ "centers" that minimizes the sum of the $z$-th powers of the Euclidean distances of the points to their closest centers. Special cases of the $(k,z)$-clustering problem include the $k$-median and $k$-means problems. Our main result is a unified two-stage importance sampling framework that constructs an $\varepsilon$-coreset for the $(k,z)$-clustering problem. Compared to the results for $(k,z)$-clustering in [Feldman and Langberg, STOC 2011], our framework saves an $\varepsilon^2 d$ factor in the coreset size. Compared to the results for $(k,z)$-clustering in [Sohler and Woodruff, FOCS 2018], our framework saves a $\operatorname{poly}(k)$ factor in the coreset size and avoids the $\exp(k/\varepsilon)$ term in the construction time. Specifically, our coreset for $k$-median ($z=1$) has size $\tilde{O}(\varepsilon^{-4} k)$, which saves a factor of $k$ compared to the result in [Sohler and Woodruff, FOCS 2018]. Our algorithmic results rely on a new dimensionality reduction technique that connects two well-known shape-fitting problems, subspace approximation and clustering, and may be of independent interest. We also provide a size lower bound of $\Omega\left(k \cdot \min\left\{2^{z/20}, d\right\}\right)$ for a $0.01$-coreset for $(k,z)$-clustering; this lower bound has a linear dependence on $k$ and an exponential dependence on $z$, matching our algorithmic results.
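To make the objective and the general importance-sampling recipe concrete, here is a minimal Python sketch. It is only an illustration under simplifying assumptions, not the paper's two-stage framework: it evaluates the weighted $(k,z)$-clustering cost and builds a coreset by sampling points with probability proportional to a crude importance score (derived from a rough $D^z$-seeded solution) and reweighting by inverse sampling probability. All function and parameter names here are hypothetical.

```python
# Illustrative sketch only (not the paper's algorithm): (i) the weighted
# (k, z)-clustering cost, and (ii) a generic importance-sampling coreset
# where each point is sampled proportionally to a crude importance score
# and reweighted so the coreset cost is an unbiased estimate of the full cost.
import numpy as np

def kz_cost(points, weights, centers, z):
    """Weighted (k, z)-clustering cost: sum_i w_i * min_c ||x_i - c||^z."""
    dists = np.linalg.norm(points[:, None, :] - centers[None, :, :], axis=2)
    return float(np.sum(weights * dists.min(axis=1) ** z))

def rough_centers(points, k, z, rng):
    """D^z-sampling (k-means++-style) seeding: a rough initial solution."""
    centers = [points[rng.integers(len(points))]]
    for _ in range(k - 1):
        d = np.min(np.linalg.norm(
            points[:, None, :] - np.array(centers)[None, :, :], axis=2), axis=1) ** z
        centers.append(points[rng.choice(len(points), p=d / d.sum())])
    return np.array(centers)

def importance_sampling_coreset(points, k, z, m, seed=0):
    """Sample m points with probability ~ crude importance score; reweight."""
    rng = np.random.default_rng(seed)
    centers = rough_centers(points, k, z, rng)
    cost = np.min(np.linalg.norm(
        points[:, None, :] - centers[None, :, :], axis=2), axis=1) ** z
    # Crude score: the point's cost under the rough solution plus a uniform term.
    scores = cost / cost.sum() + 1.0 / len(points)
    probs = scores / scores.sum()
    idx = rng.choice(len(points), size=m, replace=True, p=probs)
    weights = 1.0 / (m * probs[idx])  # inverse-probability reweighting (unbiased)
    return points[idx], weights

if __name__ == "__main__":
    rng = np.random.default_rng(1)
    X = rng.normal(size=(5000, 10))
    C, w = importance_sampling_coreset(X, k=5, z=2, m=400)
    test_centers = rng.normal(size=(5, 10))
    full = kz_cost(X, np.ones(len(X)), test_centers, z=2)
    core = kz_cost(C, w, test_centers, z=2)
    print(f"full cost {full:.1f}  vs  coreset estimate {core:.1f}")
```

The sketch only conveys the shape of importance sampling for clustering coresets; the sampling probabilities and size bounds that yield the guarantees stated above are those developed in the paper.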
