Greedy bi-criteria approximations for $k$-medians and $k$-means

This paper investigates the following natural greedy procedure for clustering in the bi-criterion setting: iteratively grow a set of centers, in each round adding the center from a candidate set that maximally decreases clustering cost. In the case of $k$-medians and $k$-means, the key results are as follows.

- When the method considers all data points as candidate centers, selecting [...] centers achieves cost at most [...] times the optimal cost with $k$ centers.
- Alternatively, the same guarantees hold if each round samples [...] candidate centers proportionally to their cluster cost (as with kmeans++, but holding centers fixed).
- In the case of $k$-means, considering an augmented set of [...] candidate centers gives [...] approximation with [...] centers, the entire algorithm taking [...] time, where $n$ is the number of data points in $\mathbb{R}^d$.
- In the case of Euclidean $k$-medians, generating a candidate set via [...] executions of stochastic gradient descent with adaptively determined constraint sets once again gives [...] approximation with [...] centers in [...] time.

Ancillary results include: guarantees for cluster costs based on powers of metrics; a brief, favorable empirical evaluation against kmeans++; and data-dependent bounds allowing [...] in the first two bullets above, for example with $k$-medians over finite metric spaces.
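The greedy procedure described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the function names (`clustering_cost`, `greedy_centers`), the `power` parameter, and the choice of Euclidean distance are assumptions made for illustration. Setting `power=1` gives a $k$-medians-style cost and `power=2` a $k$-means-style cost. The sketch covers only the variant in which a fixed candidate set (for example, the data points themselves) is scanned in every round. It does not implement the sampling or augmented-candidate variants, and it carries none of the paper's guarantees.

```python
import math


def clustering_cost(points, centers, power=2):
    """Sum over points of (Euclidean distance to nearest center) ** power."""
    return sum(min(math.dist(p, c) for c in centers) ** power for p in points)


def greedy_centers(points, candidates, num_centers, power=2):
    """Grow a center set greedily (illustrative sketch, hypothetical names).

    Each round adds the candidate that yields the lowest total cost,
    i.e. the one that maximally decreases the current clustering cost.
    Assumes `candidates` is non-empty.
    """
    # nearest[i] holds (distance to the closest chosen center) ** power.
    nearest = [math.inf] * len(points)
    chosen = []
    for _ in range(num_centers):
        best_total, best_center, best_nearest = math.inf, None, None
        for c in candidates:
            # Per-point cost if c were added to the current center set.
            trial = [min(old, math.dist(p, c) ** power)
                     for p, old in zip(points, nearest)]
            total = sum(trial)
            if total < best_total:
                best_total, best_center, best_nearest = total, c, trial
        chosen.append(best_center)
        nearest = best_nearest
    return chosen, sum(nearest)


# Two well-separated pairs of points; all data points serve as candidates.
points = [(0.0, 0.0), (0.0, 1.0), (10.0, 10.0), (10.0, 11.0)]
centers, cost = greedy_centers(points, points, num_centers=2, power=2)
```

In the bi-criteria setting, `num_centers` would be chosen larger than the target $k$, trading extra centers for a better cost ratio against the optimal $k$-center solution. Each round of this sketch costs time proportional to the number of points times the number of candidates, since every candidate is evaluated against every point.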