
Fully Scalable MPC Algorithms for Clustering in High Dimension

Abstract

We design new algorithms for k-clustering in high-dimensional Euclidean spaces. These algorithms run in the Massively Parallel Computation (MPC) model and are fully scalable, meaning that the local memory in each machine is n^σ for arbitrarily small fixed σ > 0. Importantly, the local memory may be substantially smaller than k. Our algorithms take O(1) rounds and achieve O(1)-bicriteria approximation for k-Median and for k-Means, namely, they compute (1+ε)k clusters of cost within an O(1/ε^2)-factor of the optimum. Previous work achieves only poly(log n)-bicriteria approximation [Bhaskara et al., ICML'18], or handles a special case [Cohen-Addad et al., ICML'22]. Our results rely on an MPC algorithm for O(1)-approximation of Facility Location in O(1) rounds. A primary technical tool that we develop, which may be of independent interest, is a new MPC primitive for geometric aggregation, namely, computing certain statistics on an approximate neighborhood of every data point, which includes range counting and nearest-neighbor search. Our implementation of this primitive works in high dimension and is based on consistent hashing (aka sparse partition), a technique that was recently used for streaming algorithms [Czumaj et al., FOCS'22].
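To illustrate the flavor of a space-partitioning hash used for geometric aggregation, here is a minimal sketch of a randomly shifted grid in Python. This is a simplified stand-in, not the paper's consistent-hashing construction: the function names (`make_shifted_grid_hash`, `part_id`) and the parameters are hypothetical, and a real consistent hashing scheme gives stronger guarantees (bounded diameter per part while every small ball intersects few parts) than a plain shifted grid does in high dimension.

```python
import random

def make_shifted_grid_hash(cell_width, dim, seed=0):
    """Return a hash mapping points in R^dim to grid-cell ids.

    A randomly shifted axis-aligned grid: each part is a cube of side
    cell_width, so nearby points tend to share a part, which lets an MPC
    machine aggregate statistics (counts, sums) locally per part.
    """
    rng = random.Random(seed)
    # One random shift per coordinate, drawn once so all machines agree.
    shift = [rng.uniform(0, cell_width) for _ in range(dim)]

    def part_id(point):
        # The cell id is the tuple of floored, shifted coordinates.
        return tuple(int((point[i] + shift[i]) // cell_width)
                     for i in range(dim))

    return part_id

# Usage sketch: points hashed to the same part can be summarized together.
h = make_shifted_grid_hash(cell_width=1.0, dim=3, seed=42)
pid = h([0.2, 1.7, 3.1])
```

Sharing the seed across machines is what makes the partition "consistent": every machine computes the same part id for the same point without communication, so per-part aggregates can be merged in O(1) MPC rounds.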
