
Fully Scalable MPC Algorithms for Clustering in High Dimension

Abstract

We design new algorithms for k-clustering in high-dimensional Euclidean spaces. These algorithms run in the Massively Parallel Computation (MPC) model and are fully scalable, meaning that the local memory in each machine is n^σ for arbitrarily small fixed σ > 0. Importantly, the local memory may be substantially smaller than k. Our algorithms take O(1) rounds and achieve O(1)-bicriteria approximation for k-Median and for k-Means, namely, they compute (1+ε)k clusters of cost within an O(1/ε^2)-factor of the optimum. Previous work achieves only poly(log n)-bicriteria approximation [Bhaskara et al., ICML'18], or handles a special case [Cohen-Addad et al., ICML'22]. Our results rely on an MPC algorithm for O(1)-approximation of Facility Location in O(1) rounds. A primary technical tool that we develop, which may be of independent interest, is a new MPC primitive for geometric aggregation, namely, computing certain statistics on an approximate neighborhood of every data point, which includes range counting and nearest-neighbor search. Our implementation of this primitive works in high dimension and is based on consistent hashing (aka sparse partition), a technique that was recently used for streaming algorithms [Czumaj et al., FOCS'22].
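To illustrate the flavor of a space-partitioning hash used for geometric aggregation, here is a minimal sketch of a randomly shifted grid in Python. This is a simplified stand-in, not the paper's consistent-hashing construction: the function names (`make_shifted_grid_hash`, `part_id`) and the parameters are hypothetical, and a real consistent hashing scheme gives stronger guarantees (bounded diameter per part while every small ball intersects few parts) than a plain shifted grid does in high dimension.

```python
import random

def make_shifted_grid_hash(cell_width, dim, seed=0):
    """Return a hash mapping points in R^dim to grid-cell ids.

    A randomly shifted axis-aligned grid: each part is a cube of side
    cell_width, so nearby points tend to share a part, which lets an MPC
    machine aggregate statistics (counts, sums) locally per part.
    """
    rng = random.Random(seed)
    # One random shift per coordinate, drawn once so all machines agree.
    shift = [rng.uniform(0, cell_width) for _ in range(dim)]

    def part_id(point):
        # The cell id is the tuple of floored, shifted coordinates.
        return tuple(int((point[i] + shift[i]) // cell_width)
                     for i in range(dim))

    return part_id

# Usage sketch: points hashed to the same part can be summarized together.
h = make_shifted_grid_hash(cell_width=1.0, dim=3, seed=42)
pid = h([0.2, 1.7, 3.1])
```

Sharing the seed across machines is what makes the partition "consistent": every machine computes the same part id for the same point without communication, so per-part aggregates can be merged in O(1) MPC rounds.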
