arXiv:2307.07848

Fully Scalable MPC Algorithms for Clustering in High Dimension

15 July 2023
A. Czumaj
Guichen Gao
S. Jiang
Robert Krauthgamer
P. Veselý
Abstract

We design new parallel algorithms for clustering in high-dimensional Euclidean spaces. These algorithms run in the Massively Parallel Computation (MPC) model, and are fully scalable, meaning that the local memory in each machine may be $n^{\sigma}$ for arbitrarily small fixed $\sigma>0$. Importantly, the local memory may be substantially smaller than the number of clusters $k$, yet all our algorithms are fast, i.e., run in $O(1)$ rounds. We first devise a fast MPC algorithm for $O(1)$-approximation of uniform facility location. This is the first fully-scalable MPC algorithm that achieves $O(1)$-approximation for any clustering problem in a general geometric setting; previous algorithms only provide $\mathrm{poly}(\log n)$-approximation or apply to restricted inputs, like low dimension or a small number of clusters $k$; e.g. [Bhaskara and Wijewardena, ICML'18; Cohen-Addad et al., NeurIPS'21; Cohen-Addad et al., ICML'22]. We then build on this facility location result and devise a fast MPC algorithm that achieves $O(1)$-bicriteria approximation for $k$-Median and for $k$-Means, namely, it computes $(1+\varepsilon)k$ clusters of cost within an $O(1/\varepsilon^2)$-factor of the optimum for $k$ clusters. A primary technical tool that we introduce, and that may be of independent interest, is a new MPC primitive for geometric aggregation, namely, computing for every data point a statistic of its approximate neighborhood, for statistics like range counting and nearest-neighbor search. Our implementation of this primitive works in high dimension, and is based on consistent hashing (aka sparse partition), a technique that was recently used for streaming algorithms [Czumaj et al., FOCS'22].
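To make "a statistic of its approximate neighborhood" concrete, here is a minimal sequential sketch of approximate range counting. It is an illustration only and not the paper's algorithm: it runs on one machine, and it swaps in a plain axis-aligned grid of side `r` for consistent hashing. The function name `approx_range_counts` and its parameters are hypothetical. Under these assumptions, the value returned for a point `p` counts every point within distance `r` of `p`, and only points within distance `2 * r * sqrt(d)` of `p`.

```python
from collections import Counter
from itertools import product
import math


def approx_range_counts(points, r):
    """Toy approximate range counting (hypothetical helper, not from the paper).

    For each point p, returns the number of input points that fall in p's
    grid cell or in one of the adjacent cells, where cells have side r.
    """
    d = len(points[0])

    def cell_of(p):
        # Index of the grid cell (side r) that contains p.
        return tuple(math.floor(x / r) for x in p)

    # One pass to count how many points land in each cell.
    cell_counts = Counter(cell_of(p) for p in points)

    # All 3^d offsets to the cell itself and its neighbors.
    # This is exponential in d, so the sketch only suits low dimension.
    offsets = list(product((-1, 0, 1), repeat=d))

    counts = []
    for p in points:
        c = cell_of(p)
        total = sum(
            cell_counts.get(tuple(ci + oi for ci, oi in zip(c, o)), 0)
            for o in offsets
        )
        counts.append(total)
    return counts
```

For example, `approx_range_counts([(0.0, 0.0), (0.5, 0.5), (10.0, 10.0)], 1.0)` returns `[2, 2, 1]`. The `3^d` neighbor enumeration is what makes this grid approach impractical as the dimension grows; the abstract states that the paper's primitive works in high dimension by relying on consistent hashing instead.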
