arXiv:2202.08173

Distributed k-Means with Outliers in General Metrics

16 February 2022
Enrico Dandolo, A. Pietracaprina, G. Pucci
Abstract

Center-based clustering is a pivotal primitive for unsupervised learning and data analysis. A popular variant is undoubtedly the k-means problem, which, given a set P of points from a metric space and a parameter k < |P|, requires determining a subset S of k centers minimizing the sum of all squared distances of points in P from their closest center. A more general formulation, known as k-means with z outliers, introduced to deal with noisy datasets, features a further parameter z and allows up to z points of P (outliers) to be disregarded when computing the aforementioned sum. We present a distributed coreset-based 3-round approximation algorithm for k-means with z outliers for general metric spaces, using MapReduce as a computational model. Our distributed algorithm requires sublinear local memory per reducer, and yields a solution whose approximation ratio is an additive term O(γ) away from the one achievable by the best known sequential (possibly bicriteria) algorithm, where γ can be made arbitrarily small. An important feature of our algorithm is that it obliviously adapts to the intrinsic complexity of the dataset, captured by the doubling dimension D of the metric space. To the best of our knowledge, no previous distributed approaches were able to attain similar quality-performance tradeoffs for general metrics.
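
To make the objective concrete, below is a minimal sketch (not the paper's algorithm) of the k-means with z outliers cost described above: each point is charged the squared distance to its closest center, and the z largest charges are discarded before summing. The function name, the Euclidean distance, and the toy data are illustrative assumptions; the problem itself is defined for an arbitrary metric.

```python
def kmeans_cost_with_outliers(P, S, z):
    """Cost of k-means with z outliers: sum of squared distances of the
    points in P to their closest center in S, after discarding the z
    points with the largest squared distance (the outliers).

    P and S are sequences of points (tuples of coordinates). Euclidean
    distance is used here purely for illustration; the objective is
    defined for any metric space.
    """
    def sq_dist(p, q):
        return sum((a - b) ** 2 for a, b in zip(p, q))

    # Squared distance of each point to its closest center.
    costs = [min(sq_dist(p, c) for c in S) for p in P]
    # Drop the z largest contributions (the outliers) and sum the rest.
    costs.sort()
    return sum(costs[: max(len(costs) - z, 0)])


# Tiny usage example: two tight clusters plus one far-away noise point.
P = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (100, 100)]
S = [(0, 0), (10, 10)]
print(kmeans_cost_with_outliers(P, S, z=0))  # noise point dominates the cost
print(kmeans_cost_with_outliers(P, S, z=1))  # noise point is disregarded
```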
