Noisy, Greedy and Not So Greedy k-means++

European Symposium on Algorithms (ESA), 2019
2 December 2019
Anup Bhattacharya
Jan Eube
Heiko Röglin
Melanie Schmidt
Abstract

The k-means++ algorithm due to Arthur and Vassilvitskii has become the most popular seeding method for Lloyd's algorithm. It samples the first center uniformly at random from the data set and the other $k-1$ centers iteratively according to $D^2$-sampling, where the probability that a data point becomes the next center is proportional to its squared distance to the closest center chosen so far. k-means++ is known to achieve an approximation factor of $O(\log k)$ in expectation. Already in the original paper on k-means++, Arthur and Vassilvitskii suggested a variation called the greedy k-means++ algorithm, in which in each iteration multiple possible centers are sampled according to $D^2$-sampling and only the one that decreases the objective the most is chosen as a center for that iteration. It was stated as an open question whether this also leads to an $O(\log k)$-approximation (or even better). We show that this is not the case by presenting a family of instances on which greedy k-means++ yields only an $\Omega(\ell \cdot \log k)$-approximation in expectation, where $\ell$ is the number of possible centers that are sampled in each iteration. We also study a variation, which we call the noisy k-means++ algorithm. In this variation only one center is sampled in every iteration, but no longer exactly by $D^2$-sampling. Instead, in each iteration an adversary is allowed to change the probabilities arising from $D^2$-sampling individually for each point by a factor between $1-\epsilon_1$ and $1+\epsilon_2$ for parameters $\epsilon_1 \in [0,1)$ and $\epsilon_2 \ge 0$. We prove that noisy k-means++ computes an $O(\log^2 k)$-approximation in expectation. We also discuss some applications of this result.
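To make the seeding procedures concrete, the following is a minimal sketch of $D^2$-sampling and the greedy variant described in the abstract. With `ell=1` it reduces to standard k-means++; with `ell>1` it samples `ell` candidate centers per iteration and keeps the one that lowers the k-means objective the most. The function name and interface are illustrative assumptions, not the authors' implementation.

```python
import numpy as np

def kmeanspp_seed(X, k, ell=1, rng=None):
    """Sketch of (greedy) k-means++ seeding.

    ell=1: standard k-means++ via pure D^2-sampling.
    ell>1: greedy k-means++ as described by Arthur and Vassilvitskii.
    Illustrative sketch only, not the paper's implementation.
    """
    rng = np.random.default_rng(rng)
    n = len(X)
    # First center: uniform over the data set.
    centers = [X[rng.integers(n)]]
    # Squared distance of each point to its closest chosen center.
    d2 = ((X - centers[0]) ** 2).sum(axis=1)
    for _ in range(k - 1):
        # D^2-sampling: probability proportional to squared distance.
        probs = d2 / d2.sum()
        cand_idx = rng.choice(n, size=ell, p=probs)
        best_cost, best_d2, best_c = None, None, None
        for i in cand_idx:
            # Distances if candidate i were added as a center.
            new_d2 = np.minimum(d2, ((X - X[i]) ** 2).sum(axis=1))
            cost = new_d2.sum()  # k-means objective with this candidate
            if best_cost is None or cost < best_cost:
                best_cost, best_d2, best_c = cost, new_d2, X[i]
        centers.append(best_c)
        d2 = best_d2
    return np.array(centers)
```

The noisy variant studied in the paper would perturb `probs` pointwise by adversarial factors in $[1-\epsilon_1, 1+\epsilon_2]$ (followed by renormalization) before sampling a single candidate.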
