arXiv:1703.02375

Graph sketching-based Space-efficient Data Clustering

7 March 2017
Anne Morvan
K. Choromanski
Cédric Gouy-Pailler
Jamal Atif
Abstract

In this paper, we address the problem of recovering arbitrarily shaped data clusters from datasets under tight space constraints, as is the case in many real-world applications where analysis algorithms are deployed directly on the resource-limited mobile devices that collect the data. We present DBMSTClu, a new space-efficient, density-based, non-parametric method that works on a Minimum Spanning Tree (MST) recovered from a limited number of linear measurements, i.e. a sketched version of the dissimilarity graph $\mathcal{G}$ between the $N$ objects to cluster. Unlike the $k$-means, $k$-medians or $k$-medoids algorithms, it does not fail to distinguish clusters of particular shapes, thanks to the MST's ability to express the underlying structure of a graph. No input parameter is needed, in contrast to DBSCAN or Spectral Clustering. An approximate MST is retrieved by following the dynamic semi-streaming model: the dissimilarity graph $\mathcal{G}$ is handled as a stream of edge weight updates, which is sketched in one pass over the data into a compact structure requiring $O(N \operatorname{polylog}(N))$ space, far better than the theoretical memory cost $O(N^2)$ of $\mathcal{G}$. Taking the recovered approximate MST $\mathcal{T}$ as input, DBMSTClu then successfully detects the right number of nonconvex clusters by performing relevant cuts on $\mathcal{T}$ in time linear in $N$. We provide theoretical guarantees on the quality of the clustering partition and also demonstrate its advantage over the existing state of the art on several datasets.
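
The general idea of clustering by cutting an MST can be illustrated with a minimal Python sketch. It is not DBMSTClu itself, and it rests on several assumptions. The abstract does not describe the cut criterion, so the sketch swaps in a simple largest-gap heuristic (drop every MST edge heavier than the biggest jump in the sorted edge weights). It also computes an exact MST from the full dissimilarity graph, which costs $O(N^2)$ memory, instead of recovering an approximate MST from a sketch. The function names `mst_kruskal` and `cut_at_largest_gap` and the toy points are hypothetical.

```python
import math
from itertools import combinations


def mst_kruskal(points):
    """Exact MST of the complete Euclidean dissimilarity graph (Kruskal).

    Assumption: the full graph is materialized here, unlike the sketched,
    semi-streaming setting described in the abstract.
    """
    n = len(points)
    parent = list(range(n))

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    edges = sorted(
        (math.dist(points[i], points[j]), i, j)
        for i, j in combinations(range(n), 2)
    )
    tree = []
    for w, i, j in edges:
        ri, rj = find(i), find(j)
        if ri != rj:  # edge joins two components, so it belongs to the MST
            parent[ri] = rj
            tree.append((w, i, j))
    return tree


def cut_at_largest_gap(n, tree):
    """Stand-in cut rule (NOT the DBMSTClu criterion).

    Removes every MST edge heavier than the weight just below the largest
    gap in the sorted edge weights, then labels the connected components.
    """
    weights = sorted(w for w, _, _ in tree)
    if len(weights) < 2:
        return [0] * n  # too few edges to define a gap
    _, k = max((weights[k + 1] - weights[k], k) for k in range(len(weights) - 1))
    threshold = weights[k]

    parent = list(range(n))

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for w, i, j in tree:
        if w <= threshold:  # keep only the light edges
            parent[find(i)] = find(j)

    ids = {}
    return [ids.setdefault(find(i), len(ids)) for i in range(n)]


# Toy data: an L-shaped (nonconvex) chain of 7 points and a separate pair.
points = [(0, 0), (1, 0), (2, 0), (3, 0), (3, 1), (3, 2), (3, 3),
          (0, 3), (0, 4)]
tree = mst_kruskal(points)
labels = cut_at_largest_gap(len(points), tree)
print(labels)  # [0, 0, 0, 0, 0, 0, 0, 1, 1]
```

On this toy input the MST contains seven edges of weight 1 and a single edge of weight 3 linking the two groups, so the heuristic removes exactly that edge and the L-shaped chain stays in one cluster. The heuristic always makes at least one cut when the tree has two or more edges, which is one reason it is only a stand-in for a principled cut criterion.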
