GIST: Greedy Independent Set Thresholding for Diverse Data Summarization

Abstract

We propose a novel subset selection task called min-distance diverse data summarization ($\textsf{MDDS}$), which has a wide variety of applications in machine learning, e.g., data sampling and feature selection. Given a set of points in a metric space, the goal is to maximize an objective that combines the total utility of the points and a diversity term that captures the minimum distance between any pair of selected points, subject to the constraint $|S| \le k$. For example, the points may correspond to training examples in a data sampling problem, e.g., learned embeddings of images extracted from a deep neural network. This work presents the $\texttt{GIST}$ algorithm, which achieves a $\frac{2}{3}$-approximation guarantee for $\textsf{MDDS}$ by approximating a series of maximum independent set problems with a bicriteria greedy algorithm. We also prove a complementary $(\frac{2}{3}+\varepsilon)$-hardness of approximation, for any $\varepsilon > 0$. Finally, we provide an empirical study demonstrating that $\texttt{GIST}$ outperforms existing methods for $\textsf{MDDS}$ on synthetic data, and also in a real-world image classification experiment that studies single-shot subset selection for ImageNet.
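
The thresholding idea described above can be illustrated with a minimal sketch. It is not the authors' implementation and carries no approximation guarantee as written. It assumes Euclidean points, an additive utility given by per-point weights, an objective of the form utility plus `lam` times the minimum pairwise distance, and the convention that a subset with fewer than two points has diversity 0. The function names, the `lam` trade-off parameter, and the choice to sweep over all pairwise distances as thresholds are hypothetical choices made for illustration.

```python
import itertools
import math


def min_distance(points, subset):
    """Minimum pairwise distance among the selected indices.

    Assumption: subsets with fewer than two points get diversity 0.0.
    """
    if len(subset) < 2:
        return 0.0
    return min(
        math.dist(points[i], points[j])
        for i, j in itertools.combinations(subset, 2)
    )


def objective(points, weights, subset, lam):
    """Assumed objective: total weight plus lam times the minimum distance."""
    utility = sum(weights[i] for i in subset)
    return utility + lam * min_distance(points, subset)


def greedy_independent_set(points, weights, k, d):
    """Scan points by decreasing weight and keep a point only if it is
    at distance >= d from every point kept so far (at most k points)."""
    order = sorted(range(len(points)), key=lambda i: -weights[i])
    chosen = []
    for i in order:
        if len(chosen) == k:
            break
        if all(math.dist(points[i], points[j]) >= d for j in chosen):
            chosen.append(i)
    return chosen


def threshold_sweep(points, weights, k, lam):
    """Try each candidate distance threshold and keep the best subset found."""
    thresholds = {0.0} | {
        math.dist(p, q) for p, q in itertools.combinations(points, 2)
    }
    best, best_value = [], float("-inf")
    for d in sorted(thresholds):
        candidate = greedy_independent_set(points, weights, k, d)
        value = objective(points, weights, candidate, lam)
        if value > best_value:
            best, best_value = candidate, value
    return best


# Hypothetical toy data: two nearby high-weight points and two distant ones.
points = [(0.0, 0.0), (0.1, 0.0), (5.0, 0.0), (5.0, 5.0)]
weights = [1.0, 0.9, 0.5, 0.4]
selected = threshold_sweep(points, weights, k=2, lam=1.0)
```

Each threshold `d` defines a graph whose edges join points closer than `d`, so a valid selection at that threshold is an independent set in that graph. With `lam=0.0` the sweep reduces to picking the heaviest points, while larger values of `lam` favor well-separated subsets.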

@article{fahrbach2025_2405.18754,
  title={GIST: Greedy Independent Set Thresholding for Diverse Data Summarization},
  author={Matthew Fahrbach and Srikumar Ramalingam and Morteza Zadimoghaddam and Sara Ahmadian and Gui Citovsky and Giulia DeSalvo},
  journal={arXiv preprint arXiv:2405.18754},
  year={2025}
}