GIST: Greedy Independent Set Thresholding for Diverse Data Summarization

We propose a novel subset selection task called min-distance diverse data summarization (MDDS), which has a wide variety of applications in machine learning, e.g., data sampling and feature selection. Given a set of points in a metric space, the goal is to maximize an objective that combines the total utility of the points and a diversity term that captures the minimum distance between any pair of selected points, subject to the cardinality constraint |S| ≤ k on the selected subset S. For example, the points may correspond to training examples in a data sampling problem, e.g., learned embeddings of images extracted from a deep neural network. This work presents the GIST algorithm, which achieves a 2/3-approximation guarantee for MDDS by approximating a series of maximum independent set problems with a bicriteria greedy algorithm. We also prove a complementary (2/3 + ε)-hardness of approximation, for any ε > 0. Finally, we provide an empirical study that demonstrates that GIST outperforms existing methods for MDDS on synthetic data, and also in a real-world image classification experiment that studies single-shot subset selection for ImageNet.
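To make the thresholding idea in the abstract concrete, here is a minimal Python sketch. It is not the paper's exact procedure or objective. It assumes Euclidean points, an objective equal to the sum of utilities plus `lam` times the minimum pairwise distance, and a sweep over all pairwise distances as candidate thresholds. The function names, the weight `lam`, and the convention that the diversity of a set with fewer than two points is 0 are all hypothetical choices made for illustration.

```python
import itertools
import math


def min_pairwise_distance(points, subset):
    """Smallest distance between any two selected points (assumed 0.0 if fewer than two)."""
    if len(subset) < 2:
        return 0.0
    return min(
        math.dist(points[i], points[j])
        for i, j in itertools.combinations(subset, 2)
    )


def objective(points, utility, subset, lam):
    """Assumed objective: total utility plus lam times the minimum pairwise distance."""
    total_utility = sum(utility[i] for i in subset)
    return total_utility + lam * min_pairwise_distance(points, subset)


def greedy_independent_set(points, utility, k, d):
    """Scan points by decreasing utility; keep a point only if it lies at
    distance >= d from every point kept so far. Stop after k points."""
    order = sorted(range(len(points)), key=lambda i: utility[i], reverse=True)
    chosen = []
    for i in order:
        if len(chosen) == k:
            break
        if all(math.dist(points[i], points[j]) >= d for j in chosen):
            chosen.append(i)
    return chosen


def threshold_sweep(points, utility, k, lam):
    """Run the greedy pass for each candidate threshold and keep the best subset."""
    thresholds = {0.0} | {
        math.dist(p, q) for p, q in itertools.combinations(points, 2)
    }
    best_subset, best_value = [], float("-inf")
    for d in sorted(thresholds):
        candidate = greedy_independent_set(points, utility, k, d)
        value = objective(points, utility, candidate, lam)
        if value > best_value:
            best_subset, best_value = candidate, value
    return best_subset, best_value


if __name__ == "__main__":
    # Two near-duplicate high-utility points plus two distant lower-utility points.
    points = [(0.0, 0.0), (0.1, 0.0), (5.0, 0.0), (5.0, 5.0)]
    utility = [1.0, 0.9, 0.5, 0.4]
    print(threshold_sweep(points, utility, k=2, lam=1.0))
```

With threshold 0 the greedy pass reduces to picking the top-k points by utility, which here selects the two near-duplicates (indices 0 and 1). Larger thresholds force the selected points apart, and in this toy instance the sweep settles on indices 0 and 3. The sweep is quadratic in the number of points and is meant only to show the mechanism.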
@article{fahrbach2025_2405.18754,
  title={GIST: Greedy Independent Set Thresholding for Diverse Data Summarization},
  author={Matthew Fahrbach and Srikumar Ramalingam and Morteza Zadimoghaddam and Sara Ahmadian and Gui Citovsky and Giulia DeSalvo},
  journal={arXiv preprint arXiv:2405.18754},
  year={2025}
}