CONCORD: Concept-Informed Diffusion for Dataset Distillation

Abstract

Dataset distillation (DD) has witnessed significant progress in creating small datasets that encapsulate rich information from large original ones. In particular, methods based on generative priors show promising performance while maintaining computational efficiency and cross-architecture generalization. However, the generation process lacks explicit controllability for each sample. Previous distillation methods primarily match the real distribution at the level of the entire dataset, while overlooking concept completeness at the instance level. Missing or incorrectly represented object details cannot be efficiently compensated for, given the limited sample budget typical of DD settings. To this end, we propose incorporating the concept understanding of large language models (LLMs) to perform Concept-Informed Diffusion (CONCORD) for dataset distillation. Specifically, distinguishable and fine-grained concepts are retrieved based on category labels to inform the denoising process and refine essential object details. By integrating these concepts, the proposed method significantly enhances both the controllability and interpretability of distilled image generation, without relying on pre-trained classifiers. We demonstrate the efficacy of CONCORD by achieving state-of-the-art performance on ImageNet-1K and its subsets. The code implementation is released at this https URL.
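
The abstract only names the mechanism (LLM-retrieved concepts steering the denoising process) without giving its formulation. The following is a minimal sketch of one plausible reading, assuming a generic denoiser interface; retrieve_concepts, concept_score, the Euler-style update, and all hyperparameters are hypothetical placeholders, not the paper's actual method.

import torch

# Illustrative sketch only: the abstract does not specify the guidance
# formulation. All names here are hypothetical stand-ins.

def retrieve_concepts(label):
    """Stand-in for the LLM query that returns distinguishable,
    fine-grained concepts for a category label, e.g.
    "golden retriever" -> ["floppy ears", "golden coat", ...]."""
    return [f"{label}: fine-grained attribute {i}" for i in range(3)]

@torch.no_grad()
def concept_informed_sample(denoiser, concept_score, label,
                            steps=50, shape=(1, 3, 64, 64), guidance=1.0):
    """Toy reverse-diffusion loop with an added concept-guidance term.

    denoiser(x, t)      -> predicted noise for sample x at step t
    concept_score(x, c) -> gradient-like signal pushing x toward
                           exhibiting the concept list c
    """
    concepts = retrieve_concepts(label)
    x = torch.randn(shape)
    for t in reversed(range(steps)):
        eps = denoiser(x, t)            # standard noise prediction
        g = concept_score(x, concepts)  # concept-informed correction
        eps = eps - guidance * g        # steer the step toward the concepts
        x = x - eps / steps             # toy Euler update; a real sampler
                                        # would follow a noise schedule
    return x

The point of the sketch is that the concept-derived correction enters every denoising step, so per-sample object details can be refined at generation time, playing the role a classifier gradient would play in classifier guidance, which is consistent with the abstract's claim of not relying on pre-trained classifiers.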

@article{gu2025_2505.18358,
  title={CONCORD: Concept-Informed Diffusion for Dataset Distillation},
  author={Jianyang Gu and Haonan Wang and Ruoxi Jia and Saeed Vahidian and Vyacheslav Kungurtsev and Wei Jiang and Yiran Chen},
  journal={arXiv preprint arXiv:2505.18358},
  year={2025}
}