Sparse Logit Sampling: Accelerating Knowledge Distillation in LLMs

21 March 2025
Anshumann
Mohd Abbas Zaidi
Akhil Kedia
Jinwoo Ahn
Taehwak Kwon
Kangwook Lee
Haejun Lee
Joohyung Lee
Abstract

Knowledge distillation can be a cost-effective technique for distilling knowledge in Large Language Models, provided the teacher's output logits can be pre-computed and cached. However, successfully applying this to pre-training remains largely unexplored. In this work, we prove that naive approaches to sparse knowledge distillation, such as caching Top-K probabilities, while intuitive, provide biased estimates of the teacher's probability distribution to the student, resulting in suboptimal performance and calibration. We propose an importance-sampling-based method, 'Random Sampling Knowledge Distillation', which provides unbiased estimates, preserves the gradient in expectation, and requires storing significantly sparser logits. Our method enables faster training of student models with marginal overhead (<10%) compared to cross-entropy-based training, while maintaining performance competitive with full distillation, across a range of model sizes from 300M to 3B.
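
To make the contrast concrete: caching only the Top-K teacher probabilities truncates the distribution and biases the student's loss, whereas randomly sampling token indices in proportion to the teacher's probabilities yields an unbiased Monte Carlo estimate of the distillation loss, so the student gradient matches full distillation in expectation. The sketch below illustrates this sampling idea under stated assumptions; it is not the authors' implementation, and the function names (sample_sparse_teacher, sparse_kd_loss), the with-replacement sampling, and the PyTorch framing are assumptions made for illustration.

import torch
import torch.nn.functional as F

def sample_sparse_teacher(teacher_logits, k):
    # Sample k vocabulary indices per position from the teacher's softmax
    # (with replacement). Only these indices need to be cached, rather than
    # the full vocabulary-sized logit vector.
    probs = F.softmax(teacher_logits, dim=-1)                           # [B, T, V]
    idx = torch.multinomial(probs.flatten(0, 1), k, replacement=True)   # [B*T, k]
    return idx.view(*probs.shape[:-1], k)                               # [B, T, k]

def sparse_kd_loss(student_logits, sampled_idx):
    # Monte Carlo estimate of E_{y ~ p_teacher}[-log q_student(y)]:
    # average the student's negative log-probability over the sampled draws.
    # Because the draws come from the teacher distribution itself, the
    # estimate (and its gradient w.r.t. the student) is unbiased.
    log_q = F.log_softmax(student_logits, dim=-1)   # [B, T, V]
    picked = torch.gather(log_q, -1, sampled_idx)   # [B, T, k]
    return -picked.mean()

In this sketch, sampling with replacement keeps each draw's weight uniform at 1/k, so no explicit importance weights are needed; a Top-K cache without reweighting discards the tail mass, which is exactly the source of the bias the abstract describes.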

@article{anshumann2025_2503.16870,
  title={Sparse Logit Sampling: Accelerating Knowledge Distillation in LLMs},
  author={Anshumann and Mohd Abbas Zaidi and Akhil Kedia and Jinwoo Ahn and Taehwak Kwon and Kangwook Lee and Haejun Lee and Joohyung Lee},
  journal={arXiv preprint arXiv:2503.16870},
  year={2025}
}