Speculative Sampling via Exponential Races

Abstract

Speculative decoding accelerates large language model inference using a smaller draft model. In this paper, we establish a surprising connection between speculative decoding and channel simulation, which aims at simulating a noisy channel using as few bits as possible. This connection allows us to provide an information-theoretic analysis of the speed-up that can be achieved by speculative decoding. Leveraging this link, we derive an explicit relation between generation speed-up and the number of tokens k generated by the draft model for large k, which serves as an upper bound for all k. We also propose a novel speculative decoding method via exponential races (ERSD) that matches state-of-the-art performance.
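The "exponential race" in the title refers to a classical sampling construction: if each outcome i is assigned an arrival time E_i / p_i with E_i drawn i.i.d. from Exp(1), then the index of the earliest arrival is distributed exactly according to p. The sketch below illustrates this background trick only; it is not the paper's ERSD algorithm, and the function name is illustrative.

```python
import math
import random
from collections import Counter

def exponential_race_sample(probs):
    """Sample an index from a categorical distribution via an exponential race.

    Each outcome i gets an arrival time E_i / p_i with E_i ~ Exp(1);
    the winner (earliest arrival) is distributed according to probs.
    """
    arrivals = [
        random.expovariate(1.0) / p if p > 0 else math.inf
        for p in probs
    ]
    return min(range(len(probs)), key=arrivals.__getitem__)

# Empirical check: winning frequencies should match the target distribution.
random.seed(0)
probs = [0.5, 0.3, 0.2]
counts = Counter(exponential_race_sample(probs) for _ in range(100_000))
freqs = [counts[i] / 100_000 for i in range(len(probs))]
```

A key property exploited in race-based schemes is that the same exponential variates can be shared across two distributions (e.g. draft and target), coupling their samples.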

@article{kobus2025_2504.15475,
  title={Speculative Sampling via Exponential Races},
  author={Szymon Kobus and Deniz Gündüz},
  journal={arXiv preprint arXiv:2504.15475},
  year={2025}
}