Mamba Drafters for Speculative Decoding

Speculative decoding has emerged as a promising approach to accelerating large language model (LLM) generation using a fast drafter while maintaining alignment with the target model's distribution. However, existing approaches face a trade-off: external drafters offer flexibility but can suffer from slower drafting, while self-speculation methods use drafters tailored to the target model but require retraining. In this paper, we introduce novel drafters based on Mamba, a state-of-the-art state space model (SSM), as a solution that combines the best aspects of both approaches. By leveraging the linear structure of SSMs, our approach avoids the quadratic complexity inherent in traditional Transformer-based methods, enabling faster drafting and lower memory usage while retaining the flexibility to work across different target models. We further enhance efficiency with a novel test-time tree search algorithm for generating high-quality draft candidates. Our empirical evaluation demonstrates that Mamba-based drafters not only outperform existing external drafting methods but are also comparable to state-of-the-art self-speculation approaches, while using less memory and preserving cross-model adaptability.
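To make the draft-then-verify mechanism concrete, the following is a minimal sketch of the standard speculative sampling acceptance rule that any external drafter, Mamba-based or otherwise, plugs into. It is not the paper's implementation: the function and variable names (speculative_step, draft_probs, target_probs) are illustrative, and toy Dirichlet-sampled distributions stand in for actual drafter and target model outputs.

```python
import numpy as np

def speculative_step(draft_probs, target_probs, rng):
    """One draft-then-verify step of standard speculative sampling.

    draft_probs  : 1-D array, drafter distribution q(.) at this position
    target_probs : 1-D array, target model distribution p(.) at this position
    Returns (token_id, accepted_draft).
    """
    # Drafter proposes a token from q.
    x = rng.choice(len(draft_probs), p=draft_probs)
    # Verifier accepts with probability min(1, p(x) / q(x)).
    if rng.random() < min(1.0, target_probs[x] / draft_probs[x]):
        return x, True
    # On rejection, resample from the normalized residual max(0, p - q),
    # which keeps the overall output distribution exactly equal to p.
    residual = np.maximum(target_probs - draft_probs, 0.0)
    residual /= residual.sum()
    return rng.choice(len(residual), p=residual), False

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    vocab = 8
    # Toy distributions standing in for the drafter and target models.
    q = rng.dirichlet(np.ones(vocab))
    p = rng.dirichlet(np.ones(vocab))
    token, accepted = speculative_step(q, p, rng)
    print(f"token={token}, accepted_draft={accepted}")
```

The relevant point for drafter design is that this verification rule preserves the target distribution regardless of which model produced q, so a fast drafter with linear-time, constant-memory state, such as an SSM, can be swapped in across different target models without affecting output quality.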
@article{choi2025_2506.01206,
  title   = {Mamba Drafters for Speculative Decoding},
  author  = {Daewon Choi and Seunghyuk Oh and Saket Dingliwal and Jihoon Tack and Kyuyoung Kim and Woomin Song and Seojin Kim and Insu Han and Jinwoo Shin and Aram Galstyan and Shubham Katiyar and Sravan Babu Bodapati},
  journal = {arXiv preprint arXiv:2506.01206},
  year    = {2025}
}