Papers
Communities
Events
Blog
Pricing
Search
Open menu
Home
Papers
2401.07851
Cited By
Unlocking Efficiency in Large Language Model Inference: A Comprehensive Survey of Speculative Decoding
15 January 2024
Heming Xia
Zhe Yang
Qingxiu Dong
Peiyi Wang
Yongqi Li
Tao Ge
Tianyu Liu
Wenjie Li
Zhifang Sui
LRM
Re-assign community
ArXiv
PDF
HTML
Papers citing
"Unlocking Efficiency in Large Language Model Inference: A Comprehensive Survey of Speculative Decoding"
50 / 86 papers shown
Title
Automatic Task Detection and Heterogeneous LLM Speculative Decoding
Danying Ge
Jianhua Gao
Qizhi Jiang
Yifei Feng
Weixing Ji
34
0
0
13 May 2025
Overflow Prevention Enhances Long-Context Recurrent LLMs
Assaf Ben-Kish
Itamar Zimerman
M. Jehanzeb Mirza
James R. Glass
Leonid Karlinsky
Raja Giryes
LRM
29
0
0
12 May 2025
Accelerating Large Language Model Reasoning via Speculative Search
Zhihai Wang
Jie Wang
Jilai Pan
Xilin Xia
Huiling Zhen
M. Yuan
Jianye Hao
Feng Wu
ReLM
LRM
59
0
0
03 May 2025
Taming the Titans: A Survey of Efficient LLM Inference Serving
Ranran Zhen
J. Li
Yixin Ji
Z. Yang
Tong Liu
Qingrong Xia
Xinyu Duan
Z. Wang
Baoxing Huai
M. Zhang
LLMAG
77
0
0
28 Apr 2025
Towards Harnessing the Collaborative Power of Large and Small Models for Domain Tasks
Yang Liu
Bingjie Yan
Tianyuan Zou
Jianqing Zhang
Zixuan Gu
...
J. Li
Xiaozhou Ye
Ye Ouyang
Qiang Yang
Wenjie Qu
ALM
155
1
0
24 Apr 2025
Jupiter: Fast and Resource-Efficient Collaborative Inference of Generative LLMs on Edge Devices
Shengyuan Ye
Bei Ouyang
Liekang Zeng
Tianyi Qian
Xiaowen Chu
Jian Tang
Xu Chen
29
1
0
11 Apr 2025
SD
2
^2
2
: Self-Distilled Sparse Drafters
Mike Lasby
Nish Sinnadurai
Valavan Manohararajah
Sean Lie
Vithursan Thangarasa
143
1
0
10 Apr 2025
SPIRe: Boosting LLM Inference Throughput with Speculative Decoding
Sanjit Neelam
Daniel Heinlein
Vaclav Cvicek
Akshay Mishra
Reiner Pope
LRM
38
0
0
08 Apr 2025
DEL: Context-Aware Dynamic Exit Layer for Efficient Self-Speculative Decoding
Hossein Entezari Zarch
Lei Gao
Chaoyi Jiang
Murali Annavaram
LRM
31
0
0
08 Apr 2025
PipeDec: Low-Latency Pipeline-based Inference with Dynamic Speculative Decoding towards Large-scale Models
Haofei Yin
Mengbai Xiao
Rouzhou Lu
Xiao Zhang
Dongxiao Yu
Guanghui Zhang
AI4CE
24
0
0
05 Apr 2025
Token-Driven GammaTune: Adaptive Calibration for Enhanced Speculative Decoding
Aayush Gautam
Susav Shrestha
Narasimha Annapareddy
48
0
0
28 Mar 2025
A Novel Hat-Shaped Device-Cloud Collaborative Inference Framework for Large Language Models
Zuan Xie
Yang Xu
Hongli Xu
Yunming Liao
Zhiwei Yao
51
0
0
23 Mar 2025
ML-SpecQD: Multi-Level Speculative Decoding with Quantized Drafts
E. Georganas
Dhiraj D. Kalamkar
Alexander Kozlov
A. Heinecke
MQ
131
0
0
17 Mar 2025
Gumiho: A Hybrid Architecture to Prioritize Early Tokens in Speculative Decoding
Jiajun Li
Yixing Xu
Haiduo Huang
Xuanwu Yin
D. Li
Edith C. -H. Ngai
E. Barsoum
56
0
0
13 Mar 2025
Collaborative Speculative Inference for Efficient LLM Inference Serving
Luyao Gao
Jianchun Liu
Hongli Xu
Xichong Zhang
Yunming Liao
Liusheng Huang
46
0
0
13 Mar 2025
SpecServe: Efficient and SLO-Aware Large Language Model Serving with Adaptive Speculative Decoding
Kaiyu Huang
Yu Wang
Zhubo Shi
Han Zou
Minchen Yu
Qingjiang Shi
LRM
36
1
0
07 Mar 2025
EAGLE-3: Scaling up Inference Acceleration of Large Language Models via Training-Time Test
Yuhui Li
Fangyun Wei
Chao Zhang
Hongyang R. Zhang
117
5
0
03 Mar 2025
DuoDecoding: Hardware-aware Heterogeneous Speculative Decoding with Dynamic Multi-Sequence Drafting
Kai Lv
Honglin Guo
Qipeng Guo
Xipeng Qiu
41
0
0
02 Mar 2025
Tutorial Proposal: Speculative Decoding for Efficient LLM Inference
Heming Xia
Cunxiao Du
Yongqian Li
Qian Liu
Wenjie Li
39
0
0
01 Mar 2025
TETRIS: Optimal Draft Token Selection for Batch Speculative Decoding
Zhaoxuan Wu
Zijian Zhou
Arun Verma
Alok Prakash
Daniela Rus
Bryan Kian Hsiang Low
60
0
0
24 Feb 2025
DReSD: Dense Retrieval for Speculative Decoding
Milan Gritta
Huiyin Xue
Gerasimos Lampouras
RALM
100
0
0
24 Feb 2025
Speculate, then Collaborate: Fusing Knowledge of Language Models during Decoding
Zhongqi Wang
Muneeza Azmart
Ang Li
R. Horesh
Mikhail Yurochkin
118
1
0
11 Feb 2025
Lossless Acceleration of Large Language Models with Hierarchical Drafting based on Temporal Locality in Speculative Decoding
Sukmin Cho
S. Choi
T. Hwang
Jeongyeon Seo
Soyeong Jeong
Huije Lee
Hoyun Song
Jong C. Park
Youngjin Kwon
51
0
0
08 Feb 2025
Falcon: Faster and Parallel Inference of Large Language Models through Enhanced Semi-Autoregressive Drafting and Custom-Designed Decoding Tree
Xiangxiang Gao
Weisheng Xie
Yiwei Xiang
Feng Ji
82
5
0
17 Dec 2024
Constrained Decoding with Speculative Lookaheads
Nishanth Nakshatri
Shamik Roy
Rajarshi Das
Suthee Chaidaroon
Leonid Boytsov
Rashmi Gangadharaiah
72
0
0
09 Dec 2024
PLD+: Accelerating LLM inference by leveraging Language Model Artifacts
Shwetha Somasundaram
Anirudh Phukan
Apoorv Saxena
77
1
0
02 Dec 2024
Speculative Decoding with CTC-based Draft Model for LLM Inference Acceleration
Zhuofan Wen
Shangtong Gui
Yang Feng
96
2
0
25 Nov 2024
Closer Look at Efficient Inference Methods: A Survey of Speculative Decoding
Hyun Ryu
Eric Kim
74
3
0
20 Nov 2024
SAM Decoding: Speculative Decoding via Suffix Automaton
Yuxuan Hu
Ke Wang
Jing Zhang
Fanjin Zhang
C. Li
H. Chen
Jing Zhang
44
1
0
16 Nov 2024
SSSD: Simply-Scalable Speculative Decoding
Michele Marzollo
Jiawei Zhuang
Niklas Roemer
Lorenz K. Müller
Lukas Cavigelli
LRM
39
2
0
08 Nov 2024
Privacy Risks of Speculative Decoding in Large Language Models
Jiankun Wei
Abdulrahman Abdulrazzag
Tianchen Zhang
Adel Muursepp
Gururaj Saileshwar
35
2
0
01 Nov 2024
LLM-Inference-Bench: Inference Benchmarking of Large Language Models on AI Accelerators
Krishna Teja Chitty-Venkata
Siddhisanket Raskar
B. Kale
Farah Ferdaus
Aditya Tanikanti
Ken Raffenetti
Valerie Taylor
M. Emani
V. Vishwanath
39
7
0
31 Oct 2024
A Theoretical Perspective for Speculative Decoding Algorithm
Ming Yin
Minshuo Chen
Kaixuan Huang
Mengdi Wang
32
4
0
30 Oct 2024
Towards Unifying Understanding and Generation in the Era of Vision Foundation Models: A Survey from the Autoregression Perspective
Shenghao Xie
Wenqiang Zu
Mingyang Zhao
Duo Su
Shilong Liu
Ruohua Shi
Guoqi Li
Shanghang Zhang
Lei Ma
LRM
44
3
0
29 Oct 2024
Dynamic layer selection in decoder-only transformers
Theodore Glavas
Joud Chataoui
Florence Regol
Wassim Jabbour
Antonios Valkanas
Boris N. Oreshkin
Mark J. Coates
AI4CE
26
0
0
26 Oct 2024
AMUSD: Asynchronous Multi-Device Speculative Decoding for LLM Acceleration
Bradley McDanel
LRM
20
2
0
22 Oct 2024
Cerberus: Efficient Inference with Adaptive Parallel Decoding and Sequential Knowledge Enhancement
Yuxuan Liu
Wenyuan Li
Laizhong Cui
Hailiang Yang
OffRL
29
0
0
17 Oct 2024
Self-Data Distillation for Recovering Quality in Pruned Large Language Models
Vithursan Thangarasa
Ganesh Venkatesh
Mike Lasby
Nish Sinnadurai
Sean Lie
SyDa
38
1
0
13 Oct 2024
Root Defence Strategies: Ensuring Safety of LLM at the Decoding Level
Xinyi Zeng
Yuying Shang
Yutao Zhu
Jingyuan Zhang
Yu Tian
AAML
130
2
0
09 Oct 2024
SWIFT: On-the-Fly Self-Speculative Decoding for LLM Inference Acceleration
Heming Xia
Yongqi Li
Jun Zhang
Cunxiao Du
Wenjie Li
LRM
48
5
0
09 Oct 2024
ParallelSpec: Parallel Drafter for Efficient Speculative Decoding
Zilin Xiao
Hongming Zhang
Tao Ge
Siru Ouyang
Vicente Ordonez
Dong Yu
39
5
0
08 Oct 2024
Efficient Inference for Large Language Model-based Generative Recommendation
Xinyu Lin
Chaoqun Yang
Wenjie Wang
Yongqi Li
Cunxiao Du
Fuli Feng
See-Kiong Ng
Tat-Seng Chua
67
4
0
07 Oct 2024
SwiftKV: Fast Prefill-Optimized Inference with Knowledge-Preserving Model Transformation
Aurick Qiao
Z. Yao
Samyam Rajbhandari
Yuxiong He
VLM
32
0
0
04 Oct 2024
Mixture of Attentions For Speculative Decoding
Matthieu Zimmer
Milan Gritta
Gerasimos Lampouras
Haitham Bou Ammar
Jun Wang
76
4
0
04 Oct 2024
Speculative Coreset Selection for Task-Specific Fine-tuning
Xiaoyu Zhang
Juan Zhai
Shiqing Ma
Chao Shen
Tianlin Li
Weipeng Jiang
Yang Liu
30
1
0
02 Oct 2024
House of Cards: Massive Weights in LLMs
Jaehoon Oh
Seungjun Shin
Dokwan Oh
35
1
0
02 Oct 2024
Whisper in Medusa's Ear: Multi-head Efficient Decoding for Transformer-based ASR
Yael Segal-Feldman
Aviv Shamsian
Aviv Navon
Gill Hetz
Joseph Keshet
27
1
0
24 Sep 2024
A-VL: Adaptive Attention for Large Vision-Language Models
Junyang Zhang
Mu Yuan
Ruiguang Zhong
Puhan Luo
Huiyou Zhan
Ningkang Zhang
Chengchen Hu
Xiangyang Li
VLM
43
1
0
23 Sep 2024
What is the Role of Small Models in the LLM Era: A Survey
Lihu Chen
Gaël Varoquaux
ALM
63
23
0
10 Sep 2024
Boosting Lossless Speculative Decoding via Feature Sampling and Partial Alignment Distillation
Lujun Gui
Bin Xiao
Lei Su
Weipeng Chen
30
2
0
28 Aug 2024
1
2
Next