Papers
Communities
Events
Blog
Pricing
Search
Open menu
Home
Papers
2402.12374
Cited By
Sequoia: Scalable, Robust, and Hardware-aware Speculative Decoding
19 February 2024
Zhuoming Chen
Avner May
Ruslan Svirschevski
Yuhsun Huang
Max Ryabinin
Zhihao Jia
Beidi Chen
Re-assign community
ArXiv
PDF
HTML
Papers citing
"Sequoia: Scalable, Robust, and Hardware-aware Speculative Decoding"
35 / 35 papers shown
Title
BanditSpec: Adaptive Speculative Decoding via Bandit Algorithms
Yunlong Hou
Fengzhuo Zhang
Cunxiao Du
Xuan Zhang
Jiachun Pan
Tianyu Pang
Chao Du
Vincent Y. F. Tan
Zhuoran Yang
OffRL
12
0
0
21 May 2025
Automatic Task Detection and Heterogeneous LLM Speculative Decoding
Danying Ge
Jianhua Gao
Qizhi Jiang
Yifei Feng
Weixing Ji
41
0
0
13 May 2025
Scaling Laws for Speculative Decoding
Siyuan Yan
Mo Zhu
Guo-qing Jiang
Jianfei Wang
Jiaxing Chen
...
Xiang Liao
Xiao Cui
Chen Zhang
Zhuoran Song
Ran Zhu
LRM
48
0
0
08 May 2025
PARD: Accelerating LLM Inference with Low-Cost PARallel Draft Model Adaptation
Zihao An
Huajun Bai
Ziqiang Liu
Dong Li
E. Barsoum
61
0
0
23 Apr 2025
Collaborative Speculative Inference for Efficient LLM Inference Serving
Luyao Gao
Jianchun Liu
Hongli Xu
Xichong Zhang
Yunming Liao
Liusheng Huang
46
0
0
13 Mar 2025
Fuzzy Speculative Decoding for a Tunable Accuracy-Runtime Tradeoff
Maximilian Holsman
Yukun Huang
Bhuwan Dhingra
46
0
0
28 Feb 2025
Speculative Decoding and Beyond: An In-Depth Survey of Techniques
Y. Hu
Zining Liu
Zhenyuan Dong
Tianfan Peng
Bradley McDanel
S. Zhang
93
0
0
27 Feb 2025
Towards Optimal Multi-draft Speculative Decoding
Zhibo Hu
Tong Zheng
Vignesh Viswanathan
Ziyi Chen
Ryan Rossi
Yihan Wu
Dinesh Manocha
Heng Huang
47
3
0
26 Feb 2025
CORAL: Learning Consistent Representations across Multi-step Training with Lighter Speculative Drafter
Yepeng Weng
Dianwen Mei
Huishi Qiu
Xujie Chen
Li Liu
Jiang Tian
Zhongchao Shi
53
0
0
24 Feb 2025
Minions: Cost-efficient Collaboration Between On-device and Cloud Language Models
A. Narayan
D. Biderman
Sabri Eyuboglu
Avner May
Scott W. Linderman
James Zou
Christopher Ré
60
1
0
21 Feb 2025
GRIFFIN: Effective Token Alignment for Faster Speculative Decoding
Shijing Hu
Jingyang Li
Xingyu Xie
Zhihui Lu
Kim-Chuan Toh
Pan Zhou
48
0
0
16 Feb 2025
QuantSpec: Self-Speculative Decoding with Hierarchical Quantized KV Cache
Rishabh Tiwari
Haocheng Xi
Aditya Tomar
Coleman Hooper
Sehoon Kim
Maxwell Horton
Mahyar Najibi
Michael W. Mahoney
Kemal Kurniawan
Amir Gholami
MQ
66
1
0
05 Feb 2025
Judge Decoding: Faster Speculative Sampling Requires Going Beyond Model Alignment
Gregor Bachmann
Sotiris Anagnostidis
Albert Pumarola
Markos Georgopoulos
A. Sanakoyeu
Yuming Du
Edgar Schönfeld
Ali K. Thabet
Jonas Kohler
ALM
BDL
111
8
0
31 Jan 2025
AdaServe: Accelerating Multi-SLO LLM Serving with SLO-Customized Speculative Decoding
Zikun Li
Zhuofu Chen
Remi Delacourt
Gabriele Oliaro
Zeyu Wang
...
Zhuoming Chen
Sean Lai
Xinhao Cheng
Xupeng Miao
Zhihao Jia
53
6
0
21 Jan 2025
FlashInfer: Efficient and Customizable Attention Engine for LLM Inference Serving
Zihao Ye
Lequn Chen
Ruihang Lai
Wuwei Lin
Yineng Zhang
...
Tianqi Chen
Baris Kasikci
Vinod Grover
Arvind Krishnamurthy
Luis Ceze
67
21
0
02 Jan 2025
Deploying Foundation Model Powered Agent Services: A Survey
Wenchao Xu
Jinyu Chen
Peirong Zheng
Xiaoquan Yi
Tianyi Tian
...
Quan Wan
Yining Qi
Yunfeng Fan
Qinliang Su
Xuemin Shen
AI4CE
119
1
0
18 Dec 2024
Constrained Decoding with Speculative Lookaheads
Nishanth Nakshatri
Shamik Roy
Rajarshi Das
Suthee Chaidaroon
Leonid Boytsov
Rashmi Gangadharaiah
84
0
0
09 Dec 2024
SSSD: Simply-Scalable Speculative Decoding
Michele Marzollo
Jiawei Zhuang
Niklas Roemer
Lorenz K. Müller
Lukas Cavigelli
LRM
47
2
0
08 Nov 2024
SpecHub: Provable Acceleration to Multi-Draft Speculative Decoding
Ryan Sun
Tianyi Zhou
Xun Chen
Lichao Sun
34
5
0
08 Nov 2024
TreeBoN: Enhancing Inference-Time Alignment with Speculative Tree-Search and Best-of-N Sampling
Jiahao Qiu
Yifu Lu
Yifan Zeng
Jiacheng Guo
Jiayi Geng
Huazheng Wang
Kaixuan Huang
Yue Wu
Mengdi Wang
56
22
0
18 Oct 2024
Cerberus: Efficient Inference with Adaptive Parallel Decoding and Sequential Knowledge Enhancement
Yuxuan Liu
Wenyuan Li
Laizhong Cui
Hailiang Yang
OffRL
34
0
0
17 Oct 2024
DySpec: Faster Speculative Decoding with Dynamic Token Tree Structure
Yunfan Xiong
Ruoyu Zhang
Yanzeng Li
Tianhao Wu
Lei Zou
40
5
0
15 Oct 2024
ParallelSpec: Parallel Drafter for Efficient Speculative Decoding
Zilin Xiao
Hongming Zhang
Tao Ge
Siru Ouyang
Vicente Ordonez
Dong Yu
44
5
0
08 Oct 2024
Large Language Model Inference Acceleration: A Comprehensive Hardware Perspective
Jinhao Li
Jiaming Xu
Shan Huang
Yonghua Chen
Wen Li
...
Jiayi Pan
Li Ding
Hao Zhou
Yu Wang
Guohao Dai
62
17
0
06 Oct 2024
Inference Optimization of Foundation Models on AI Accelerators
Youngsuk Park
Kailash Budhathoki
Liangfu Chen
Jonas M. Kübler
Jiaji Huang
Matthäus Kleindessner
Jun Huan
V. Cevher
Yida Wang
George Karypis
45
3
0
12 Jul 2024
OPT-Tree: Speculative Decoding with Adaptive Draft Tree Structure
Jikai Wang
Yi Su
Juntao Li
Qingrong Xia
Zi Ye
Xinyu Duan
Zhefeng Wang
Min Zhang
46
13
0
25 Jun 2024
EAGLE-2: Faster Inference of Language Models with Dynamic Draft Trees
Yuhui Li
Fangyun Wei
Chao Zhang
Hongyang R. Zhang
93
55
0
24 Jun 2024
Optimizing Speculative Decoding for Serving Large Language Models Using Goodput
Xiaoxuan Liu
Cade Daniel
Langxiang Hu
Woosuk Kwon
Zhuohan Li
Xiangxi Mo
Alvin Cheung
Zhijie Deng
Ion Stoica
Hao Zhang
24
18
0
20 Jun 2024
S3D: A Simple and Cost-Effective Self-Speculative Decoding Scheme for Low-Memory GPUs
Wei Zhong
Manasa Bharadwaj
49
5
0
30 May 2024
SpecDec++: Boosting Speculative Decoding via Adaptive Candidate Lengths
Kaixuan Huang
Xudong Guo
Mengdi Wang
47
20
0
30 May 2024
Hardware-Aware Parallel Prompt Decoding for Memory-Efficient Acceleration of LLM Inference
Hao Mark Chen
Wayne Luk
Ka-Fai Cedric Yiu
Rui Li
Konstantin Mishchenko
Stylianos I. Venieris
Hongxiang Fan
47
7
0
28 May 2024
EMS-SD: Efficient Multi-sample Speculative Decoding for Accelerating Large Language Models
Yunsheng Ni
Chuanjian Liu
Yehui Tang
Kai Han
Yunhe Wang
31
0
0
13 May 2024
TriForce: Lossless Acceleration of Long Sequence Generation with Hierarchical Speculative Decoding
Hanshi Sun
Zhuoming Chen
Xinyu Yang
Yuandong Tian
Beidi Chen
46
49
0
18 Apr 2024
FlexGen: High-Throughput Generative Inference of Large Language Models with a Single GPU
Ying Sheng
Lianmin Zheng
Binhang Yuan
Zhuohan Li
Max Ryabinin
...
Joseph E. Gonzalez
Percy Liang
Christopher Ré
Ion Stoica
Ce Zhang
149
371
0
13 Mar 2023
Sparsity in Deep Learning: Pruning and growth for efficient inference and training in neural networks
Torsten Hoefler
Dan Alistarh
Tal Ben-Nun
Nikoli Dryden
Alexandra Peste
MQ
147
685
0
31 Jan 2021
1