ResearchTrend.AI

Accelerating LLM Inference with Staged Speculative Decoding
Benjamin Spector, Chris Ré (8 August 2023)
arXiv:2308.04623

Papers citing "Accelerating LLM Inference with Staged Speculative Decoding"

50 of 90 citing papers shown
Accelerating Adaptive Retrieval Augmented Generation via Instruction-Driven Representation Reduction of Retrieval Overlaps
Jie Ou, Jinyu Guo, Shuaihong Jiang, Zhaokun Wang, Libo Qin, Shunyu Yao, Wenhong Tian (19 May 2025)

Automatic Task Detection and Heterogeneous LLM Speculative Decoding
Danying Ge, Jianhua Gao, Qizhi Jiang, Yifei Feng, Weixing Ji (13 May 2025)

PipeSpec: Breaking Stage Dependencies in Hierarchical LLM Decoding
Bradley McDanel, S. Zhang, Y. Hu, Zining Liu (02 May 2025)

Efficient Reasoning for LLMs through Speculative Chain-of-Thought
Jikai Wang, J. Li, Lijun Wu, Hao Fei (27 Apr 2025)

PipeDec: Low-Latency Pipeline-based Inference with Dynamic Speculative Decoding towards Large-scale Models
Haofei Yin, Mengbai Xiao, Rouzhou Lu, Xiao Zhang, Dongxiao Yu, Guanghui Zhang (05 Apr 2025)

ML-SpecQD: Multi-Level Speculative Decoding with Quantized Drafts
E. Georganas, Dhiraj D. Kalamkar, Alexander Kozlov, A. Heinecke (17 Mar 2025)

Speculative Decoding and Beyond: An In-Depth Survey of Techniques
Y. Hu, Zining Liu, Zhenyuan Dong, Tianfan Peng, Bradley McDanel, S. Zhang (27 Feb 2025)

Towards Optimal Multi-draft Speculative Decoding
Zhibo Hu, Tong Zheng, Vignesh Viswanathan, Ziyi Chen, Ryan Rossi, Yihan Wu, Dinesh Manocha, Heng Huang (26 Feb 2025)

TETRIS: Optimal Draft Token Selection for Batch Speculative Decoding
Zhaoxuan Wu, Zijian Zhou, Arun Verma, Alok Prakash, Daniela Rus, Bryan Kian Hsiang Low (24 Feb 2025)

CodeSwift: Accelerating LLM Inference for Efficient Code Generation
Qianhui Zhao, L. Zhang, Fang Liu, Xiaoli Lian, Qiaoyuanhe Meng, Ziqian Jiao, Zetong Zhou, Borui Zhang, Runlin Guo, Jia Li (24 Feb 2025)

Learning to Keep a Promise: Scaling Language Model Decoding Parallelism with Learned Asynchronous Decoding
Tian Jin, Ellie Y. Cheng, Zack Ankner, Nikunj Saunshi, Blake M. Elias, Amir Yazdanbakhsh, Jonathan Ragan-Kelley, Suvinay Subramanian, Michael Carbin (24 Feb 2025)

PAPI: Exploiting Dynamic Parallelism in Large Language Model Decoding with a Processing-In-Memory-Enabled Computing System
Yintao He, Haiyu Mao, Christina Giannoula, Mohammad Sadrosadati, Juan Gómez Luna, Huawei Li, Xiaowei Li, Ying Wang, O. Mutlu (21 Feb 2025)

Lossless Acceleration of Large Language Models with Hierarchical Drafting based on Temporal Locality in Speculative Decoding
Sukmin Cho, S. Choi, T. Hwang, Jeongyeon Seo, Soyeong Jeong, Huije Lee, Hoyun Song, Jong C. Park, Youngjin Kwon (08 Feb 2025)

M2R2: Mixture of Multi-Rate Residuals for Efficient Transformer Inference
Nikhil Bhendawade, Mahyar Najibi, Devang Naik, Irina Belousova (04 Feb 2025)

Constrained Decoding with Speculative Lookaheads
Nishanth Nakshatri, Shamik Roy, Rajarshi Das, Suthee Chaidaroon, Leonid Boytsov, Rashmi Gangadharaiah (09 Dec 2024)

Speculative Decoding with CTC-based Draft Model for LLM Inference Acceleration
Zhuofan Wen, Shangtong Gui, Yang Feng (25 Nov 2024)

SSSD: Simply-Scalable Speculative Decoding
Michele Marzollo, Jiawei Zhuang, Niklas Roemer, Lorenz K. Müller, Lukas Cavigelli (08 Nov 2024)

SpecHub: Provable Acceleration to Multi-Draft Speculative Decoding
Ryan Sun, Tianyi Zhou, Xun Chen, Lichao Sun (08 Nov 2024)

Privacy Risks of Speculative Decoding in Large Language Models
Jiankun Wei, Abdulrahman Abdulrazzag, Tianchen Zhang, Adel Muursepp, Gururaj Saileshwar (01 Nov 2024)

A Theoretical Perspective for Speculative Decoding Algorithm
Ming Yin, Minshuo Chen, Kaixuan Huang, Mengdi Wang (30 Oct 2024)

FIRP: Faster LLM inference via future intermediate representation prediction
Pengfei Wu, Jiahao Liu, Zhuocheng Gong, Qifan Wang, Jinpeng Li, Jingang Wang, Xunliang Cai, Dongyan Zhao (27 Oct 2024)

ParallelSpec: Parallel Drafter for Efficient Speculative Decoding
Zilin Xiao, Hongming Zhang, Tao Ge, Siru Ouyang, Vicente Ordonez, Dong Yu (08 Oct 2024)

Interactive Speculative Planning: Enhance Agent Efficiency through Co-design of System and User Interface
Wenyue Hua, Mengting Wan, Shashank Vadrevu, Ryan Nadel, Yongfeng Zhang, Chi Wang (30 Sep 2024)

Whisper in Medusa's Ear: Multi-head Efficient Decoding for Transformer-based ASR
Yael Segal-Feldman, Aviv Shamsian, Aviv Navon, Gill Hetz, Joseph Keshet (24 Sep 2024)

Efficiently Dispatching Flash Attention For Partially Filled Attention Masks
Agniv Sharma, Jonas Geiping (23 Sep 2024)

Achieving Peak Performance for Large Language Models: A Systematic Review
Z. R. K. Rostam, Sándor Szénási, Gábor Kertész (07 Sep 2024)

TF-Attack: Transferable and Fast Adversarial Attacks on Large Language Models
Zelin Li, Kehai Chen, Lemao Liu, Xuefeng Bai, Mingming Yang, Yang Xiang, Min Zhang (26 Aug 2024)

Intelligent Router for LLM Workloads: Improving Performance Through Workload-Aware Scheduling
Kunal Jain, Anjaly Parayil, Ankur Mallick, Esha Choukse, Xiaoting Qin, ..., Chetan Bansal, Victor Rühle, Anoop Kulkarni, Steve Kofsky, Saravan Rajmohan (24 Aug 2024)

Turning Trash into Treasure: Accelerating Inference of Large Language Models with Token Recycling
Xianzhen Luo, Yixuan Wang, Qingfu Zhu, Zhiming Zhang, Xuanyu Zhang, Qing Yang, Dongliang Xu (16 Aug 2024)

KOALA: Enhancing Speculative Decoding for LLM via Multi-Layer Draft Heads with Adversarial Learning
Kaiqi Zhang, Jing Zhao, Rui Chen (15 Aug 2024)

CREST: Effectively Compacting a Datastore For Retrieval-Based Speculative Decoding
Sophia Ho, Jinsol Park, Patrick Wang (08 Aug 2024)

Clover-2: Accurate Inference for Regressive Lightweight Speculative Decoding
Bin Xiao, Lujun Gui, Lei Su, Weipeng Chen (01 Aug 2024)

Graph-Structured Speculative Decoding
Zhuocheng Gong, Jiahao Liu, Ziyue Wang, Pengfei Wu, Jingang Wang, Xunliang Cai, Dongyan Zhao, Rui Yan (23 Jul 2024)

PipeInfer: Accelerating LLM Inference using Asynchronous Pipelined Speculation
Branden Butler, Sixing Yu, Arya Mazaheri, Ali Jannesari (16 Jul 2024)

Let the Code LLM Edit Itself When You Edit the Code
Zhenyu He, Jun Zhang, Shengjie Luo, Jingjing Xu, Z. Zhang, Di He (03 Jul 2024)

Adaptive Draft-Verification for Efficient Large Language Model Decoding
Xukun Liu, Bowen Lei, Ruqi Zhang, Dongkuan Xu (27 Jun 2024)

OPT-Tree: Speculative Decoding with Adaptive Draft Tree Structure
Jikai Wang, Yi Su, Juntao Li, Qingrong Xia, Zi Ye, Xinyu Duan, Zhefeng Wang, Min Zhang (25 Jun 2024)

EAGLE-2: Faster Inference of Language Models with Dynamic Draft Trees
Yuhui Li, Fangyun Wei, Chao Zhang, Hongyang R. Zhang (24 Jun 2024)

Optimized Speculative Sampling for GPU Hardware Accelerators
Dominik Wagner, Seanie Lee, Ilja Baumann, Philipp Seeberger, Korbinian Riedhammer, Tobias Bocklet (16 Jun 2024)

Speculative Decoding via Early-exiting for Faster LLM Inference with Thompson Sampling Control Mechanism
Jiahao Liu, Qifan Wang, Jingang Wang, Xunliang Cai (06 Jun 2024)

SpecExec: Massively Parallel Speculative Decoding for Interactive LLM Inference on Consumer Devices
Ruslan Svirschevski, Avner May, Zhuoming Chen, Beidi Chen, Zhihao Jia, Max Ryabinin (04 Jun 2024)

Block Transformer: Global-to-Local Language Modeling for Fast Inference
Namgyu Ho, Sangmin Bae, Taehyeon Kim, Hyunjik Jo, Yireun Kim, Tal Schuster, Adam Fisch, James Thorne, Se-Young Yun (04 Jun 2024)

OccamLLM: Fast and Exact Language Model Arithmetic in a Single Step
Owen Dugan, Donato Manuel Jimenez Beneto, Charlotte Loh, Zhuo Chen, Rumen Dangovski, Marin Soljacic (04 Jun 2024)

SUBLLM: A Novel Efficient Architecture with Token Sequence Subsampling for LLM
Quandong Wang, Yuxuan Yuan, Xiaoyu Yang, Ruike Zhang, Kang Zhao, Wei Liu, Jian Luan, Daniel Povey, Bin Wang (03 Jun 2024)

S3D: A Simple and Cost-Effective Self-Speculative Decoding Scheme for Low-Memory GPUs
Wei Zhong, Manasa Bharadwaj (30 May 2024)

SpecDec++: Boosting Speculative Decoding via Adaptive Candidate Lengths
Kaixuan Huang, Xudong Guo, Mengdi Wang (30 May 2024)

Faster Cascades via Speculative Decoding
Harikrishna Narasimhan, Wittawat Jitkrittum, A. S. Rawat, Seungyeon Kim, Neha Gupta, A. Menon, Sanjiv Kumar (29 May 2024)

Nearest Neighbor Speculative Decoding for LLM Generation and Attribution
Minghan Li, Xilun Chen, Ari Holtzman, Beidi Chen, Jimmy Lin, Wen-tau Yih, Xi Lin (29 May 2024)

A Declarative System for Optimizing AI Workloads
Chunwei Liu, Matthew Russo, Michael Cafarella, Lei Cao, Peter Baille Chen, Zui Chen, Michael Franklin, Tim Kraska, Samuel Madden, Gerardo Vitagliano (23 May 2024)

A Comprehensive Survey of Accelerated Generation Techniques in Large Language Models
Mahsa Khoshnoodi, Vinija Jain, Mingye Gao, Malavika Srikanth, Aman Chadha (15 May 2024)