Predictive Pipelined Decoding: A Compute-Latency Trade-off for Exact LLM Decoding
arXiv:2307.05908 · 12 July 2023
Seongjun Yang, Gibbeum Lee, Jaewoong Cho, Dimitris Papailiopoulos, Kangwook Lee
Papers citing "Predictive Pipelined Decoding: A Compute-Latency Trade-off for Exact LLM Decoding" (24 papers)

Accelerating Adaptive Retrieval Augmented Generation via Instruction-Driven Representation Reduction of Retrieval Overlaps
Jie Ou, Jinyu Guo, Shuaihong Jiang, Zhaokun Wang, Libo Qin, Shunyu Yao, Wenhong Tian · 19 May 2025

EAGLE-3: Scaling up Inference Acceleration of Large Language Models via Training-Time Test
Yuhui Li, Fangyun Wei, Chao Zhang, Hongyang R. Zhang · 03 Mar 2025

Speculate, then Collaborate: Fusing Knowledge of Language Models during Decoding
Ziyi Wang, Muneeza Azmat, Ang Li, R. Horesh, Mikhail Yurochkin · 11 Feb 2025

Efficiency Unleashed: Inference Acceleration for LLM-based Recommender Systems with Speculative Decoding
Yunjia Xi, Hangyu Wang, Bo Chen, Jianghao Lin, Menghui Zhu, Wen Liu, Ruiming Tang, Zhewei Wei, Weinan Zhang, Yong Yu · 11 Aug 2024

OPT-Tree: Speculative Decoding with Adaptive Draft Tree Structure
Jikai Wang, Yi Su, Juntao Li, Qingrong Xia, Zi Ye, Xinyu Duan, Zhefeng Wang, Min Zhang · 25 Jun 2024

Decoding Speculative Decoding
Minghao Yan, Saurabh Agarwal, Shivaram Venkataraman · 02 Feb 2024

EAGLE: Speculative Sampling Requires Rethinking Feature Uncertainty
Yuhui Li, Fangyun Wei, Chao Zhang, Hongyang R. Zhang · 26 Jan 2024

Full Stack Optimization of Transformer Inference: a Survey
Sehoon Kim, Coleman Hooper, Thanakul Wattanawong, Minwoo Kang, Ruohan Yan, ..., Qijing Huang, Kurt Keutzer, Michael W. Mahoney, Y. Shao, A. Gholami · 27 Feb 2023

Fast Inference from Transformers via Speculative Decoding
Yaniv Leviathan, Matan Kalman, Yossi Matias · 30 Nov 2022

You Need Multiple Exiting: Dynamic Early Exiting for Accelerating Unified Vision Language Model
Sheng Tang, Yaqing Wang, Zhenglun Kong, Tianchi Zhang, Yao Li, Caiwen Ding, Yanzhi Wang, Yi Liang, Dongkuan Xu · 21 Nov 2022

Training Compute-Optimal Large Language Models
Jordan Hoffmann, Sebastian Borgeaud, A. Mensch, Elena Buchatskaya, Trevor Cai, ..., Karen Simonyan, Erich Elsen, Jack W. Rae, Oriol Vinyals, Laurent Sifre · 29 Mar 2022

The Optimal BERT Surgeon: Scalable and Accurate Second-Order Pruning for Large Language Models
Eldar Kurtic, Daniel Fernando Campos, Tuan Nguyen, Elias Frantar, Mark Kurtz, Ben Fineran, Michael Goin, Dan Alistarh · 14 Mar 2022

A Simple Hash-Based Early Exiting Approach For Language Understanding and Generation
Tianxiang Sun, Xiangyang Liu, Wei Zhu, Zhichao Geng, Lingling Wu, Yilong He, Yuan Ni, Guotong Xie, Xuanjing Huang, Xipeng Qiu · 03 Mar 2022

LaMDA: Language Models for Dialog Applications
R. Thoppilan, Daniel De Freitas, Jamie Hall, Noam M. Shazeer, Apoorv Kulshreshtha, ..., Blaise Aguera-Arcas, Claire Cui, M. Croak, Ed H. Chi, Quoc Le · 20 Jan 2022

AdaViT: Adaptive Tokens for Efficient Vision Transformer
Hongxu Yin, Arash Vahdat, J. Álvarez, Arun Mallya, Jan Kautz, Pavlo Molchanov · 14 Dec 2021

I-BERT: Integer-only BERT Quantization
Sehoon Kim, A. Gholami, Z. Yao, Michael W. Mahoney, Kurt Keutzer · 05 Jan 2021

Deep Encoder, Shallow Decoder: Reevaluating Non-autoregressive Machine Translation
Jungo Kasai, Nikolaos Pappas, Hao Peng, James Cross, Noah A. Smith · 18 Jun 2020

GOBO: Quantizing Attention-Based NLP Models for Low Latency and Energy Efficient Inference
Ali Hadi Zadeh, Isak Edo, Omar Mohamed Awad, Andreas Moshovos · 08 May 2020

DistilBERT, a distilled version of BERT: smaller, faster, cheaper and lighter
Victor Sanh, Lysandre Debut, Julien Chaumond, Thomas Wolf · 02 Oct 2019

Reducing Transformer Depth on Demand with Structured Dropout
Angela Fan, Edouard Grave, Armand Joulin · 25 Sep 2019

TinyBERT: Distilling BERT for Natural Language Understanding
Xiaoqi Jiao, Yichun Yin, Lifeng Shang, Xin Jiang, Xiao Chen, Linlin Li, F. Wang, Qun Liu · 23 Sep 2019

Are Sixteen Heads Really Better than One?
Paul Michel, Omer Levy, Graham Neubig · 25 May 2019

Analyzing Multi-Head Self-Attention: Specialized Heads Do the Heavy Lifting, the Rest Can Be Pruned
Elena Voita, David Talbot, F. Moiseev, Rico Sennrich, Ivan Titov · 23 May 2019

SQuAD: 100,000+ Questions for Machine Comprehension of Text
Pranav Rajpurkar, Jian Zhang, Konstantin Lopyrev, Percy Liang · 16 Jun 2016