ProMoE: Fast MoE-based LLM Serving using Proactive Caching
arXiv 2410.22134 (v2, latest) · 29 October 2024
Xiaoniu Song, Zihang Zhong, Rong Chen, Haibo Chen
[MoE]

Papers citing "ProMoE: Fast MoE-based LLM Serving using Proactive Caching" (29 of 29 shown)

Not All Models Suit Expert Offloading: On Local Routing Consistency of Mixture-of-Expert Models
Jingcong Liang, Siyuan Wang, Miren Tian, Yitong Li, Duyu Tang, Zhongyu Wei
[MoE] · 0 citations · 21 May 2025

FloE: On-the-Fly MoE Inference on Memory-constrained GPU
Yuxin Zhou, Zheng Li, Junxuan Zhang, Jue Wang, Yanjie Wang, Zhongle Xie, Ke Chen, Lidan Shou
[MoE] · 0 citations · 09 May 2025

HybriMoE: Hybrid CPU-GPU Scheduling and Cache Management for Efficient MoE Inference
Shuzhang Zhong, Yizhou Sun, Ling Liang, Runsheng Wang, R. Huang, Meng Li
[MoE] · 1 citation · 08 Apr 2025

MoE-Gen: High-Throughput MoE Inference on a Single GPU with Module-Based Batching
Tairan Xu, Leyang Xue, Zhan Lu, Adrian Jackson, Kai Zou
[MoE] · 2 citations · 12 Mar 2025

fMoE: Fine-Grained Expert Offloading for Large Mixture-of-Experts Serving
Hanfei Yu, Xingqi Cui, Huatian Zhang, Han Wang, Hao Wang
[MoE] · 1 citation · 07 Feb 2025

InfiniGen: Efficient Generative Inference of Large Language Models with Dynamic KV Cache Management
Wonbeom Lee, Jungi Lee, Junghwan Seo, Jaewoong Sim
[RALM] · 94 citations · 28 Jun 2024

OpenMoE: An Early Effort on Open Mixture-of-Experts Language Models
Fuzhao Xue, Zian Zheng, Yao Fu, Jinjie Ni, Zangwei Zheng, Wangchunshu Zhou, Yang You
[MoE] · 103 citations · 29 Jan 2024

DeepSpeed-FastGen: High-throughput Text Generation for LLMs via MII and DeepSpeed-Inference
Connor Holmes, Masahiro Tanaka, Michael Wyatt, A. A. Awan, Jeff Rasley, ..., Reza Yazdani Aminabadi, Heyang Qin, Arash Bakhtiari, Lev Kurilenko, Yuxiong He
71 citations · 09 Jan 2024

Fast Inference of Mixture-of-Experts Language Models with Offloading
Artyom Eliseev, Denis Mazur
[MoE] · 44 citations · 28 Dec 2023

PowerInfer: Fast Large Language Model Serving with a Consumer-grade GPU
Yixin Song, Zeyu Mi, Haotong Xie, Haibo Chen
[BDL] · 134 citations · 16 Dec 2023

Deja Vu: Contextual Sparsity for Efficient LLMs at Inference Time
Zichang Liu, Jue Wang, Tri Dao, Tianyi Zhou, Binhang Yuan, ..., Anshumali Shrivastava, Ce Zhang, Yuandong Tian, Christopher Ré, Beidi Chen
[BDL] · 217 citations · 26 Oct 2023

Efficient Memory Management for Large Language Model Serving with PagedAttention
Woosuk Kwon, Zhuohan Li, Siyuan Zhuang, Ying Sheng, Lianmin Zheng, Cody Hao Yu, Joseph E. Gonzalez, Hao Zhang, Ion Stoica
[VLM] · 2,311 citations · 12 Sep 2023

SwapMoE: Serving Off-the-shelf MoE-based Large Language Models with Tunable Memory Budget
Rui Kong, Yuanchun Li, Qingtian Feng, Weijun Wang, Xiaozhou Ye, Ye Ouyang, Lingyu Kong, Yunxin Liu
[MoE] · 9 citations · 29 Aug 2023

Pre-gated MoE: An Algorithm-System Co-Design for Fast and Scalable Mixture-of-Expert Inference
Ranggi Hwang, Jianyu Wei, Shijie Cao, Changho Hwang, Xiaohu Tang, Ting Cao, Mao Yang
[MoE] · 44 citations · 23 Aug 2023

SpecInfer: Accelerating Generative Large Language Model Serving with Tree-based Speculative Inference and Verification
Xupeng Miao, Gabriele Oliaro, Zhihao Zhang, Xinhao Cheng, Zeyu Wang, ..., Chunan Shi, Zhuoming Chen, Daiyaan Arfeen, Reyna Abhyankar, Zhihao Jia
[LRM] · 155 citations · 16 May 2023

FlexGen: High-Throughput Generative Inference of Large Language Models with a Single GPU
Ying Sheng, Lianmin Zheng, Binhang Yuan, Zhuohan Li, Max Ryabinin, ..., Joseph E. Gonzalez, Percy Liang, Christopher Ré, Ion Stoica, Ce Zhang
400 citations · 13 Mar 2023

LLaMA: Open and Efficient Foundation Language Models
Hugo Touvron, Thibaut Lavril, Gautier Izacard, Xavier Martinet, Marie-Anne Lachaux, ..., Faisal Azhar, Aurelien Rodriguez, Armand Joulin, Edouard Grave, Guillaume Lample
[ALM, PILM] · 13,437 citations · 27 Feb 2023

SparseGPT: Massive Language Models Can Be Accurately Pruned in One-Shot
Elias Frantar, Dan Alistarh
[VLM] · 731 citations · 02 Jan 2023

Fast Inference from Transformers via Speculative Decoding
Yaniv Leviathan, Matan Kalman, Yossi Matias
[LRM] · 733 citations · 30 Nov 2022

GPTQ: Accurate Post-Training Quantization for Generative Pre-trained Transformers
Elias Frantar, Saleh Ashkboos, Torsten Hoefler, Dan Alistarh
[MQ] · 1,005 citations · 31 Oct 2022

No Language Left Behind: Scaling Human-Centered Machine Translation
NLLB Team, Marta R. Costa-jussà, James Cross, Onur Çelebi, Maha Elbayad, ..., Alexandre Mourachko, C. Ropers, Safiyyah Saleem, Holger Schwenk, Jeff Wang
[MoE] · 1,266 citations · 11 Jul 2022

FlashAttention: Fast and Memory-Efficient Exact Attention with IO-Awareness
Tri Dao, Daniel Y. Fu, Stefano Ermon, Atri Rudra, Christopher Ré
[VLM] · 2,279 citations · 27 May 2022

OPT: Open Pre-trained Transformer Language Models
Susan Zhang, Stephen Roller, Naman Goyal, Mikel Artetxe, Moya Chen, ..., Daniel Simig, Punit Singh Koura, Anjali Sridhar, Tianlu Wang, Luke Zettlemoyer
[VLM, OSLM, AI4CE] · 3,695 citations · 02 May 2022

Training language models to follow instructions with human feedback
Long Ouyang, Jeff Wu, Xu Jiang, Diogo Almeida, Carroll L. Wainwright, ..., Amanda Askell, Peter Welinder, Paul Christiano, Jan Leike, Ryan J. Lowe
[OSLM, ALM] · 13,176 citations · 04 Mar 2022

Switch Transformers: Scaling to Trillion Parameter Models with Simple and Efficient Sparsity
W. Fedus, Barret Zoph, Noam M. Shazeer
[MoE] · 2,220 citations · 11 Jan 2021

Language Models are Few-Shot Learners
Tom B. Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared Kaplan, ..., Christopher Berner, Sam McCandlish, Alec Radford, Ilya Sutskever, Dario Amodei
[BDL] · 42,379 citations · 28 May 2020

Scaling Laws for Neural Language Models
Jared Kaplan, Sam McCandlish, T. Henighan, Tom B. Brown, B. Chess, R. Child, Scott Gray, Alec Radford, Jeff Wu, Dario Amodei
4,905 citations · 23 Jan 2020

The Case for Learned Index Structures
Tim Kraska, Alex Beutel, Ed H. Chi, J. Dean, N. Polyzotis
1,042 citations · 04 Dec 2017

Outrageously Large Neural Networks: The Sparsely-Gated Mixture-of-Experts Layer
Noam M. Shazeer, Azalia Mirhoseini, Krzysztof Maziarz, Andy Davis, Quoc V. Le, Geoffrey E. Hinton, J. Dean
[MoE] · 2,686 citations · 23 Jan 2017