DeepSpeed-FastGen: High-throughput Text Generation for LLMs via MII and DeepSpeed-Inference (arXiv:2401.08671)
9 January 2024
Connor Holmes, Masahiro Tanaka, Michael Wyatt, A. A. Awan, Jeff Rasley, Samyam Rajbhandari, Reza Yazdani Aminabadi, Heyang Qin, Arash Bakhtiari, Lev Kurilenko, Yuxiong He
Papers citing "DeepSpeed-FastGen: High-throughput Text Generation for LLMs via MII and DeepSpeed-Inference" (12 of 12 papers shown)
Ascendra: Dynamic Request Prioritization for Efficient LLM Serving
Azam Ikram, Xiang Li, Sameh Elnikety, S. Bagchi
29 Apr 2025
Taming the Titans: A Survey of Efficient LLM Inference Serving
Ranran Zhen, J. Li, Yixin Ji, Zheng Yang, Tong Liu, Qingrong Xia, Xinyu Duan, Zehao Wang, Baoxing Huai, Hao Fei
LLMAG
28 Apr 2025
KeyDiff: Key Similarity-Based KV Cache Eviction for Long-Context LLM Inference in Resource-Constrained Environments
Junyoung Park, Dalton Jones, Matthew J Morse, Raghavv Goel, Mingu Lee, Chris Lott
21 Apr 2025
Mitigating KV Cache Competition to Enhance User Experience in LLM Inference
Haiying Shen, Tanmoy Sen, Masahiro Tanaka
17 Mar 2025
Seesaw: High-throughput LLM Inference via Model Re-sharding
Qidong Su, Wei Zhao, Xuelong Li, Muralidhar Andoorveedu, Chenhao Jiang, Zhanda Zhu, Kevin Song, Christina Giannoula, Gennady Pekhimenko
LRM
09 Mar 2025
Klotski: Efficient Mixture-of-Expert Inference via Expert-Aware Multi-Batch Pipeline
Zhiyuan Fang, Yuegui Huang, Zicong Hong, Yufeng Lyu, Wuhui Chen, Yue Yu, Fan Yu, Zibin Zheng
MoE
09 Feb 2025
GUIDE: A Global Unified Inference Engine for Deploying Large Language Models in Heterogeneous Environments
Yanyu Chen, Ganhong Huang
28 Jan 2025
AdaServe: Accelerating Multi-SLO LLM Serving with SLO-Customized Speculative Decoding
Zikun Li, Zhuofu Chen, Remi Delacourt, Gabriele Oliaro, Zeyu Wang, ..., Zhuoming Chen, Sean Lai, Xinhao Cheng, Xupeng Miao, Zhihao Jia
21 Jan 2025
iServe: An Intent-based Serving System for LLMs
Dimitrios Liakopoulos, Tianrui Hu, Prasoon Sinha, N. Yadwadkar
VLM
08 Jan 2025
ProMoE: Fast MoE-based LLM Serving using Proactive Caching
Xiaoniu Song, Zihang Zhong, Rong Chen, Haibo Chen
MoE
29 Oct 2024
InstInfer: In-Storage Attention Offloading for Cost-Effective Long-Context LLM Inference
Xiurui Pan, Endian Li, Qiao Li, Shengwen Liang, Yizhou Shan, Ke Zhou, Yingwei Luo, Xiaolin Wang, Jie Zhang
08 Sep 2024
vAttention: Dynamic Memory Management for Serving LLMs without PagedAttention
Ramya Prabhu, Ajay Nayak, Jayashree Mohan, Ramachandran Ramjee, Ashish Panwar
VLM
07 May 2024