Splitwise: Efficient generative LLM inference using phase splitting
Pratyush Patel, Esha Choukse, Chaojie Zhang, Aashaka Shah, Íñigo Goiri, Saeed Maleki, Ricardo Bianchini
arXiv:2311.18677 · 30 November 2023
Papers citing "Splitwise: Efficient generative LLM inference using phase splitting" (50 of 111 shown)

How Hungry is AI? Benchmarking Energy, Water, and Carbon Footprint of LLM Inference
Nidhal Jegham, Marwen Abdelatti, Lassad Elmoubarki, Abdeltawab Hendawi · 14 May 2025

PipeSpec: Breaking Stage Dependencies in Hierarchical LLM Decoding [MoE]
Bradley McDanel, S. Zhang, Y. Hu, Zining Liu · 02 May 2025

Ascendra: Dynamic Request Prioritization for Efficient LLM Serving
Azam Ikram, Xiang Li, Sameh Elnikety, S. Bagchi · 29 Apr 2025

Taming the Titans: A Survey of Efficient LLM Inference Serving [LLMAG]
Ranran Zhen, J. Li, Yixin Ji, Zhengyuan Yang, Tong Liu, Qingrong Xia, Xinyu Duan, Zehao Wang, Baoxing Huai, Hao Fei · 28 Apr 2025

GenTorrent: Scaling Large Language Model Serving with An Overlay Network
Fei Fang, Yifan Hua, Shengze Wang, Ruilin Zhou, Y. Liu, Chen Qian, Xuzhi Zhang · 27 Apr 2025

Tempo: Application-aware LLM Serving with Mixed SLO Requirements
Wei Zhang, Zhiyu Wu, Yi Mu, Banruo Liu, Myungjin Lee, Fan Lai · 24 Apr 2025

StreamRL: Scalable, Heterogeneous, and Elastic RL for LLMs with Disaggregated Stream Generation [OffRL, AI4TS]
Yinmin Zhong, Zili Zhang, Xiaoniu Song, Hanpeng Hu, Chao Jin, ..., Changyi Wan, Hongyu Zhou, Yimin Jiang, Yibo Zhu, Daxin Jiang · 22 Apr 2025

Splitwiser: Efficient LM inference with constrained resources
Asad Aali, Adney Cardoza, Melissa Capo · 21 Apr 2025

Optimizing LLM Inference: Fluid-Guided Online Scheduling with Memory Constraints
Ruicheng Ao, Gan Luo, D. Simchi-Levi, Xinshang Wang · 15 Apr 2025

Understanding and Optimizing Multi-Stage AI Inference Pipelines
Abhimanyu Bambhaniya, Hanjiang Wu, Suvinay Subramanian, Sudarshan Srinivasan, Souvik Kundu, Amir Yazdanbakhsh, Madhu Kumar, Tushar Krishna · 14 Apr 2025

HELIOS: Adaptive Model And Early-Exit Selection for Efficient LLM Inference Serving
Avinash Kumar, Shashank Nag, Jason Clemons, L. John, Poulami Das · 14 Apr 2025

MoE-Lens: Towards the Hardware Limit of High-Throughput MoE LLM Serving Under Resource Constraints [MoE]
Yichao Yuan, Lin Ma, Nishil Talati · 12 Apr 2025

MSCCL++: Rethinking GPU Communication Abstractions for Cutting-edge AI Applications [GNN]
Aashaka Shah, Abhinav Jangda, Yangqiu Song, Caio Rocha, Changho Hwang, ..., Peng Cheng, Qinghua Zhou, Roshan Dathathri, Saeed Maleki, Ziyue Yang · 11 Apr 2025

Apt-Serve: Adaptive Request Scheduling on Hybrid Cache for Scalable LLM Inference Serving
Shihong Gao, Xuzhi Zhang, Yanyan Shen, Lei Chen · 10 Apr 2025

Throughput-Optimal Scheduling Algorithms for LLM Inference and AI Agents
Yueying Li, Jim Dai, Tianyi Peng · 10 Apr 2025

Accelerating LLM Inference Throughput via Asynchronous KV Cache Prefetching
Yanhao Dong, Yubo Miao, Weinan Li, Xiao Zheng, Chao Wang, Feng Lyu · 08 Apr 2025

SLOs-Serve: Optimized Serving of Multi-SLO LLMs
Siyuan Chen, Zhipeng Jia, S. Khan, Arvind Krishnamurthy, Phillip B. Gibbons · 05 Apr 2025

FlowKV: A Disaggregated Inference Framework with Low-Latency KV Cache Transfer and Load-Aware Scheduling
Weiqing Li, Guochao Jiang, Xiangyong Ding, Zhangcheng Tao, Chuzhan Hao, Chenfeng Xu, Yuewei Zhang, Hao Wang · 03 Apr 2025

A Survey of Efficient Reasoning for Large Reasoning Models: Language, Multimodality, and Beyond [ReLM, OffRL, LRM]
Xiaoye Qu, Yafu Li, Zhaochen Su, Weigao Sun, Jianhao Yan, ..., Chaochao Lu, Yue Zhang, Xian-Sheng Hua, Bowen Zhou, Yu Cheng · 27 Mar 2025

Injecting Adrenaline into LLM Serving: Boosting Resource Utilization and Throughput via Attention Disaggregation
Yunkai Liang, Zhangyu Chen, Pengfei Zuo, Zhi Zhou, Xu Chen, Zhou Yu · 26 Mar 2025

SplitFrozen: Split Learning with Device-side Model Frozen for Fine-Tuning LLM on Heterogeneous Resource-Constrained Devices
Jian Ma, Xinchen Lyu, Jun Jiang, Qimei Cui, Haipeng Yao, Xiaofeng Tao · 23 Mar 2025

RAGO: Systematic Performance Optimization for Retrieval-Augmented Generation Serving
Wenqi Jiang, Suvinay Subramanian, Cat Graves, Gustavo Alonso, Amir Yazdanbakhsh, Vidushi Dadu · 18 Mar 2025

AccelGen: Heterogeneous SLO-Guaranteed High-Throughput LLM Inference Serving for Diverse Applications
Haiying Shen, Tanmoy Sen · 17 Mar 2025

MoE-Gen: High-Throughput MoE Inference on a Single GPU with Module-Based Batching [MoE]
Tairan Xu, Leyang Xue, Zhan Lu, Adrian Jackson, Luo Mai · 12 Mar 2025

Sometimes Painful but Certainly Promising: Feasibility and Trade-offs of Language Model Inference at the Edge
Maximilian Abstreiter, Sasu Tarkoma, Roberto Morabito · 12 Mar 2025

Queueing, Predictions, and LLMs: Challenges and Open Problems [AI4TS, LRM]
Michael Mitzenmacher, Rana Shahout · 10 Mar 2025

Seesaw: High-throughput LLM Inference via Model Re-sharding [LRM]
Qidong Su, Wei Zhao, Xuelong Li, Muralidhar Andoorveedu, Chenhao Jiang, Zhanda Zhu, Kevin Song, Christina Giannoula, Gennady Pekhimenko · 09 Mar 2025

Green Prompting
Marta Adamska, Daria Smirnova, Hamid Nasiri, Zhengxin Yu, Peter Garraghan · 09 Mar 2025

Beyond Decoder-only: Large Language Models Can be Good Encoders for Machine Translation [AI4CE]
Yingfeng Luo, Tong Zheng, Yongyu Mu, Yangqiu Song, Qinghong Zhang, ..., Ziqiang Xu, Peinan Feng, Xiaoqian Liu, Tong Xiao, Jingbo Zhu · 09 Mar 2025

SpecServe: Efficient and SLO-Aware Large Language Model Serving with Adaptive Speculative Decoding [LRM]
Kaiyu Huang, Yu Wang, Zhubo Shi, Han Zou, Minchen Yu, Qingjiang Shi · 07 Mar 2025

SRAG: Structured Retrieval-Augmented Generation for Multi-Entity Question Answering over Wikipedia Graph [RALM, 3DV]
Teng Lin, Yizhang Zhu, Yuyu Luo, Nan Tang · 03 Mar 2025

Alchemist: Towards the Design of Efficient Online Continual Learning System [CLL, OnRL]
Yuyang Huang, Yuhan Liu, Haryadi S. Gunawi, Beibin Li, Changho Hwang · 03 Mar 2025

Progressive Sparse Attention: Algorithm and System Co-design for Efficient Attention in LLM Serving [CLL]
Qihui Zhou, Peiqi Yin, Pengfei Zuo, James Cheng · 01 Mar 2025

MEBench: Benchmarking Large Language Models for Cross-Document Multi-Entity Question Answering [RALM]
Teng Lin · 26 Feb 2025

PAPI: Exploiting Dynamic Parallelism in Large Language Model Decoding with a Processing-In-Memory-Enabled Computing System
Yintao He, Haiyu Mao, Christina Giannoula, Mohammad Sadrosadati, Juan Gómez Luna, Huawei Li, Xiaowei Li, Ying Wang, O. Mutlu · 21 Feb 2025

Hybrid Offline-online Scheduling Method for Large Language Model Inference Optimization [OffRL]
Bowen Pang, Kai Li, Ruifeng She, Feifan Wang · 14 Feb 2025

fMoE: Fine-Grained Expert Offloading for Large Mixture-of-Experts Serving [MoE]
Hanfei Yu, Xingqi Cui, Huan Zhang, Han Wang, Hao Wang · 07 Feb 2025

KVDirect: Distributed Disaggregated LLM Inference
Shiyang Chen, Rain Jiang, Dezhi Yu, Jinlai Xu, Mengyuan Chao, Fanlong Meng, Chenyu Jiang, Wei Xu, Hang Liu · 28 Jan 2025

AdaServe: Accelerating Multi-SLO LLM Serving with SLO-Customized Speculative Decoding
Zikun Li, Zhuofu Chen, Remi Delacourt, Gabriele Oliaro, Zeyu Wang, ..., Zhuoming Chen, Sean Lai, Xinhao Cheng, Xupeng Miao, Zhihao Jia · 21 Jan 2025

HyGen: Efficient LLM Serving via Elastic Online-Offline Request Co-location
Ting Sun, Penghan Wang, Fan Lai · 15 Jan 2025

iServe: An Intent-based Serving System for LLMs [VLM]
Dimitrios Liakopoulos, Tianrui Hu, Prasoon Sinha, N. Yadwadkar · 08 Jan 2025

TAPAS: Thermal- and Power-Aware Scheduling for LLM Inference in Cloud Platforms
Jovan Stojkovic, Chaojie Zhang, Íñigo Goiri, Esha Choukse, Haoran Qiu, Rodrigo Fonseca, Josep Torrellas, Ricardo Bianchini · 05 Jan 2025

RetrievalAttention: Accelerating Long-Context LLM Inference via Vector Retrieval
Di Liu, Meng Chen, Baotong Lu, Huiqiang Jiang, Zhenhua Han, ..., Kaipeng Zhang, Chong Chen, Fan Yang, Yuqing Yang, Lili Qiu · 03 Jan 2025

Efficiently Serving LLM Reasoning Programs with Certaindex [LRM]
Yichao Fu, Junda Chen, Siqi Zhu, Zheyu Fu, Zhongdongming Dai, Aurick Qiao, Hao Zhang · 31 Dec 2024

LoL-PIM: Long-Context LLM Decoding with Scalable DRAM-PIM System
Hyucksung Kwon, Kyungmo Koo, Janghyeon Kim, W. Lee, Minjae Lee, ..., Yongkee Kwon, Ilkon Kim, Euicheol Lim, John Kim, Jungwook Choi · 28 Dec 2024

SYMPHONY: Improving Memory Management for LLM Inference Workloads [LLMAG]
Saurabh Agarwal, Anyong Mao, Aditya Akella, Shivaram Venkataraman · 21 Dec 2024

Deploying Foundation Model Powered Agent Services: A Survey [AI4CE]
Wenchao Xu, Jinyu Chen, Peirong Zheng, Xiaoquan Yi, Tianyi Tian, ..., Quan Wan, Yining Qi, Yunfeng Fan, Qinliang Su, Xuemin Shen · 18 Dec 2024

Accelerating Retrieval-Augmented Generation [RALM, 3DV]
Derrick Quinn, Mohammad Nouri, Neel Patel, John Salihu, Alireza Salemi, Sukhan Lee, Hamed Zamani, Mohammad Alian · 14 Dec 2024

ClusterKV: Manipulating LLM KV Cache in Semantic Space for Recallable Compression
Guangda Liu, Chong Li, Jieru Zhao, Chenqi Zhang, M. Guo · 04 Dec 2024

BlendServe: Optimizing Offline Inference for Auto-regressive Large Models with Resource-aware Batching
Yilong Zhao, Shuo Yang, Kan Zhu, Lianmin Zheng, Baris Kasikci, Yang Zhou, Jiarong Xing, Ion Stoica · 25 Nov 2024