NanoFlow: Towards Optimal Large Language Model Serving Throughput
arXiv:2408.12757 · 22 August 2024
Kan Zhu, Yilong Zhao, Liangyu Zhao, Gefei Zuo, Yile Gu, Dedong Xie, Yufei Gao, Qinyu Xu, Tian Tang, Zihao Ye, Keisuke Kamahori, Chien-Yu Lin, Stephanie Wang, Arvind Krishnamurthy, Baris Kasikci
Papers citing "NanoFlow: Towards Optimal Large Language Model Serving Throughput" (8 papers)
Ascendra: Dynamic Request Prioritization for Efficient LLM Serving
Azam Ikram, Xiang Li, Sameh Elnikety, S. Bagchi · 29 Apr 2025
GUIDE: A Global Unified Inference Engine for Deploying Large Language Models in Heterogeneous Environments
Yanyu Chen, Ganhong Huang · 28 Jan 2025
HyGen: Efficient LLM Serving via Elastic Online-Offline Request Co-location
Ting Sun, Penghan Wang, Fan Lai · 15 Jan 2025
FlashInfer: Efficient and Customizable Attention Engine for LLM Inference Serving
Zihao Ye, Lequn Chen, Ruihang Lai, Wuwei Lin, Yineng Zhang, ..., Tianqi Chen, Baris Kasikci, Vinod Grover, Arvind Krishnamurthy, Luis Ceze · 02 Jan 2025
EPS-MoE: Expert Pipeline Scheduler for Cost-Efficient MoE Inference
Yulei Qian, Fengcun Li, Xiangyang Ji, Xiaoyu Zhao, Jianchao Tan, Kaipeng Zhang, Xunliang Cai · MoE · 16 Oct 2024
vAttention: Dynamic Memory Management for Serving LLMs without PagedAttention
Ramya Prabhu, Ajay Nayak, Jayashree Mohan, Ramachandran Ramjee, Ashish Panwar · VLM · 07 May 2024
Fiddler: CPU-GPU Orchestration for Fast Inference of Mixture-of-Experts Models
Keisuke Kamahori, Tian Tang, Yile Gu, Kan Zhu, Baris Kasikci · 10 Feb 2024
Megatron-LM: Training Multi-Billion Parameter Language Models Using Model Parallelism
M. Shoeybi, M. Patwary, Raul Puri, P. LeGresley, Jared Casper, Bryan Catanzaro · MoE · 17 Sep 2019