FlashInfer: Efficient and Customizable Attention Engine for LLM Inference Serving
2 January 2025
Zihao Ye, Lequn Chen, Ruihang Lai, Wuwei Lin, Yineng Zhang, Stephanie Wang, Tianqi Chen, Baris Kasikci, Vinod Grover, Arvind Krishnamurthy, Luis Ceze
arXiv: 2501.01005

Papers citing "FlashInfer: Efficient and Customizable Attention Engine for LLM Inference Serving"

50 / 59 papers shown
RetroInfer: A Vector-Storage Approach for Scalable Long-Context LLM Inference
Yushen Chen, Jiawei Zhang, Baotong Lu, Qianxi Zhang, Chengruidong Zhang, ..., Chen Chen, Mingxing Zhang, Yuqing Yang, Fan Yang, Mao Yang
01 Jul 2025

On-the-Fly Adaptive Distillation of Transformer to Dual-State Linear Attention
Yeonju Ro, Zhenyu Zhang, Souvik Kundu, Zhangyang Wang, Aditya Akella
11 Jun 2025

Kinetics: Rethinking Test-Time Scaling Laws
Ranajoy Sadhukhan, Zhuoming Chen, Haizhong Zheng, Yang Zhou, Emma Strubell, Beidi Chen
05 Jun 2025

Chipmunk: Training-Free Acceleration of Diffusion Transformers with Dynamic Column-Sparse Deltas
Austin Silveria, Soham V. Govande, Daniel Y. Fu
03 Jun 2025

HATA: Trainable and Hardware-Efficient Hash-Aware Top-k Attention for Scalable Large Model Inference
Ping Gong, Jiawei Yi, Shengnan Wang, Juncheng Zhang, Zewen Jin, ..., Tong Yang, Gong Zhang, Renhai Chen, Feng Wu, Cheng Li
03 Jun 2025

AReaL: A Large-Scale Asynchronous Reinforcement Learning System for Language Reasoning
Wei Fu, Jiaxuan Gao, Xujie Shen, Chen Zhu, Zhiyu Mei, ..., Jun Mei, Jiashu Wang, Tongkai Yang, Binhang Yuan, Yi Wu
Tags: OffRL · SyDa · LRM
30 May 2025

Mustafar: Promoting Unstructured Sparsity for KV Cache Pruning in LLM Inference
Donghyeon Joo, Helya Hosseini, Ramyad Hadidi, Bahar Asgari
28 May 2025

FlashMLA-ETAP: Efficient Transpose Attention Pipeline for Accelerating MLA Inference on NVIDIA H20 GPUs
Pencuo Zeren, Qiuming Luo, Rui Mao, Chang Kong
13 May 2025

GPU Performance Portability needs Autotuning
Burkhard Ringlein, Thomas Parnell, Radu Stoica
30 Apr 2025

Efficient LLM Serving on Hybrid Real-time and Best-effort Requests
Borui Wan, Juntao Zhao, Chenyu Jiang, Chuanxiong Guo, Chuan Wu
Tags: VLM
13 Apr 2025

Throughput-Optimal Scheduling Algorithms for LLM Inference and AI Agents
Yueying Li, Jim Dai, Tianyi Peng
10 Apr 2025

BitDecoding: Unlocking Tensor Cores for Long-Context LLMs Decoding with Low-Bit KV Cache
Dayou Du, Shijie Cao, Jianyi Cheng, Ting Cao, M. Yang
Tags: MQ
24 Mar 2025

PERCY: Personal Emotional Robotic Conversational System
Zhijin Meng, Mohammed Althubyani, Shengyuan Xie, Imran Razzak, Eduardo Benitez Sandoval, Mahdi Bamdad, Francisco Cruz
04 Mar 2025

Alchemist: Towards the Design of Efficient Online Continual Learning System
Yuyang Huang, Yuhan Liu, Haryadi S. Gunawi, Beibin Li, Changho Hwang
Tags: CLL · OnRL
03 Mar 2025

Tactic: Adaptive Sparse Attention with Clustering and Distribution Fitting for Long-Context LLMs
Kan Zhu, Tian Tang, Qinyu Xu, Yile Gu, Zhichen Zeng, Rohan Kadekodi, Liangyu Zhao, Ang Li, Arvind Krishnamurthy, Baris Kasikci
17 Feb 2025

DiSCo: Device-Server Collaborative LLM-Based Text Streaming Services
Ting Sun, Penghan Wang, Fan Lai
17 Feb 2025

Twilight: Adaptive Attention Sparsity with Hierarchical Top-$p$ Pruning
C. Lin, Jiaming Tang, Shuo Yang, Hanshuo Wang, Tian Tang, Boyu Tian, Ion Stoica, Enze Xie, Mingyu Gao
04 Feb 2025

Adaptive Self-improvement LLM Agentic System for ML Library Development
Genghan Zhang, Weixin Liang, Olivia Hsu, K. Olukotun
04 Feb 2025

HyGen: Efficient LLM Serving via Elastic Online-Offline Request Co-location
Ting Sun, Penghan Wang, Fan Lai
15 Jan 2025

POD-Attention: Unlocking Full Prefill-Decode Overlap for Faster LLM Inference
Aditya K Kamath, Ramya Prabhu, Jayashree Mohan, Simon Peter, Ramachandran Ramjee, Ashish Panwar
23 Oct 2024

FlashMask: Efficient and Rich Mask Extension of FlashAttention
Guoxia Wang, Jinle Zeng, Xiyuan Xiao, Siming Wu, Jiabin Yang, Lujing Zheng, Zeyu Chen, Jiang Bian, Dianhai Yu, Haifeng Wang
02 Oct 2024

NanoFlow: Towards Optimal Large Language Model Serving Throughput
Kan Zhu, Yilong Zhao, Liangyu Zhao, Gefei Zuo, Yile Gu, ..., Keisuke Kamahori, Chien-Yu Lin, Stephanie Wang, Arvind Krishnamurthy, Baris Kasikci
22 Aug 2024

FlashAttention-3: Fast and Accurate Attention with Asynchrony and Low-precision
Jay Shah, Ganesh Bikshandi, Ying Zhang, Vijay Thakkar, Pradeep Ramani, Tri Dao
11 Jul 2024

Quest: Query-Aware Sparsity for Efficient Long-Context LLM Inference
Jiaming Tang, Yilong Zhao, Kan Zhu, Guangxuan Xiao, Baris Kasikci, Song Han
16 Jun 2024

Parrot: Efficient Serving of LLM-based Applications with Semantic Variable
Chaofan Lin, Zhenhua Han, Chengruidong Zhang, Yuqing Yang, Fan Yang, Chen Chen, Lili Qiu
30 May 2024

Lean Attention: Hardware-Aware Scalable Attention Mechanism for the Decode-Phase of Transformers
Rya Sanovar, Srikant Bharadwaj, Renée St. Amant, Victor Rühle, Saravan Rajmohan
17 May 2024

vAttention: Dynamic Memory Management for Serving LLMs without PagedAttention
Ramya Prabhu, Ajay Nayak, Jayashree Mohan, Ramachandran Ramjee, Ashish Panwar
Tags: VLM
07 May 2024

Taming Throughput-Latency Tradeoff in LLM Inference with Sarathi-Serve
Amey Agrawal, Nitin Kedia, Ashish Panwar, Jayashree Mohan, Nipun Kwatra, Bhargav S. Gulavani, Alexey Tumanov, Ramachandran Ramjee
04 Mar 2024

ChunkAttention: Efficient Self-Attention with Prefix-Aware KV Cache and Two-Phase Partition
Lu Ye, Ze Tao, Yong Huang, Yang Li
23 Feb 2024

RelayAttention for Efficient Large Language Model Serving with Long System Prompts
Lei Zhu, Xinjiang Wang, Wayne Zhang, Rynson W. H. Lau
22 Feb 2024

Sequoia: Scalable, Robust, and Hardware-aware Speculative Decoding
Zhuoming Chen, Avner May, Ruslan Svirschevski, Yuhsun Huang, Max Ryabinin, Zhihao Jia, Beidi Chen
19 Feb 2024

Fiddler: CPU-GPU Orchestration for Fast Inference of Mixture-of-Experts Models
Keisuke Kamahori, Tian Tang, Yile Gu, Kan Zhu, Baris Kasikci
10 Feb 2024

Hydragen: High-Throughput LLM Inference with Shared Prefixes
Jordan Juravsky, Bradley Brown, Ryan Ehrlich, Daniel Y. Fu, Christopher Ré, Azalia Mirhoseini
07 Feb 2024

Medusa: Simple LLM Inference Acceleration Framework with Multiple Decoding Heads
Tianle Cai, Yuhong Li, Zhengyang Geng, Hongwu Peng, Jason D. Lee, Deming Chen, Tri Dao
19 Jan 2024

Gated Linear Attention Transformers with Hardware-Efficient Training
Songlin Yang, Bailin Wang, Yikang Shen, Rameswar Panda, Yoon Kim
11 Dec 2023

Relax: Composable Abstractions for End-to-End Dynamic Machine Learning
Ruihang Lai, Junru Shao, Siyuan Feng, Steven Lyubomirsky, Bohan Hou, ..., Sunghyun Park, Prakalp Srivastava, Jared Roesch, T. Mowry, Tianqi Chen
01 Nov 2023

Ring Attention with Blockwise Transformers for Near-Infinite Context
Hao Liu, Matei A. Zaharia, Pieter Abbeel
03 Oct 2023

Efficient Streaming Language Models with Attention Sinks
Guangxuan Xiao, Yuandong Tian, Beidi Chen, Song Han, Mike Lewis
Tags: AI4TS · RALM
29 Sep 2023

Efficient Memory Management for Large Language Model Serving with PagedAttention
Woosuk Kwon, Zhuohan Li, Siyuan Zhuang, Ying Sheng, Lianmin Zheng, Cody Hao Yu, Joseph E. Gonzalez, Hao Zhang, Ion Stoica
Tags: VLM
12 Sep 2023

FlashAttention-2: Faster Attention with Better Parallelism and Work Partitioning
Tri Dao
Tags: LRM
17 Jul 2023

Judging LLM-as-a-Judge with MT-Bench and Chatbot Arena
Lianmin Zheng, Wei-Lin Chiang, Ying Sheng, Siyuan Zhuang, Zhanghao Wu, ..., Dacheng Li, Eric Xing, Hao Zhang, Joseph E. Gonzalez, Ion Stoica
Tags: ALM · OSLM · ELM
09 Jun 2023

SpecInfer: Accelerating Generative Large Language Model Serving with Tree-based Speculative Inference and Verification
Xupeng Miao, Gabriele Oliaro, Zhihao Zhang, Xinhao Cheng, Zeyu Wang, ..., Chunan Shi, Zhuoming Chen, Daiyaan Arfeen, Reyna Abhyankar, Zhihao Jia
Tags: LRM
16 May 2023

Stream-K: Work-centric Parallel Decomposition for Dense Matrix-Matrix Multiplication on the GPU
Muhammad Osama, D. Merrill, C. Cecka, M. Garland, John Douglas Owens
09 Jan 2023

ByteTransformer: A High-Performance Transformer Boosted for Variable-Length Inputs
Yujia Zhai, Chengquan Jiang, Leyuan Wang, Xiaoying Jia, Shang Zhang, Zizhong Chen, Xin Liu, Yibo Zhu
06 Oct 2022

Efficient Quantized Sparse Matrix Operations on Tensor Cores
Shigang Li, Kazuki Osawa, Torsten Hoefler
14 Sep 2022

FP8 Formats for Deep Learning
Paulius Micikevicius, Dusan Stosic, N. Burgess, Marius Cornea, Pradeep Dubey, ..., Naveen Mellempudi, S. Oberman, Mohammad Shoeybi, Michael Siu, Hao Wu
Tags: BDL · VLM · MQ
12 Sep 2022

SparseTIR: Composable Abstractions for Sparse Compilation in Deep Learning
Zihao Ye, Ruihang Lai, Junru Shao, Tianqi Chen, Luis Ceze
11 Jul 2022

FlashAttention: Fast and Memory-Efficient Exact Attention with IO-Awareness
Tri Dao, Daniel Y. Fu, Stefano Ermon, Atri Rudra, Christopher Ré
Tags: VLM
27 May 2022

TC-GNN: Bridging Sparse GNN Computation and Dense Tensor Cores on GPUs
Yuke Wang, Boyuan Feng, Zheng Wang, Guyue Huang, Yufei Ding
Tags: GNN
03 Dec 2021

Sequential Aggregation and Rematerialization: Distributed Full-batch Training of Graph Neural Networks on Large Graphs
Hesham Mostafa
Tags: GNN
11 Nov 2021