arXiv:2501.01005
FlashInfer: Efficient and Customizable Attention Engine for LLM Inference Serving
2 January 2025
Zihao Ye, Lequn Chen, Ruihang Lai, Wuwei Lin, Yineng Zhang, Stephanie Wang, Tianqi Chen, Baris Kasikci, Vinod Grover, Arvind Krishnamurthy, Luis Ceze
Papers citing "FlashInfer: Efficient and Customizable Attention Engine for LLM Inference Serving" (9 of 59 papers shown):
1. Understanding GNN Computational Graph: A Coordinated Computation, IO, and Memory Perspective. Hengrui Zhang, Zhongming Yu, Guohao Dai, Guyue Huang, Yufei Ding, Yuan Xie, Yu Wang. [GNN] 18 Oct 2021. 48 citations.
2. Train Short, Test Long: Attention with Linear Biases Enables Input Length Extrapolation. Ofir Press, Noah A. Smith, M. Lewis. 27 Aug 2021. 780 citations.
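The ALiBi paper above replaces position embeddings with a per-head linear distance penalty added to attention scores. A minimal NumPy sketch of that bias matrix (illustrative only; the function name `alibi_bias` is my own, and in practice the bias is combined with a causal mask):

```python
import numpy as np

def alibi_bias(n_heads, seq_len):
    """ALiBi: penalize attention scores by -m_h * (i - j) for query
    position i and key position j. Head slopes form the geometric
    sequence 2^(-8h/n_heads), h = 1..n_heads (for power-of-two head
    counts, per the paper)."""
    slopes = 2.0 ** (-8.0 * np.arange(1, n_heads + 1) / n_heads)
    i = np.arange(seq_len)[:, None]
    j = np.arange(seq_len)[None, :]
    dist = (i - j).astype(float)            # query-to-key distance
    return -slopes[:, None, None] * dist[None]  # (heads, seq, seq)
```

The bias is zero on the diagonal and grows more negative with distance, so farther keys are down-weighted without any learned parameters.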
3. RoFormer: Enhanced Transformer with Rotary Position Embedding. Jianlin Su, Yu Lu, Shengfeng Pan, Ahmed Murtadha, Bo Wen, Yunfeng Liu. 20 Apr 2021. 2,572 citations.
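RoFormer's rotary position embedding (RoPE) encodes position by rotating pairs of query/key channels by position-dependent angles. A minimal NumPy sketch of the idea (the half-split pairing convention shown here is one of several in use; function name `rope` is my own):

```python
import numpy as np

def rope(x, positions, base=10000.0):
    """Apply rotary position embedding to x of shape (seq, dim), dim even.
    Channel pair (i, i + dim/2) is rotated by angle pos * base^(-2i/dim)."""
    seq, dim = x.shape
    half = dim // 2
    freqs = base ** (-np.arange(half) * 2.0 / dim)          # (half,)
    angles = np.asarray(positions)[:, None] * freqs[None]   # (seq, half)
    cos, sin = np.cos(angles), np.sin(angles)
    x1, x2 = x[:, :half], x[:, half:]
    return np.concatenate([x1 * cos - x2 * sin,
                           x1 * sin + x2 * cos], axis=-1)
```

Because each step is a pure rotation, vector norms are preserved, and dot products between rotated queries and keys depend only on relative position.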
4. FusedMM: A Unified SDDMM-SpMM Kernel for Graph Embedding and Graph Neural Networks. Md. Khaledur Rahman, Majedul Haque Sujon, A. Azad. [FedML, GNN] 07 Nov 2020. 51 citations.
5. Longformer: The Long-Document Transformer. Iz Beltagy, Matthew E. Peters, Arman Cohan. [RALM, VLM] 10 Apr 2020. 4,120 citations.
6. Fast Transformer Decoding: One Write-Head is All You Need. Noam M. Shazeer. 06 Nov 2019. 493 citations.
7. Online normalizer calculation for softmax. Maxim Milakov, N. Gimelshein. 08 May 2018. 95 citations.
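The online softmax paper above computes the softmax normalizer in a single pass by tracking a running maximum and rescaling the running sum whenever the maximum changes; this trick underlies tiled attention kernels such as those in FlashInfer. A minimal NumPy sketch (function name `online_softmax` is my own):

```python
import numpy as np

def online_softmax(x):
    """One-pass softmax normalizer: maintain the running max m and running
    sum d; when m grows, rescale d by exp(old_max - new_max) so previously
    accumulated terms stay correctly normalized."""
    m = float("-inf")  # running maximum
    d = 0.0            # running normalizer sum
    for xi in x:
        m_new = max(m, xi)
        d = d * np.exp(m - m_new) + np.exp(xi - m_new)
        m = m_new
    return np.exp(np.asarray(x) - m) / d
```

The result matches the standard two-pass (max, then sum) softmax, but the statistics can be accumulated block by block, which is what makes the trick useful inside fused attention kernels.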
8. Block-Sparse Recurrent Neural Networks. Sharan Narang, Eric Undersander, G. Diamos. 08 Nov 2017. 139 citations.
9. Attention Is All You Need. Ashish Vaswani, Noam M. Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan Gomez, Lukasz Kaiser, Illia Polosukhin. [3DV] 12 Jun 2017. 133,803 citations.
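The Transformer paper above introduced scaled dot-product attention, the operation all of the kernels in this list (and FlashInfer itself) accelerate. A minimal single-head NumPy sketch (illustrative; real kernels fuse and tile this computation):

```python
import numpy as np

def attention(q, k, v):
    """Scaled dot-product attention: softmax(Q K^T / sqrt(d)) V."""
    d = q.shape[-1]
    scores = q @ k.T / np.sqrt(d)
    scores -= scores.max(axis=-1, keepdims=True)  # numerical stability
    w = np.exp(scores)
    w /= w.sum(axis=-1, keepdims=True)            # row-wise softmax
    return w @ v
```

Each query row yields a probability distribution over keys, and the output is the corresponding weighted average of the value rows.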