arXiv: 2205.14135 (v2, latest)
FlashAttention: Fast and Memory-Efficient Exact Attention with IO-Awareness
27 May 2022
Tri Dao, Daniel Y. Fu, Stefano Ermon, Atri Rudra, Christopher Ré
VLM

Papers citing "FlashAttention: Fast and Memory-Efficient Exact Attention with IO-Awareness" (50 of 1,508 papers shown)

Lookahead Q-Cache: Achieving More Consistent KV Cache Eviction via Pseudo Query
Yixuan Wang, Shiyu Ji, Yijun Liu, Yuzhuang Xu, Yang Xu, Qingfu Zhu, Wanxiang Che
24 May 2025

Efficient and Workload-Aware LLM Serving via Runtime Layer Swapping and KV Cache Resizing
Zhaoyuan Su, Tingfeng Lan, Zirui Wang, Juncheng Yang, Yue Cheng
24 May 2025

MTGR: Industrial-Scale Generative Recommendation Framework in Meituan
Ruidong Han, Bin Yin, S. Chen, He Jiang, F. Jiang, ..., Yueming Han, M. Zhou, Lei Yu, Chuan Liu, Wei Lin
LRM
24 May 2025

Is Attention Required for Transformer Inference? Explore Function-preserving Attention Replacement
Yuxin Ren, Maxwell D Collins, Miao Hu, Huanrui Yang
24 May 2025

MonarchAttention: Zero-Shot Conversion to Fast, Hardware-Aware Structured Attention
Can Yaras, Alec S. Xu, Pierre Abillama, Changwoo Lee, Laura Balzano
24 May 2025

Scaling Recurrent Neural Networks to a Billion Parameters with Zero-Order Optimization
Francois Chaubard, Mykel J. Kochenderfer
MQ, AI4CE
23 May 2025

BehaveGPT: A Foundation Model for Large-scale User Behavior Modeling
Jiahui Gong, Jingtao Ding, Fanjin Meng, Chen Yang, Hong Chen, Zuojian Wang, Haisheng Lu, Yong Li
23 May 2025

Less Context, Same Performance: A RAG Framework for Resource-Efficient LLM-Based Clinical NLP
Satya Narayana Cheetirala, Ganesh Raut, Dhavalkumar Patel, Fabio Sanatana, Robert Freeman, ..., Omar Dawkins, Reba Miller, Randolph M. Steinhagen, Eyal Klang, Prem Timsina
RALM
23 May 2025

Understanding Differential Transformer Unchains Pretrained Self-Attentions
Chaerin Kong, Jiho Jang, Nojun Kwak
22 May 2025

Training-Free Efficient Video Generation via Dynamic Token Carving
Yuechen Zhang, Jinbo Xing, Bin Xia, Shaoteng Liu, Bohao Peng, Xin Tao, Pengfei Wan, Eric Lo, Jiaya Jia
DiffM, VGen
22 May 2025

CASTILLO: Characterizing Response Length Distributions of Large Language Models
Daniel F. Perez-Ramirez, Dejan Kostic, Magnus Boman
22 May 2025

Efficient Correlation Volume Sampling for Ultra-High-Resolution Optical Flow Estimation
Karlis Martins Briedis, Markus Gross, Christopher Schroers
22 May 2025

Longer Context, Deeper Thinking: Uncovering the Role of Long-Context Ability in Reasoning
Wang Yang, Zirui Liu, Hongye Jin, Qingyu Yin, Vipin Chaudhary, Xiaotian Han
ReLM, LRM
22 May 2025

PaTH Attention: Position Encoding via Accumulating Householder Transformations
Songlin Yang, Yikang Shen, Kaiyue Wen, Shawn Tan, Mayank Mishra, Liliang Ren, Rameswar Panda, Yoon Kim
22 May 2025

MARché: Fast Masked Autoregressive Image Generation with Cache-Aware Attention
Chaoyi Jiang, Sungwoo Kim, Lei Gao, Hossein Entezari Zarch, Won Woo Ro, Murali Annavaram
22 May 2025

SwarmDiff: Swarm Robotic Trajectory Planning in Cluttered Environments via Diffusion Transformer
Kang Ding, Chunxuan Jiao, Yunze Hu, Kangjie Zhou, Pengying Wu, Yao Mu, Chang Liu
21 May 2025

Revealing Language Model Trajectories via Kullback-Leibler Divergence
Ryo Kishino, Yusuke Takase, Momose Oyama, Hiroaki Yamagiwa, Hidetoshi Shimodaira
21 May 2025

SAMA-UNet: Enhancing Medical Image Segmentation with Self-Adaptive Mamba-Like Attention and Causal-Resonance Learning
Saqib Qamar, Mohd Fazil, Parvez Ahmad, Ghulam Muhammad
Mamba
21 May 2025

Streamline Without Sacrifice - Squeeze out Computation Redundancy in LMM
Penghao Wu, Lewei Lu, Ziwei Liu
21 May 2025

MaxPoolBERT: Enhancing BERT Classification via Layer- and Token-Wise Aggregation
Maike Behrendt, Stefan Sylvius Wagner, Stefan Harmeling
SSeg
21 May 2025

After Retrieval, Before Generation: Enhancing the Trustworthiness of Large Language Models in RAG
Xinbang Dai, Huikang Hu, Yuncheng Hua, Jiaqi Li, Yongrui Chen, Rihui Jin, Nan Hu, Guilin Qi
RALM, 3DV
21 May 2025

SUS backprop: linear backpropagation algorithm for long inputs in transformers
Sergey Pankov, Georges Harik
21 May 2025

FLASH-D: FlashAttention with Hidden Softmax Division
K. Alexandridis, Vasileios Titopoulos, G. Dimitrakopoulos
20 May 2025

ModRWKV: Transformer Multimodality in Linear Time
Jiale Kang, Ziyin Yue, Qingyu Yin, Jiang Rui, W. Li, Zening Lu, Zhouran Ji
OffRL
20 May 2025

Low-Cost FlashAttention with Fused Exponential and Multiplication Hardware Operators
K. Alexandridis, Vasileios Titopoulos, G. Dimitrakopoulos
20 May 2025

Balanced and Elastic End-to-end Training of Dynamic LLMs
Mohamed Wahib, Muhammed Abdullah Soyturk, Didem Unat
MoE
20 May 2025

Video Compression Commander: Plug-and-Play Inference Acceleration for Video Large Language Models
Xuyang Liu, Yiyu Wang, Junpeng Ma, Linfeng Zhang
VLM
20 May 2025

ServerlessLoRA: Minimizing Latency and Cost in Serverless Inference for LoRA-Based LLMs
Yifan Sui, Hao Wang, Hanfei Yu, Yitao Hu, Jianxun Li, Hao Wang
20 May 2025

Sense and Sensitivity: Examining the Influence of Semantic Recall on Long Context Code Reasoning
Adam Štorek, Mukur Gupta, Samira Hajizadeh, Prashast Srivastava, Suman Jana
LRM
19 May 2025

An Empirical Study of Many-to-Many Summarization with Large Language Models
Jiaan Wang, Fandong Meng, Zengkui Sun, Yunlong Liang, Yuxuan Cao, Jiarong Xu, Haoxiang Shi, Jie Zhou
19 May 2025

Optimizing Anytime Reasoning via Budget Relative Policy Optimization
Penghui Qi, Zichen Liu, Tianyu Pang, Chao Du, W. Lee, Min Lin
OffRL, LRM
19 May 2025

Bishop: Sparsified Bundling Spiking Transformers on Heterogeneous Cores with Error-Constrained Pruning
Boxun Xu, Yuxuan Yin, Vikram Iyer, Peng Li
MoE
18 May 2025

PSC: Extending Context Window of Large Language Models via Phase Shift Calibration
Wenqiao Zhu, Chao Xu, Lulu Wang, Jun Wu
18 May 2025

Fast RoPE Attention: Combining the Polynomial Method and Fast Fourier Transform
Josh Alman, Zhao Song
17 May 2025

AutoMedEval: Harnessing Language Models for Automatic Medical Capability Evaluation
Xiechi Zhang, Zetian Ouyang, Linlin Wang, Gerard de Melo, Zhu Cao, Xiaoling Wang, Ya Zhang, Yanfeng Wang, Liang He
LM&MA, ELM
17 May 2025

Efficiently Building a Domain-Specific Large Language Model from Scratch: A Case Study of a Classical Chinese Large Language Model
Shen Li, Renfen Hu, Lijun Wang
ALM
17 May 2025

GeoMaNO: Geometric Mamba Neural Operator for Partial Differential Equations
Xi Han, Jingwei Zhang, Dimitris Samaras, Fei Hou, Hong Qin
AI4CE
17 May 2025

METHOD: Modular Efficient Transformer for Health Outcome Discovery
Linglong Qian, Zina M. Ibrahim
16 May 2025

Accurate KV Cache Quantization with Outlier Tokens Tracing
Yi Su, Yuechi Zhou, Quantong Qiu, Jilong Li, Qingrong Xia, Ping Li, Xinyu Duan, Zhefeng Wang, Min Zhang
MQ
16 May 2025

Flash Invariant Point Attention
Andrew Liu, Axel Elaldi, Nicholas T Franklin, Nathan Russell, Gurinder S Atwal, Yih-En A Ban, Olivia Viessmann
16 May 2025

Efficient Attention via Pre-Scoring: Prioritizing Informative Keys in Transformers
Zhexiang Li, Haoyu Wang, Yutong Bao, David Woodruff
16 May 2025

MathCoder-VL: Bridging Vision and Code for Enhanced Multimodal Mathematical Reasoning
Ke Wang, Junting Pan, Linda Wei, Aojun Zhou, Weikang Shi, ..., Han Xiao, Yiran Yang, Houxing Ren, Mingjie Zhan, Hongsheng Li
15 May 2025

Parallel Scaling Law for Language Models
Mouxiang Chen, Binyuan Hui, Zeyu Cui, Jiaxi Yang, Dayiheng Liu, Jianling Sun, Junyang Lin, Zhongxin Liu
MoE, LRM
15 May 2025

Insights into DeepSeek-V3: Scaling Challenges and Reflections on Hardware for AI Architectures
Chenggang Zhao, Chengqi Deng, Chong Ruan, Damai Dai, Huazuo Gao, ..., Wenfeng Liang, Ying He, Yun Wang, Yuxuan Liu, Y. X. Wei
MoE
14 May 2025

Aquarius: A Family of Industry-Level Video Generation Models for Marketing Scenarios
Huafeng Shi, Jianzhong Liang, Rongchang Xie, Xian Wu, Cheng Chen, Chang Liu
VGen
14 May 2025

Scaling Context, Not Parameters: Training a Compact 7B Language Model for Efficient Long-Context Processing
Chen Wu, Yin Song
MoE, LRM
13 May 2025

FlashMLA-ETAP: Efficient Transpose Attention Pipeline for Accelerating MLA Inference on NVIDIA H20 GPUs
Pencuo Zeren, Qiuming Luo, Rui Mao, Chang Kong
13 May 2025

Fused3S: Fast Sparse Attention on Tensor Cores
Zitong Li, Aparna Chandramowlishwaran
GNN
12 May 2025

OLinear: A Linear Model for Time Series Forecasting in Orthogonally Transformed Domain
Wenzhen Yue, Yang Liu, Haoxuan Li, Hao Wang, Xianghua Ying, Ruohao Guo, Bowei Xing, Ji Shi
AI4TS, OOD
12 May 2025

Putting It All into Context: Simplifying Agents with LCLMs
Mingjian Jiang, Yangjun Ruan, Luis A. Lastras, Pavan Kapanipathi, Tatsunori Hashimoto
LLMAG
12 May 2025