© 2025 ResearchTrend.AI, All rights reserved.

FlashAttention: Fast and Memory-Efficient Exact Attention with IO-Awareness

27 May 2022 · Tri Dao, Daniel Y. Fu, Stefano Ermon, Atri Rudra, Christopher Ré (VLM)
arXiv (abs) · PDF · HTML

Papers citing "FlashAttention: Fast and Memory-Efficient Exact Attention with IO-Awareness"

Showing 50 of 1,508 citing papers.
QoS-Efficient Serving of Multiple Mixture-of-Expert LLMs Using Partial Runtime Reconfiguration (MoE)
HamidReza Imani, Jiaxin Peng, Peiman Mohseni, Abdolah Amirany, Tarek A. El-Ghazawi · 10 May 2025

Attention Is Not All You Need: The Importance of Feedforward Networks in Transformer Models
Isaac Gerber · 10 May 2025

LightNobel: Improving Sequence Length Limitation in Protein Structure Prediction Model via Adaptive Activation Quantization
Seunghee Han, S. Choi, Joo-Young Kim · 09 May 2025

Challenging GPU Dominance: When CPUs Outperform for On-Device LLM Inference
Haolin Zhang, Jeff Huang · 09 May 2025

CAD-Llama: Leveraging Large Language Models for Computer-Aided Design Parametric 3D Model Generation
Jiahao Li, Weijian Ma, Xueyang Li, Yunzhong Lou, G. Zhou, Xiangdong Zhou · 07 May 2025

Advancing Zero-shot Text-to-Speech Intelligibility across Diverse Domains via Preference Alignment
Xueyao Zhang, Yijiao Wang, Chaoren Wang, Zehan Li, Zhuo Chen, Zhizheng Wu · 07 May 2025

MFSeg: Efficient Multi-frame 3D Semantic Segmentation (3DPC)
Chengjie Huang, Krzysztof Czarnecki · 07 May 2025

SCFormer: Structured Channel-wise Transformer with Cumulative Historical State for Multivariate Time Series Forecasting (AI4TS)
Shiwei Guo, Zheyu Chen, Yupeng Ma, Yunfei Han, Yi Wang · 05 May 2025

Tevatron 2.0: Unified Document Retrieval Toolkit across Scale, Language, and Modality
Xueguang Ma, Luyu Gao, Shengyao Zhuang, Jiaqi Samantha Zhan, Jamie Callan, Jimmy Lin · 05 May 2025

Small Clips, Big Gains: Learning Long-Range Refocused Temporal Information for Video Super-Resolution
Xingyu Zhou, Wei Long, Jingbo Lu, Shiyin Jiang, Weiyi You, Haifeng Wu, Shuhang Gu · 04 May 2025

PipeSpec: Breaking Stage Dependencies in Hierarchical LLM Decoding (MoE)
Bradley McDanel, Shanghang Zhang, Y. Hu, Zining Liu · 02 May 2025

Phantora: Live GPU Cluster Simulation for Machine Learning System Performance Estimation
Jianxing Qin, Jingrong Chen, Xinhao Kong, Yongji Wu, Liang Luo, Ziyi Wang, Ying Zhang, Tingjun Chen, Alvin R. Lebeck, Danyang Zhuo · 02 May 2025

Mixture of Sparse Attention: Content-Based Learnable Sparse Attention via Expert-Choice Routing (MoE, VLM)
Piotr Piekos, Róbert Csordás, Jürgen Schmidhuber · 01 May 2025

Scaling On-Device GPU Inference for Large Generative Models
Jiuqiang Tang, Raman Sarokin, Ekaterina Ignasheva, Grant Jensen, Lin Chen, Juhyun Lee, Andrei Kulik, Matthias Grundmann · 01 May 2025

Vision Mamba in Remote Sensing: A Comprehensive Survey of Techniques, Applications and Outlook (Mamba)
Muyi Bao, Shuchang Lyu, Zhaoyang Xu, Huiyu Zhou, Jinchang Ren, Shiming Xiang, Xuelong Li, Guangliang Cheng · 01 May 2025

GPU Performance Portability needs Autotuning
Burkhard Ringlein, Thomas Parnell, Radu Stoica · 30 Apr 2025

Combatting Dimensional Collapse in LLM Pre-Training Data via Diversified File Selection
Ziqing Fan, Siyuan Du, Shengchao Hu, Pingjie Wang, Li Shen, Yanzhe Zhang, Dacheng Tao, Yucheng Wang · 29 Apr 2025

Ascendra: Dynamic Request Prioritization for Efficient LLM Serving
Azam Ikram, Xiang Li, Sameh Elnikety, S. Bagchi · 29 Apr 2025

Softpick: No Attention Sink, No Massive Activations with Rectified Softmax
Zayd Muhammad Kawakibi Zuhri, Erland Hilman Fuadi, Alham Fikri Aji · 29 Apr 2025

Blockbuster, Part 1: Block-level AI Operator Fusion
Ofer Dekel · 29 Apr 2025

A Comparative Study on Positional Encoding for Time-frequency Domain Dual-path Transformer-based Source Separation Models
Kohei Saijo, Tetsuji Ogawa · 28 Apr 2025

R-Sparse: Rank-Aware Activation Sparsity for Efficient LLM Inference
Zhenyu Zhang, Zechun Liu, Yuandong Tian, Harshit Khaitan, Ziyi Wang, Steven Li · 28 Apr 2025

Small Models, Big Tasks: An Exploratory Empirical Study on Small Language Models for Function Calling
Ishan Kavathekar, Raghav Donakanti, Ponnurangam Kumaraguru, Karthik Vaidhyanathan · 27 Apr 2025

A Method for the Architecture of a Medical Vertical Large Language Model Based on Deepseek R1 (MedIm)
Mingda Zhang, Jianglong Qin · 25 Apr 2025

Embedding Empirical Distributions for Computing Optimal Transport Maps (OT)
Mingchen Jiang, Peng Xu, Xichen Ye, Xiaohui Chen, Yun Yang, Yifan Chen · 24 Apr 2025

L3: DIMM-PIM Integrated Architecture and Coordination for Scalable Long-Context LLM Inference
Qingyuan Liu, Liyan Chen, Yanning Yang, Haoyu Wang, Dong Du, Zhigang Mao, Naifeng Jing, Yubin Xia, Haibo Chen · 24 Apr 2025

An Empirical Study on Prompt Compression for Large Language Models (MQ)
Zhenru Zhang, Jinyi Li, Yihuai Lan, Xinze Wang, Hao Wang · 24 Apr 2025

TileLang: A Composable Tiled Programming Model for AI Systems
Lei Wang, Yu Cheng, Yining Shi, Zhengju Tang, Zhiwen Mo, ..., Lingxiao Ma, Yuqing Xia, Jilong Xue, Fan Yang, Zhiyong Yang · 24 Apr 2025

Tempo: Application-aware LLM Serving with Mixed SLO Requirements
Wei Zhang, Zhiyu Wu, Yi Mu, Banruo Liu, Myungjin Lee, Fan Lai · 24 Apr 2025

Generalized Neighborhood Attention: Multi-dimensional Sparse Attention at the Speed of Light
Ali Hassani, Fengzhe Zhou, Aditya Kane, Jiannan Huang, Chieh-Yun Chen, ..., Bing Xu, Haicheng Wu, Wen-mei W. Hwu, Xuan Li, Humphrey Shi · 23 Apr 2025

PARD: Accelerating LLM Inference with Low-Cost PARallel Draft Model Adaptation
Zihao An, Huajun Bai, Ziqiang Liu, Dong Li, E. Barsoum · 23 Apr 2025

Hexcute: A Tile-based Programming Language with Automatic Layout and Task-Mapping Synthesis
Xinsong Zhang, Yaoyao Ding, Yang Hu, Gennady Pekhimenko · 22 Apr 2025

llm-jp-modernbert: A ModernBERT Model Trained on a Large-Scale Japanese Corpus with Long Context Length
Issa Sugiura, Kouta Nakayama, Yusuke Oda · 22 Apr 2025

COBRA: Algorithm-Architecture Co-optimized Binary Transformer Accelerator for Edge Inference
Ye Qiao, Zhiheng Cheng, Yian Wang, Yifan Zhang, Yunzhe Deng, Sitao Huang · 22 Apr 2025

Bidirectional Mamba for Single-Cell Data: Efficient Context Learning with Biological Fidelity (Mamba)
Cong Qi, Hanzhang Fang, Tianxing Hu, Siqi Jiang, Wei Zhi · 22 Apr 2025

Understanding the Skill Gap in Recurrent Language Models: The Role of the Gather-and-Aggregate Mechanism (RALM)
Aviv Bick, Eric P. Xing, Albert Gu · 22 Apr 2025

Efficient Pretraining Length Scaling
Bohong Wu, Shen Yan, Sijun Zhang, Jianqiao Lu, Yutao Zeng, Ya Wang, Xun Zhou · 21 Apr 2025

KeyDiff: Key Similarity-Based KV Cache Eviction for Long-Context LLM Inference in Resource-Constrained Environments
Junyoung Park, Dalton Jones, Matthew J Morse, Raghavv Goel, Mingu Lee, Chris Lott · 21 Apr 2025

SlimPipe: Memory-Thrifty and Efficient Pipeline Parallelism for Long-Context LLM Training
Zheng Li, Yang Liu, Wei Zhang, Tailing Yuan, Bin Chen, Chengru Song, Di Zhang · 20 Apr 2025

Learning to Attribute with Attention
Benjamin Cohen-Wang, Yung-Sung Chuang, Aleksander Madry · 18 Apr 2025

Collaborative Learning of On-Device Small Model and Cloud-Based Large Model: Advances and Future Directions
Chaoyue Niu, Yucheng Ding, Junhui Lu, Zhengxiang Huang, Hang Zeng, Yutong Dai, Xuezhen Tu, Chengfei Lv, Fan Wu, Guihai Chen · 17 Apr 2025

Antidistillation Sampling
Yash Savani, Asher Trockman, Zhili Feng, Avi Schwarzschild, Alexander Robey, Marc Finzi, J. Zico Kolter · 17 Apr 2025

An All-Atom Generative Model for Designing Protein Complexes
Ruizhe Chen, Dongyu Xue, Xiangxin Zhou, Zaixiang Zheng, Xiangxiang Zeng, Quanquan Gu · 17 Apr 2025

A Review of YOLOv12: Attention-Based Enhancements vs. Previous Versions
Rahima Khanam, Muhammad Hussain · 16 Apr 2025

Shared Disk KV Cache Management for Efficient Multi-Instance Inference in RAG-Powered LLMs
Hyungwoo Lee, Kihyun Kim, Jinwoo Kim, Jungmin So, Myung-Hoon Cha, H. Kim, James J. Kim, Youngjae Kim · 16 Apr 2025

TacoDepth: Towards Efficient Radar-Camera Depth Estimation with One-stage Fusion (MDE)
Yanjie Wang, Jiajian Li, Chaoyi Hong, Ruibo Li, Liusheng Sun, Xiao-yang Song, Zhe Wang, Zhiguo Cao, Guosheng Lin · 16 Apr 2025

MOM: Memory-Efficient Offloaded Mini-Sequence Inference for Long Context Language Models (RALM)
Junyang Zhang, Tianyi Zhu, Cheng Luo, A. Anandkumar · 16 Apr 2025

VEXP: A Low-Cost RISC-V ISA Extension for Accelerated Softmax Computation in Transformers
Run Wang, Gamze Islamoglu, Andrea Belano, Viviane Potocnik, Francesco Conti, Angelo Garofalo, Luca Benini · 15 Apr 2025

AlayaDB: The Data Foundation for Efficient and Effective Long-context LLM Inference
Yangshen Deng, Zhengxin You, Long Xiang, Qilong Li, Peiqi Yuan, ..., Man Lung Yiu, Huan Li, Qiaomu Shen, Rui Mao, Bo Tang · 14 Apr 2025

OVERLORD: Ultimate Scaling of DataLoader for Multi-Source Large Foundation Model Training (AI4CE)
Juntao Zhao, Qi Lu, Wei Jia, Borui Wan, Lei Zuo, ..., Size Zheng, Yanghua Peng, H. Lin, Xin Liu, Chuan Wu · 14 Apr 2025