FlashForge: Ultra-Efficient Prefix-Aware Attention for LLM Decoding

23 May 2025
Zhibin Wang, Rui Ning, Chao Fang, Zhonghui Zhang, Xi Lin, Shaobo Ma, Mo Zhou, Xue Li, Zhongfeng Wang, Chengying Huan, Rong Gu, Kun Yang, Guihai Chen, Sheng Zhong, Chen Tian
12 pages (main) + 2 pages (bibliography), 10 figures, 2 tables
Abstract

Prefix sharing among multiple prompts creates opportunities to combine operations over the shared prefix. At the same time, attention computation in the decode stage, which becomes a critical bottleneck as context lengths grow, is a memory-intensive process that requires heavy memory access to the key-value (KV) cache of those prefixes. In this paper, we therefore explore the potential of prefix sharing in decode-stage attention computation. However, the tree structure of the prefix-sharing mechanism poses significant challenges for attention computation: efficiently handling shared KV-cache access patterns while managing complex dependencies and balancing irregular workloads. To address these challenges, we propose FlashForge, a dedicated attention kernel that combines the memory accesses of shared prefixes in the decode stage. FlashForge delivers two key innovations: a novel shared-prefix attention kernel that optimizes the memory hierarchy and exploits both intra-block and inter-block parallelism, and a comprehensive workload-balancing mechanism that efficiently estimates cost, partitions tasks, and schedules execution. Experimental results show that, for attention computation in the decode stage, FlashForge achieves an average 1.9x speedup and a 120.9x reduction in memory access compared to the state-of-the-art FlashDecoding kernel, and a 3.8x speedup in end-to-end time per output token compared to vLLM.
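
The core computational idea the abstract describes, reading the shared prefix's KV cache once for all queries that descend from it and then merging that partial result with each request's private-suffix attention, can be illustrated with a small PyTorch sketch. The tensor shapes, the two-way prefix/suffix split, and the helper names partial_attention and merge below are assumptions made for exposition; this is not the FlashForge kernel itself, which fuses these steps on the GPU with block-level scheduling and workload balancing.

# Minimal sketch of prefix-aware decode attention (illustrative, not the paper's kernel):
# attend to the shared prefix KV once for the whole batch, attend to each
# request's private suffix KV separately, then merge the partial results
# with log-sum-exp renormalization so the answer equals full attention.
import torch

def partial_attention(q, k, v):
    """Attention over one KV segment; returns output and log-sum-exp."""
    # q: [B, H, D], k/v: [T, H, D] (one KV segment shared by the batch)
    scale = q.shape[-1] ** -0.5
    scores = torch.einsum("bhd,thd->bht", q, k) * scale   # [B, H, T]
    lse = torch.logsumexp(scores, dim=-1)                 # [B, H]
    probs = torch.softmax(scores, dim=-1)
    out = torch.einsum("bht,thd->bhd", probs, v)          # [B, H, D]
    return out, lse

def merge(out_a, lse_a, out_b, lse_b):
    """Combine two partial attention outputs into the exact full softmax."""
    lse = torch.logaddexp(lse_a, lse_b)                   # [B, H]
    w_a = torch.exp(lse_a - lse).unsqueeze(-1)
    w_b = torch.exp(lse_b - lse).unsqueeze(-1)
    return w_a * out_a + w_b * out_b

# One decode step for B requests that share a single prefix.
B, H, D, T_prefix, T_suffix = 4, 8, 64, 1024, 16
q = torch.randn(B, H, D)
prefix_k = torch.randn(T_prefix, H, D)   # shared prefix KV, read once for all B
prefix_v = torch.randn(T_prefix, H, D)

out_p, lse_p = partial_attention(q, prefix_k, prefix_v)

outputs = []
for b in range(B):                       # per-request private suffix KV
    suffix_k = torch.randn(T_suffix, H, D)
    suffix_v = torch.randn(T_suffix, H, D)
    out_s, lse_s = partial_attention(q[b:b+1], suffix_k, suffix_v)
    outputs.append(merge(out_p[b:b+1], lse_p[b:b+1], out_s, lse_s))
output = torch.cat(outputs, dim=0)       # [B, H, D] decode attention output

Because the log-sum-exp merge is exact, the combined result is mathematically identical to attention over the concatenated prefix-plus-suffix KV, which is why the prefix segment only needs to be read from memory once per decode step no matter how many requests share it.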

@article{wang2025_2505.17694,
  title={FlashForge: Ultra-Efficient Prefix-Aware Attention for LLM Decoding},
  author={Zhibin Wang and Rui Ning and Chao Fang and Zhonghui Zhang and Xi Lin and Shaobo Ma and Mo Zhou and Xue Li and Zhongfeng Wang and Chengying Huan and Rong Gu and Kun Yang and Guihai Chen and Sheng Zhong and Chen Tian},
  journal={arXiv preprint arXiv:2505.17694},
  year={2025}
}