Beyond Homogeneous Attention: Memory-Efficient LLMs via Fourier-Approximated KV Cache

Large Language Models struggle with the memory demands of the growing Key-Value (KV) cache as context lengths increase. Existing compression methods homogenize head dimensions or rely on attention-guided token pruning, often sacrificing accuracy or introducing computational overhead. We propose FourierAttention, a training-free framework that exploits the heterogeneous roles of transformer head dimensions: lower dimensions prioritize local context, while upper ones capture long-range dependencies. By projecting the long-context-insensitive dimensions onto orthogonal Fourier bases, FourierAttention approximates their temporal evolution with fixed-length spectral coefficients. Evaluations on LLaMA models show that FourierAttention achieves the best long-context accuracy on LongBench and Needle-In-A-Haystack (NIAH). In addition, a custom Triton kernel, FlashFourierAttention, optimizes memory via streamlined read-write operations, enabling efficient deployment without compromising performance.
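To make the core idea concrete, the following is a minimal sketch, not the authors' released implementation: it illustrates how the temporal trajectory of selected KV-cache dimensions could be projected onto an orthogonal Fourier basis and stored as a fixed number of spectral coefficients. The shapes, the choice of which dimensions to compress, and all function names (e.g. compress_kv_dims, num_coeffs) are assumptions made here for illustration.

import torch

# Hedged sketch of Fourier-approximated KV-cache compression.
# Assumption: `kv` is a [seq_len, dim] slice of the K or V cache restricted to
# the head dimensions treated as insensitive to long context; we keep only
# `num_coeffs` spectral coefficients per dimension instead of the full trajectory.


def fourier_basis(seq_len: int, num_coeffs: int) -> torch.Tensor:
    """Orthonormal real Fourier basis (DC + cosine/sine columns), shape [seq_len, num_coeffs]."""
    t = torch.arange(seq_len, dtype=torch.float32)
    cols = [torch.ones(seq_len) / seq_len ** 0.5]  # DC component
    k = 1
    while len(cols) < num_coeffs:
        cols.append(torch.cos(2 * torch.pi * k * t / seq_len) * (2 / seq_len) ** 0.5)
        if len(cols) < num_coeffs:
            cols.append(torch.sin(2 * torch.pi * k * t / seq_len) * (2 / seq_len) ** 0.5)
        k += 1
    return torch.stack(cols, dim=1)  # [seq_len, num_coeffs]


def compress_kv_dims(kv: torch.Tensor, num_coeffs: int) -> torch.Tensor:
    """Project each dimension's trajectory over time onto the Fourier basis.

    Returns fixed-length spectral coefficients of shape [num_coeffs, dim],
    stored in place of the [seq_len, dim] cache slice.
    """
    basis = fourier_basis(kv.shape[0], num_coeffs)  # [T, m]
    return basis.T @ kv                             # [m, dim]


def reconstruct_kv_dims(coeffs: torch.Tensor, seq_len: int) -> torch.Tensor:
    """Approximately recover the [seq_len, dim] trajectory from its coefficients."""
    basis = fourier_basis(seq_len, coeffs.shape[0])  # [T, m]
    return basis @ coeffs                            # [T, dim]


if __name__ == "__main__":
    T, d, m = 4096, 64, 128                              # cache length, #compressed dims, #coefficients
    kv_slice = torch.randn(T, d).cumsum(0) / T ** 0.5    # smooth, slowly varying toy trajectories
    coeffs = compress_kv_dims(kv_slice, m)               # fixed-length storage
    approx = reconstruct_kv_dims(coeffs, T)
    print("storage ratio:", coeffs.numel() / kv_slice.numel())
    print("relative error:", ((approx - kv_slice).norm() / kv_slice.norm()).item())

In this toy setting the storage for the compressed dimensions shrinks from seq_len to num_coeffs values per dimension, while slowly varying (low-frequency) trajectories are reconstructed with small error; how the insensitive dimensions are identified and how reconstruction is fused into attention (the role of the FlashFourierAttention kernel) is specific to the paper and not reproduced here.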
@article{liu2025_2506.11886,
  title={Beyond Homogeneous Attention: Memory-Efficient LLMs via Fourier-Approximated KV Cache},
  author={Xiaoran Liu and Siyang He and Qiqi Wang and Ruixiao Li and Yuerong Song and Zhigeng Liu and Linlin Li and Qun Liu and Zengfeng Huang and Qipeng Guo and Ziwei He and Xipeng Qiu},
  journal={arXiv preprint arXiv:2506.11886},
  year={2025}
}