Low-Rank Key Value Attention
James O'Neill
Robert Clancy
Mariia Matskevichus
Fergal Reid
Main: 8 pages
19 figures
15 tables
Bibliography: 3 pages
Appendix: 17 pages
Abstract
Transformer pretraining is increasingly constrained by memory and compute requirements, with the key-value (KV) cache emerging as a dominant bottleneck during training and autoregressive decoding. We propose low-rank KV adaptation (LRKV), a simple modification of multi-head attention that reduces KV cache memory by exploiting redundancy across attention heads while preserving full token-level resolution. Each layer uses a shared full-rank KV projection augmented with low-rank, head-specific residuals, yielding a continuous trade-off between complete sharing and fully independent attention.
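A minimal sketch of the idea described in the abstract, assuming a standard PyTorch multi-head attention layout: every head reads from one shared full-rank K/V projection (the part that would be cached), and a per-head low-rank residual (down-projection to a small rank, then up-projection per head) is added on top. All class, parameter, and rank names here are hypothetical illustrations, not the paper's implementation.

import torch
import torch.nn as nn

class LowRankKVAttention(nn.Module):
    """Illustrative sketch: shared full-rank KV projection plus
    low-rank, head-specific residuals (hypothetical shapes/names)."""

    def __init__(self, d_model: int, n_heads: int, rank: int):
        super().__init__()
        assert d_model % n_heads == 0
        self.n_heads = n_heads
        self.d_head = d_model // n_heads

        # Per-head queries, as in standard multi-head attention.
        self.q_proj = nn.Linear(d_model, d_model, bias=False)

        # Shared full-rank K/V projection: one head-sized projection
        # reused by every head (this is the small, cacheable part).
        self.k_shared = nn.Linear(d_model, self.d_head, bias=False)
        self.v_shared = nn.Linear(d_model, self.d_head, bias=False)

        # Low-rank, head-specific residuals: d_model -> rank -> n_heads * d_head.
        self.k_down = nn.Linear(d_model, rank, bias=False)
        self.k_up = nn.Linear(rank, n_heads * self.d_head, bias=False)
        self.v_down = nn.Linear(d_model, rank, bias=False)
        self.v_up = nn.Linear(rank, n_heads * self.d_head, bias=False)

        self.out_proj = nn.Linear(d_model, d_model, bias=False)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        B, T, _ = x.shape
        H, Dh = self.n_heads, self.d_head

        q = self.q_proj(x).view(B, T, H, Dh).transpose(1, 2)   # (B, H, T, Dh)

        # Shared component, broadcast across all heads.
        k = self.k_shared(x).unsqueeze(1)                       # (B, 1, T, Dh)
        v = self.v_shared(x).unsqueeze(1)

        # Head-specific low-rank residuals added on top of the shared projection.
        k = k + self.k_up(self.k_down(x)).view(B, T, H, Dh).transpose(1, 2)
        v = v + self.v_up(self.v_down(x)).view(B, T, H, Dh).transpose(1, 2)

        attn = torch.softmax(q @ k.transpose(-2, -1) / Dh ** 0.5, dim=-1)
        out = (attn @ v).transpose(1, 2).reshape(B, T, H * Dh)
        return self.out_proj(out)

Setting the residual rank to zero would recover full KV sharing across heads, while a rank equal to the head dimension recovers fully independent heads, which is one way to read the "continuous trade-off" mentioned in the abstract.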
