KVzip: Query-Agnostic KV Cache Compression with Context Reconstruction

29 May 2025
Jang-Hyun Kim
Jinuk Kim
Sangwoo Kwon
Jae W. Lee
Sangdoo Yun
Hyun Oh Song
Main: 12 pages · Appendix: 6 pages · Bibliography: 3 pages · 28 figures · 3 tables
Abstract

Transformer-based large language models (LLMs) cache context as key-value (KV) pairs during inference. As context length grows, KV cache sizes expand, leading to substantial memory overhead and increased attention latency. This paper introduces KVzip, a query-agnostic KV cache eviction method enabling effective reuse of compressed KV caches across diverse queries. KVzip quantifies the importance of a KV pair using the underlying LLM to reconstruct original contexts from cached KV pairs, subsequently evicting pairs with lower importance. Extensive empirical evaluations demonstrate that KVzip reduces KV cache size by 3-4× and FlashAttention decoding latency by approximately 2×, with negligible performance loss in question-answering, retrieval, reasoning, and code comprehension tasks. Evaluations include various models such as LLaMA3.1-8B, Qwen2.5-14B, and Gemma3-12B, with context lengths reaching up to 170K tokens. KVzip significantly outperforms existing query-aware KV eviction methods, which suffer from performance degradation even at a 90% cache budget ratio under multi-query scenarios.
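The eviction idea described above can be illustrated with a minimal sketch: score each cached KV pair by how strongly the model attends to it while reconstructing the original context, then keep only the top-scoring fraction per head. This is a hypothetical illustration based on the abstract, not the authors' implementation; the tensor shapes, the max-over-queries scoring rule, and all function names are assumptions made for clarity.

```python
# Hypothetical sketch of query-agnostic KV eviction driven by a
# context-reconstruction pass (assumptions, not the KVzip codebase).
import torch

def kv_importance_from_reconstruction(attn: torch.Tensor) -> torch.Tensor:
    """attn: [num_heads, num_recon_queries, ctx_len] attention weights
    collected while the LLM reconstructs (repeats) the original context
    from its cached KV pairs. Each cached KV pair's importance is taken
    as the maximum attention it receives from any reconstruction query."""
    return attn.max(dim=1).values  # -> [num_heads, ctx_len]

def evict(keys, values, importance, keep_ratio=0.3):
    """keys/values: [num_heads, ctx_len, head_dim]. Keep the top
    `keep_ratio` fraction of KV pairs per head; evict the rest."""
    num_heads, ctx_len, head_dim = keys.shape
    k = max(1, int(ctx_len * keep_ratio))
    # Top-k positions per head, re-sorted to preserve sequence order.
    keep_idx = importance.topk(k, dim=-1).indices.sort(dim=-1).values
    gather = keep_idx.unsqueeze(-1).expand(-1, -1, head_dim)
    return keys.gather(1, gather), values.gather(1, gather)

# Toy usage with random tensors standing in for a real cache and
# a real attention trace from a reconstruction prompt.
if __name__ == "__main__":
    H, T, D, Q = 8, 1024, 64, 128
    attn = torch.rand(H, Q, T).softmax(dim=-1)
    K, V = torch.randn(H, T, D), torch.randn(H, T, D)
    score = kv_importance_from_reconstruction(attn)
    K_small, V_small = evict(K, V, score, keep_ratio=0.3)  # ~3.3x smaller cache
    print(K_small.shape, V_small.shape)
```

Because the scores come from reconstructing the context itself rather than from any particular user query, the compressed cache produced this way can, in principle, be reused across different downstream queries, which is the query-agnostic property the abstract emphasizes.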

@article{kim2025_2505.23416,
  title={KVzip: Query-Agnostic KV Cache Compression with Context Reconstruction},
  author={Jang-Hyun Kim and Jinuk Kim and Sangwoo Kwon and Jae W. Lee and Sangdoo Yun and Hyun Oh Song},
  journal={arXiv preprint arXiv:2505.23416},
  year={2025}
}