KVShare: Semantic-Aware Key-Value Cache Sharing for Efficient Large Language Model Inference

17 March 2025
Huan Yang, Renji Zhang, Mingzhe Huang, Weijun Wang, Yin Tang, Yuanchun Li, Yunxin Liu, Deyu Zhang
Main: 9 pages · 12 figures · 1 table · Bibliography: 2 pages · Appendix: 11 pages
Abstract

This paper presents KVShare, a multi-user Key-Value (KV) Cache sharing technology based on semantic similarity, designed to enhance the inference efficiency of Large Language Models (LLMs) and Multimodal Large Language Models (MLLMs). Addressing the limitations of existing prefix caching (strict text prefix matching) and semantic caching (loss of response diversity), KVShare achieves fine-grained KV cache reuse through semantic alignment algorithms and differential editing operations. Experiments on real-world user conversation datasets demonstrate that KVShare improves KV cache hit rates by over 60%, while maintaining output quality comparable to full computation (no significant degradation in BLEU and ROUGE-L metrics). This approach effectively reduces GPU resource consumption and is applicable to scenarios with repetitive queries, such as healthcare and education.
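To make the abstract's mechanism concrete, here is a minimal Python sketch of the general idea: a similarity-gated cache lookup followed by diff-based partial recomputation. Everything in it is an illustrative assumption rather than the paper's algorithm: difflib's lexical SequenceMatcher stands in for KVShare's semantic alignment, the 0.8 reuse threshold is arbitrary, and compute_kv is a stub for a real prefill pass.

# Illustrative sketch only; the matcher, threshold, and KV stub are
# assumptions, not KVShare's actual semantic alignment algorithm.
from difflib import SequenceMatcher

SIM_THRESHOLD = 0.8  # hypothetical reuse threshold


class KVCacheStore:
    def __init__(self):
        self.entries = []  # (tokens, kv) pairs from earlier requests

    def lookup(self, tokens):
        """Return the most similar cached entry, if similar enough to reuse."""
        best, best_sim = None, 0.0
        for cached_tokens, kv in self.entries:
            sim = SequenceMatcher(None, cached_tokens, tokens).ratio()
            if sim > best_sim:
                best, best_sim = (cached_tokens, kv), sim
        return best if best_sim >= SIM_THRESHOLD else None

    def insert(self, tokens, kv):
        self.entries.append((tokens, kv))


def compute_kv(tokens):
    """Stand-in for a real prefill pass: one placeholder KV entry per token."""
    return [f"kv({t})" for t in tokens]


def prefill_with_reuse(store, tokens):
    """Reuse KV entries for spans aligned with a cached request and
    recompute only the differing spans (the 'differential edit')."""
    hit = store.lookup(tokens)
    if hit is None:
        kv = compute_kv(tokens)
        store.insert(tokens, kv)
        return kv, 0
    cached_tokens, cached_kv = hit
    kv, reused = [], 0
    for op, i1, i2, j1, j2 in SequenceMatcher(None, cached_tokens, tokens).get_opcodes():
        if op == "equal":
            kv.extend(cached_kv[i1:i2])           # reuse aligned KV entries
            reused += i2 - i1
        else:
            kv.extend(compute_kv(tokens[j1:j2]))  # recompute edited span
    store.insert(tokens, kv)
    return kv, reused


store = KVCacheStore()
q1 = "what are the early symptoms of type 2 diabetes".split()
q2 = "what are the common symptoms of type 2 diabetes".split()
prefill_with_reuse(store, q1)
_, reused = prefill_with_reuse(store, q2)
print(f"reused {reused}/{len(q2)} token KV entries")  # reused 8/9

Note that the sketch ignores positional and attention dependencies: in a real transformer, a reused KV entry is only valid if the prefix it attends over is preserved or corrected, which is part of what the paper's differential editing operations are designed to handle.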

@article{yang2025_2503.16525,
  title={KVShare: An LLM Service System with Efficient and Effective Multi-Tenant KV Cache Reuse},
  author={Huan Yang and Renji Zhang and Mingzhe Huang and Weijun Wang and Yin Tang and Yuanchun Li and Yunxin Liu and Deyu Zhang},
  journal={arXiv preprint arXiv:2503.16525},
  year={2025}
}