Serving large language models (LLMs) is important for cloud providers, and caching intermediate results (the KV cache) after processing each request substantially improves serving throughput and latency. However, how LLM serving benefits from KV caching is not well understood, and system design decisions such as cache eviction policies are highly workload-dependent. In this paper, we present the first systematic characterization of KV cache workload patterns from one of the leading LLM service providers. We draw observations that were not covered by previous studies focusing on synthetic workloads, including: KV cache reuse is skewed across requests, and reuse across single-turn requests is as important as reuse within multi-turn requests; reuse time and probability vary widely across requests overall, but within a specific request category the pattern tends to be predictable; and the overall cache capacity required for an ideal hit ratio is moderate. Based on this characterization, we further propose a workload-aware cache eviction policy that improves serving performance under real-world traces, especially with limited cache capacity.
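To make the idea of a workload-aware eviction policy concrete, below is a minimal sketch of a prefix-keyed KV cache whose eviction score blends recency with a per-category reuse probability, so that blocks from request categories with predictable, high reuse survive longer than plain LRU would allow. This is an illustration, not the paper's actual algorithm: the class `WorkloadAwareKVCache`, the `CATEGORY_REUSE_PROB` table, and all numbers in it are hypothetical placeholders standing in for statistics that would be learned from real traces.

```python
import time
from dataclasses import dataclass, field
from typing import Dict, Optional

# Hypothetical per-category reuse probabilities; in practice these would be
# estimated from serving traces, the values here are purely illustrative.
CATEGORY_REUSE_PROB = {
    "multi_turn_chat": 0.8,
    "single_turn_api": 0.5,
    "batch_summarize": 0.1,
}

@dataclass
class CacheEntry:
    prefix_hash: str    # hash of the token prefix this KV block covers
    size_tokens: int    # number of tokens whose KV states are stored
    category: str       # request category the entry originated from
    last_access: float = field(default_factory=time.monotonic)

class WorkloadAwareKVCache:
    """Toy prefix-keyed KV cache with a workload-aware eviction score.

    On insertion pressure, the entry with the lowest score is evicted; the
    score combines recency (LRU-style aging) with the per-category reuse
    probability, so frequently reused categories are kept longer.
    """

    def __init__(self, capacity_tokens: int):
        self.capacity_tokens = capacity_tokens
        self.used_tokens = 0
        self.entries: Dict[str, CacheEntry] = {}

    def _score(self, e: CacheEntry) -> float:
        age = time.monotonic() - e.last_access
        reuse_p = CATEGORY_REUSE_PROB.get(e.category, 0.3)
        # Higher reuse probability and more recent access => higher score.
        return reuse_p / (1.0 + age)

    def lookup(self, prefix_hash: str) -> Optional[CacheEntry]:
        e = self.entries.get(prefix_hash)
        if e is not None:
            e.last_access = time.monotonic()  # refresh recency on a hit
        return e

    def insert(self, entry: CacheEntry) -> None:
        # Evict lowest-scoring entries until the new block fits.
        while (self.used_tokens + entry.size_tokens > self.capacity_tokens
               and self.entries):
            victim = min(self.entries.values(), key=self._score)
            self.used_tokens -= victim.size_tokens
            del self.entries[victim.prefix_hash]
        if entry.size_tokens <= self.capacity_tokens:
            self.entries[entry.prefix_hash] = entry
            self.used_tokens += entry.size_tokens

# Example usage under the assumptions above:
# cache = WorkloadAwareKVCache(capacity_tokens=8192)
# cache.insert(CacheEntry("prefix-abc", 1024, "multi_turn_chat"))
# hit = cache.lookup("prefix-abc")  # refreshes recency if present
```

The design choice illustrated here follows the abstract's observation that reuse patterns are diverse overall but predictable per request category, which is why the score consults a category-level statistic rather than treating all entries uniformly.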
@article{wang2025_2506.02634,
  title   = {KVCache Cache in the Wild: Characterizing and Optimizing KVCache Cache at a Large Cloud Provider},
  author  = {Jiahao Wang and Jinbo Han and Xingda Wei and Sijie Shen and Dingyan Zhang and Chenguang Fang and Rong Chen and Wenyuan Yu and Haibo Chen},
  journal = {arXiv preprint arXiv:2506.02634},
  year    = {2025}
}