Large Language Models (LLMs) are increasingly deployed across edge and cloud platforms for real-time question-answering and retrieval-augmented generation. However, processing lengthy contexts in distributed systems incurs high computational overhead, memory usage, and network bandwidth consumption. This paper introduces a novel semantic caching approach for storing and reusing intermediate contextual summaries, enabling efficient information reuse across similar queries in LLM-based QA workflows. Our method reduces redundant computations by up to 50-60% while maintaining answer accuracy comparable to full document processing, as demonstrated on NaturalQuestions, TriviaQA, and a synthetic ArXiv dataset. This approach balances computational cost and response quality, a trade-off critical for real-time AI assistants.
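To illustrate the general idea of reusing cached summaries across semantically similar queries, here is a minimal sketch of a semantic cache, not the authors' implementation: the embedding function, similarity threshold, and class names are assumptions, and the bag-of-words embedding is a stand-in for a real sentence-embedding model.

```python
# Illustrative sketch of a semantic cache for contextual summaries.
# On a lookup, if a previously seen query is similar enough, its stored
# summary is reused instead of re-processing the full context.
import math
from collections import Counter


def embed(text: str) -> Counter:
    # Stand-in embedding: bag-of-words term counts.
    # A real system would use a sentence-embedding model here.
    return Counter(text.lower().split())


def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[t] * b[t] for t in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0


class SemanticSummaryCache:
    def __init__(self, threshold: float = 0.8):
        self.threshold = threshold
        self.entries = []  # list of (query_embedding, cached_summary)

    def lookup(self, query: str):
        # Return the best-matching cached summary if similarity clears the threshold.
        q = embed(query)
        best_sim, best_summary = 0.0, None
        for emb, summary in self.entries:
            sim = cosine(q, emb)
            if sim > best_sim:
                best_sim, best_summary = sim, summary
        return best_summary if best_sim >= self.threshold else None

    def insert(self, query: str, summary: str):
        self.entries.append((embed(query), summary))


# Usage: a near-duplicate query hits the cached summary and skips full processing.
cache = SemanticSummaryCache(threshold=0.5)
cache.insert("who wrote the theory of relativity", "Summary: Einstein's 1905/1915 papers ...")
print(cache.lookup("who developed the theory of relativity"))  # reuses the cached summary
```

On a cache hit, the stored summary can be passed to the LLM in place of the full document context, which is where the reduction in redundant computation comes from.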
@article{couturier2025_2505.11271,
  title={Semantic Caching of Contextual Summaries for Efficient Question-Answering with Language Models},
  author={Camille Couturier and Spyros Mastorakis and Haiying Shen and Saravan Rajmohan and Victor Rühle},
  journal={arXiv preprint arXiv:2505.11271},
  year={2025}
}