Large Language Models (LLMs) are increasingly deployed across edge and cloud platforms for real-time question-answering and retrieval-augmented generation. However, processing lengthy contexts in distributed systems incurs high computational overhead, memory usage, and network bandwidth consumption. This paper introduces a novel semantic caching approach for storing and reusing intermediate contextual summaries, enabling efficient information reuse across similar queries in LLM-based QA workflows. Our method reduces redundant computations by up to 50-60% while maintaining answer accuracy comparable to full document processing, as demonstrated on NaturalQuestions, TriviaQA, and a synthetic ArXiv dataset. This approach balances computational cost and response quality, a trade-off critical for real-time AI assistants.
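To illustrate the general idea of reusing cached summaries across semantically similar queries, here is a minimal sketch of a semantic cache, not the authors' implementation: the embedding function, similarity threshold, and class names are assumptions, and the bag-of-words embedding is a stand-in for a real sentence-embedding model.

```python
# Illustrative sketch of a semantic cache for contextual summaries.
# On a lookup, if a previously seen query is similar enough, its stored
# summary is reused instead of re-processing the full context.
import math
from collections import Counter


def embed(text: str) -> Counter:
    # Stand-in embedding: bag-of-words term counts.
    # A real system would use a sentence-embedding model here.
    return Counter(text.lower().split())


def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[t] * b[t] for t in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0


class SemanticSummaryCache:
    def __init__(self, threshold: float = 0.8):
        self.threshold = threshold
        self.entries = []  # list of (query_embedding, cached_summary)

    def lookup(self, query: str):
        # Return the best-matching cached summary if similarity clears the threshold.
        q = embed(query)
        best_sim, best_summary = 0.0, None
        for emb, summary in self.entries:
            sim = cosine(q, emb)
            if sim > best_sim:
                best_sim, best_summary = sim, summary
        return best_summary if best_sim >= self.threshold else None

    def insert(self, query: str, summary: str):
        self.entries.append((embed(query), summary))


# Usage: a near-duplicate query hits the cached summary and skips full processing.
cache = SemanticSummaryCache(threshold=0.5)
cache.insert("who wrote the theory of relativity", "Summary: Einstein's 1905/1915 papers ...")
print(cache.lookup("who developed the theory of relativity"))  # reuses the cached summary
```

On a cache hit, the stored summary can be passed to the LLM in place of the full document context, which is where the reduction in redundant computation comes from.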
@article{couturier2025_2505.11271,
  title={Semantic Caching of Contextual Summaries for Efficient Question-Answering with Language Models},
  author={Camille Couturier and Spyros Mastorakis and Haiying Shen and Saravan Rajmohan and Victor Rühle},
  journal={arXiv preprint arXiv:2505.11271},
  year={2025}
}