As large language models (LLMs) handle increasingly longer contexts, serving inference requests for context lengths in the range of millions of tokens presents unique challenges. While existing techniques are effective for training, they fail to address the distinct requirements of inference, such as the differing prefill and decode phases and their associated latency constraints, namely Time to First Token (TTFT) and Time per Output Token (TPOT). Furthermore, no existing long-context inference solution addresses head-of-line blocking.
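As a minimal sketch (not from the paper) of how the two latency metrics named above are typically measured: TTFT is the delay from request arrival to the first emitted token, and TPOT is the average gap between subsequent decode tokens. The RequestTrace class and its fields below are illustrative assumptions, not part of the Medha system.

import time
from dataclasses import dataclass, field

@dataclass
class RequestTrace:
    """Timestamps collected while serving one inference request (illustrative only)."""
    arrival_time: float                                # when the request entered the serving queue
    token_times: list = field(default_factory=list)    # wall-clock time of each emitted output token

    def record_token(self) -> None:
        # Call once per generated token as it is streamed back to the client.
        self.token_times.append(time.monotonic())

    def ttft(self) -> float:
        """Time to First Token: arrival until the first output token (dominated by prefill)."""
        return self.token_times[0] - self.arrival_time

    def tpot(self) -> float:
        """Time per Output Token: mean gap between consecutive decode tokens."""
        gaps = [b - a for a, b in zip(self.token_times, self.token_times[1:])]
        return sum(gaps) / len(gaps)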
@article{agrawal2025_2409.17264,
  title   = {Medha: Efficiently Serving Multi-Million Context Length LLM Inference Requests Without Approximations},
  author  = {Amey Agrawal and Haoran Qiu and Junda Chen and Íñigo Goiri and Chaojie Zhang and Rayyan Shahid and Ramachandran Ramjee and Alexey Tumanov and Esha Choukse},
  journal = {arXiv preprint arXiv:2409.17264},
  year    = {2025}
}