
Mnemosyne: Parallelization Strategies for Efficiently Serving Multi-Million Context Length LLM Inference Requests Without Approximations

Main: 12 pages, Bibliography: 2 pages, 26 figures, 2 tables
Abstract

As large language models (LLMs) evolve to handle increasingly longer contexts, serving inference requests with context lengths in the range of millions of tokens presents unique challenges. While existing techniques are effective for training, they fail to address the distinct requirements of inference, such as the differing prefill and decode phases and their associated latency constraints, like Time to First Token (TTFT) and Time Between Tokens (TBT). Furthermore, no long-context inference solution available today allows batching requests to increase hardware utilization.

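As a minimal illustrative sketch (not from the paper), the two latency metrics named in the abstract can be computed from per-token emission timestamps; the function names and inputs below are hypothetical.

```python
from typing import List

def ttft(request_arrival: float, token_times: List[float]) -> float:
    """Time to First Token: delay from request arrival until the first
    generated token is available (covers prefill plus the first decode step)."""
    return token_times[0] - request_arrival

def tbt(token_times: List[float]) -> List[float]:
    """Time Between Tokens: gaps between consecutive decode tokens;
    serving systems typically constrain the tail (e.g., p99) of this list."""
    return [t1 - t0 for t0, t1 in zip(token_times, token_times[1:])]

# Example: a request arriving at t = 0.0 s whose tokens appear at these times.
times = [1.20, 1.25, 1.31, 1.36]
print(ttft(0.0, times))  # 1.20 s
print(tbt(times))        # [0.05, 0.06, 0.05] s between decode steps
```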
@article{agrawal2025_2409.17264,
  title={Medha: Efficiently Serving Multi-Million Context Length LLM Inference Requests Without Approximations},
  author={Amey Agrawal and Haoran Qiu and Junda Chen and Íñigo Goiri and Chaojie Zhang and Rayyan Shahid and Ramachandran Ramjee and Alexey Tumanov and Esha Choukse},
  journal={arXiv preprint arXiv:2409.17264},
  year={2025}
}