Mnemosyne: Parallelization Strategies for Efficiently Serving Multi-Million Context Length LLM Inference Requests Without Approximations

As large language models (LLMs) evolve to handle increasingly longer contexts, serving inference requests for context lengths in the range of millions of tokens presents unique challenges. While existing long-context techniques are effective for training, they fail to address the distinct requirements of inference, such as the differing prefill and decode phases and their associated latency constraints, including Time to First Token (TTFT) and Time Between Tokens (TBT). Furthermore, no long-context inference solution available today allows batching requests to increase hardware utilization.
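To make the two latency metrics concrete, here is a minimal sketch (hypothetical helper name, assuming the serving stack exposes the request arrival time and a completion timestamp for each generated token) of how TTFT and TBT are typically computed for a single request.

```python
from statistics import mean

def latency_metrics(arrival_ts: float, token_ts: list[float]) -> dict:
    """Compute Time to First Token (TTFT) and mean Time Between Tokens (TBT)
    for one request, given the request arrival timestamp and the completion
    timestamp of each generated token (all in seconds)."""
    ttft = token_ts[0] - arrival_ts                          # prefill-dominated latency
    gaps = [b - a for a, b in zip(token_ts, token_ts[1:])]   # per-token decode gaps
    return {"ttft": ttft, "mean_tbt": mean(gaps) if gaps else 0.0}

# Example: request arrives at t = 0.0 s, the first token appears after a 2.5 s
# prefill, then decode emits a token roughly every 40 ms.
print(latency_metrics(0.0, [2.5, 2.54, 2.58, 2.62]))
```

For multi-million-token contexts, the prefill term dominates TTFT while TBT is governed by the per-step decode latency, which is why the two constraints must be managed separately.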
@article{agrawal2025_2409.17264,
  title   = {Medha: Efficiently Serving Multi-Million Context Length LLM Inference Requests Without Approximations},
  author  = {Amey Agrawal and Haoran Qiu and Junda Chen and Íñigo Goiri and Chaojie Zhang and Rayyan Shahid and Ramachandran Ramjee and Alexey Tumanov and Esha Choukse},
  journal = {arXiv preprint arXiv:2409.17264},
  year    = {2025}
}