Medha: Efficiently Serving Multi-Million Context Length LLM Inference Requests Without Approximations

25 September 2024
Amey Agrawal, Haoran Qiu, Junda Chen, Íñigo Goiri, Chaojie Zhang, Rayyan Shahid, Ramachandran Ramjee, Alexey Tumanov, Esha Choukse
Communities: RALM, LRM
Main: 12 pages, 26 figures, 2 tables; bibliography: 2 pages
Abstract

As large language models (LLMs) handle increasingly longer contexts, serving inference requests for context lengths in the range of millions of tokens presents unique challenges. While existing techniques are effective for training, they fail to address the distinct requirements of inference, such as the separate prefill and decode phases and their associated latency constraints, including Time to First Token (TTFT) and Time per Output Token (TPOT). Furthermore, no existing long-context inference solution addresses head-of-line blocking.
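The abstract refers to two standard streaming latency metrics, TTFT and TPOT. The sketch below (not from the paper; the function name and inputs are hypothetical) shows how these are commonly computed for a single streamed request from its arrival time and the emission times of its output tokens: TTFT is the gap to the first token (dominated by prefill), TPOT the average gap between subsequent tokens (dominated by decode).

# Minimal sketch of TTFT/TPOT computation for one streamed request.
# Assumed inputs: request arrival time and per-token emission timestamps (seconds).
from typing import List, Tuple


def ttft_and_tpot(request_arrival: float, token_timestamps: List[float]) -> Tuple[float, float]:
    """Return (TTFT, mean TPOT) in seconds for a single request."""
    if not token_timestamps:
        raise ValueError("request produced no output tokens")

    # Time to First Token: arrival of the request to emission of the first token.
    ttft = token_timestamps[0] - request_arrival

    if len(token_timestamps) == 1:
        return ttft, 0.0

    # Time per Output Token: average inter-token gap over the decode phase.
    decode_span = token_timestamps[-1] - token_timestamps[0]
    tpot = decode_span / (len(token_timestamps) - 1)
    return ttft, tpot


if __name__ == "__main__":
    # Example: request arrives at t=0.0s, first token at t=2.5s (long prefill),
    # then one token every 50 ms during decode.
    arrival = 0.0
    tokens = [2.5 + 0.05 * i for i in range(100)]
    ttft, tpot = ttft_and_tpot(arrival, tokens)
    print(f"TTFT = {ttft:.3f}s, TPOT = {tpot * 1000:.1f}ms")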

View on arXiv: https://arxiv.org/abs/2409.17264
@article{agrawal2025_2409.17264,
  title={Medha: Efficiently Serving Multi-Million Context Length LLM Inference Requests Without Approximations},
  author={Amey Agrawal and Haoran Qiu and Junda Chen and Íñigo Goiri and Chaojie Zhang and Rayyan Shahid and Ramachandran Ramjee and Alexey Tumanov and Esha Choukse},
  journal={arXiv preprint arXiv:2409.17264},
  year={2025}
}