Priority-Aware Preemptive Scheduling for Mixed-Priority Workloads in MoE Inference

13 March 2025
Mohammad Siavashi
Faezeh Keshmiri Dindarloo
Dejan Kostić
Marco Chiesa
    MoE
    VLM
ArXiv / PDF / HTML
Abstract

Large Language Models have revolutionized natural language processing, yet serving them efficiently in data centers remains challenging due to mixed workloads comprising latency-sensitive (LS) and best-effort (BE) jobs. Existing inference systems employ iteration-level first-come-first-served scheduling, causing head-of-line blocking when BE jobs delay LS jobs. We introduce QLLM, a novel inference system designed for Mixture of Experts (MoE) models, featuring a fine-grained, priority-aware preemptive scheduler. QLLM enables expert-level preemption, deferring BE job execution while minimizing LS time-to-first-token (TTFT). Our approach removes iteration-level scheduling constraints, enabling the scheduler to preempt jobs at any layer based on priority. Evaluations on an Nvidia A100 GPU show that QLLM significantly improves performance. It reduces LS TTFT by an average of 65.5× and meets the SLO at up to 7 requests/sec, whereas the baseline fails to do so under the tested workload. Additionally, it cuts LS turnaround time by up to 12.8× without impacting throughput. QLLM is modular, extensible, and seamlessly integrates with Hugging Face MoE models.
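The abstract does not include code, but the core idea of layer/expert-level preemption can be illustrated with a minimal sketch: instead of scheduling only at iteration boundaries, the runner re-enters the priority queue after every layer step, so a newly arrived LS job can displace a BE job mid-model. The sketch below is purely illustrative and assumes hypothetical names (Job, PreemptiveScheduler, step); it is not QLLM's actual implementation.

```python
import heapq
from dataclasses import dataclass, field
from enum import IntEnum
from typing import Optional


class Priority(IntEnum):
    LS = 0   # latency-sensitive: always served before best-effort
    BE = 1   # best-effort


@dataclass(order=True)
class Job:
    priority: Priority                       # compared first
    arrival: int                             # tie-break: earlier arrivals first
    job_id: str = field(compare=False)
    layers_left: int = field(compare=False, default=4)


class PreemptiveScheduler:
    """Toy layer-granular priority scheduler (illustrative only, not QLLM).

    After each layer (or expert block) the running job is pushed back into
    the queue, so a newly arrived LS job preempts a BE job mid-model rather
    than waiting for the whole iteration to finish.
    """

    def __init__(self) -> None:
        self._queue: list[Job] = []

    def submit(self, job: Job) -> None:
        heapq.heappush(self._queue, job)

    def step(self) -> Optional[str]:
        """Run exactly one layer of the highest-priority pending job."""
        if not self._queue:
            return None
        job = heapq.heappop(self._queue)
        job.layers_left -= 1                 # "execute" one layer / expert
        if job.layers_left > 0:
            heapq.heappush(self._queue, job)  # preemption point between layers
            return f"{job.job_id}: ran one layer"
        return f"{job.job_id}: finished"


if __name__ == "__main__":
    sched = PreemptiveScheduler()
    sched.submit(Job(Priority.BE, arrival=0, job_id="be-0"))
    print(sched.step())                       # BE job runs one layer
    sched.submit(Job(Priority.LS, arrival=1, job_id="ls-0"))
    while (msg := sched.step()) is not None:  # LS job preempts remaining BE layers
        print(msg)
```

In this toy run, the BE job completes one layer, then the LS job arrives and runs all of its layers before the BE job resumes, which is the head-of-line-blocking avoidance the paper targets; the real system additionally has to manage KV-cache and expert state across preemptions.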

View on arXiv
@article{siavashi2025_2503.09304,
  title={Priority-Aware Preemptive Scheduling for Mixed-Priority Workloads in MoE Inference},
  author={Mohammad Siavashi and Faezeh Keshmiri Dindarloo and Dejan Kostic and Marco Chiesa},
  journal={arXiv preprint arXiv:2503.09304},
  year={2025}
}