ELIS: Efficient LLM Iterative Scheduling System with Response Length Predictor

14 May 2025

Abstract

We propose ELIS, a serving system for Large Language Models (LLMs) featuring an Iterative Shortest Remaining Time First (ISRTF) scheduler that prioritizes the inference tasks with the fewest remaining tokens. Current LLM serving systems often employ a first-come-first-served scheduling strategy, which can lead to the "head-of-line blocking" problem, in which short requests wait behind long ones. To overcome this limitation, it is necessary to predict LLM inference times and apply a shortest-job-first scheduling strategy. However, due to the auto-regressive nature of LLMs, predicting inference latency is challenging. ELIS addresses this challenge by training a response length predictor for LLMs using the BGE model, a state-of-the-art encoder-based model. Additionally, we have devised the ISRTF scheduling strategy, an optimization of shortest remaining time first tailored to existing LLM iteration batching. To evaluate our work in an industrial setting, we simulate streams of requests based on our study of real-world user LLM serving trace records. Furthermore, we implemented ELIS as a cloud-native scheduler system on Kubernetes to evaluate its performance in production environments. Our experimental results demonstrate that ISRTF reduces the average job completion time by up to 19.6%.
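The contrast between first-come-first-served and an iterative shortest-remaining-time-first policy can be illustrated with a toy simulation. This is a minimal sketch under strong assumptions, not the ELIS implementation: all jobs arrive at time zero, one job runs at a time (no iteration batching), one token costs one time unit, and the scheduler re-decides after every fixed window of tokens. The names `Job`, `predict_remaining`, `average_completion_time`, and the `window` parameter are hypothetical, and `predict_remaining` is an oracle standing in for a learned response length predictor.

```python
from dataclasses import dataclass


@dataclass
class Job:
    name: str
    arrival_order: int
    total_tokens: int      # true response length, unknown to a real scheduler
    generated: int = 0     # tokens produced so far


def predict_remaining(job: Job) -> int:
    # Hypothetical stand-in for a response length predictor.
    # It is an oracle here; a learned predictor would return an estimate.
    return job.total_tokens - job.generated


def average_completion_time(lengths: dict, policy: str, window: int = 50) -> float:
    """Run all jobs to completion and return the mean completion time."""
    pending = [Job(name, i, n) for i, (name, n) in enumerate(lengths.items())]
    clock = 0
    finish = {}
    while pending:
        if policy == "fcfs":
            # Always serve the earliest arrival.
            job = min(pending, key=lambda j: j.arrival_order)
        elif policy == "isrtf":
            # Re-predict every window and serve the job with the least remaining work.
            job = min(pending, key=lambda j: (predict_remaining(j), j.arrival_order))
        else:
            raise ValueError(f"unknown policy: {policy}")
        # Generate at most one window of tokens, then reschedule.
        step = min(window, job.total_tokens - job.generated)
        job.generated += step
        clock += step
        if job.generated == job.total_tokens:
            finish[job.name] = clock
            pending.remove(job)
    return sum(finish.values()) / len(finish)


if __name__ == "__main__":
    workload = {"A": 300, "B": 50, "C": 100}  # made-up response lengths in tokens
    print("fcfs :", average_completion_time(workload, "fcfs"))
    print("isrtf:", average_completion_time(workload, "isrtf"))
```

In this made-up workload the long job A arrives first, so under first-come-first-served the short jobs B and C wait behind it, which is the head-of-line blocking effect. The numbers produced by this sketch are illustrative only and are unrelated to the results reported in the paper.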


@article{choi2025_2505.09142,
  title={ELIS: Efficient LLM Iterative Scheduling System with Response Length Predictor},
  author={Seungbeom Choi and Jeonghoe Goo and Eunjoo Jeon and Mingyu Yang and Minsung Jang},
  journal={arXiv preprint arXiv:2505.09142},
  year={2025}
}
