CE-CoLLM: Efficient and Adaptive Large Language Models Through Cloud-Edge Collaboration

Large Language Models (LLMs) exhibit remarkable human-like predictive capabilities. However, it is challenging to deploy LLMs to provide efficient and adaptive inference services at the edge. This paper proposes a novel Cloud-Edge Collaboration framework for LLMs (CE-CoLLM) to tackle these challenges. First, we identify the transmission of LLM contextual data between the cloud and edge as a key performance bottleneck, which introduces substantial communication overhead that dominates overall inference latency and makes naïve cloud-edge collaboration for LLMs inefficient. Second, we introduce a suite of novel techniques, including a latency-aware early exit mechanism and efficient cloud context management, into CE-CoLLM, which collectively reduce communication overhead and preserve LLM inference accuracy. Third, we design two adaptive inference modes to accommodate diverse edge environments: (1) a low-latency standalone edge inference mode that enables reliable edge-side independent LLM inference even under unstable network conditions, and (2) a high-accuracy cloud-edge collaborative inference mode that adaptively leverages cloud resources to enhance prediction accuracy. Extensive experiments on multiple benchmark datasets demonstrate that CE-CoLLM reduces overall inference time by up to 13.81% and offloads over 84.53% of the computational workload from the cloud to the edge, compared to conventional cloud-based LLM deployment, without sacrificing prediction accuracy. The code is provided on GitHub at this https URL.
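To give a rough sense of the latency-aware early exit idea described in the abstract, the following minimal Python sketch shows how an edge device might stop layer-wise inference once an intermediate exit head is confident enough or a latency budget is exhausted, and otherwise hand the intermediate context off to the cloud. All function names, thresholds, and the placeholder layer logic here are hypothetical illustrations, not the CE-CoLLM implementation.

import time
import numpy as np

# Hypothetical latency-aware early-exit loop for edge-side LLM inference.
# Thresholds and layer functions are illustrative placeholders only.

CONFIDENCE_THRESHOLD = 0.9   # exit early when the edge model is confident enough
LATENCY_BUDGET_S = 0.05      # per-token budget before settling for the edge result

def softmax(x):
    e = np.exp(x - np.max(x))
    return e / e.sum()

def edge_layer(hidden, layer_idx):
    """Placeholder for one transformer layer computed on the edge device."""
    return hidden  # identity transform, for illustration only

def exit_head(hidden):
    """Placeholder early-exit classifier attached to an intermediate layer."""
    return softmax(hidden)

def infer_token(hidden, num_edge_layers=12):
    start = time.monotonic()
    for layer_idx in range(num_edge_layers):
        hidden = edge_layer(hidden, layer_idx)
        probs = exit_head(hidden)
        confident = probs.max() >= CONFIDENCE_THRESHOLD
        over_budget = (time.monotonic() - start) >= LATENCY_BUDGET_S
        if confident or over_budget:
            # Standalone edge inference mode: return the local prediction.
            return int(np.argmax(probs)), "edge"
    # Collaborative mode: upload the intermediate hidden state (the "context")
    # to the cloud, which runs the remaining layers and returns the final token.
    return cloud_finish(hidden)

def cloud_finish(hidden):
    """Hypothetical cloud-side completion of the remaining layers."""
    probs = exit_head(hidden)
    return int(np.argmax(probs)), "cloud"

if __name__ == "__main__":
    token, source = infer_token(np.random.randn(32000))
    print(f"predicted token {token} via {source}")

In this sketch, the decision to stay on the edge or escalate to the cloud is made per token, which mirrors the adaptive two-mode design the paper describes; the actual context management and exit criteria in CE-CoLLM may differ.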
@article{jin2025_2411.02829, title={CE-CoLLM: Efficient and Adaptive Large Language Models Through Cloud-Edge Collaboration}, author={Hongpeng Jin and Yanzhao Wu}, journal={arXiv preprint arXiv:2411.02829}, year={2025}}