
Communication-Efficient Hybrid Language Model via Uncertainty-Aware Opportunistic and Compressed Transmission

Abstract

To support emerging language-based applications using dispersed and heterogeneous computing resources, the hybrid language model (HLM) offers a promising architecture, where an on-device small language model (SLM) generates draft tokens that are validated and corrected by a remote large language model (LLM). However, the original HLM suffers from substantial communication overhead, as the LLM requires the SLM to upload the full vocabulary distribution for each token. Moreover, both communication and computation resources are wasted when the LLM validates tokens that are highly likely to be accepted. To overcome these limitations, we propose the communication-efficient and uncertainty-aware HLM (CU-HLM). In CU-HLM, the SLM transmits truncated vocabulary distributions only when its output uncertainty is high. We validate the feasibility of this opportunistic transmission by discovering a strong correlation between the SLM's uncertainty and the LLM's rejection probability. Furthermore, we theoretically derive optimal uncertainty thresholds and optimal vocabulary truncation strategies. Simulation results show that, compared to the standard HLM, CU-HLM achieves up to 206× higher token throughput by skipping 74.8% of transmissions with 97.4% vocabulary compression, while maintaining 97.4% accuracy.
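
To make the opportunistic scheme concrete, the Python sketch below gates uploads on the entropy of the SLM's token distribution and truncates the uploaded vocabulary to its top-k entries. The use of entropy as the uncertainty measure, the threshold value, and the greedy top-k truncation rule are illustrative assumptions for this sketch only; the paper derives the optimal thresholds and truncation strategies analytically.

import numpy as np

def slm_step(probs: np.ndarray, uncertainty_threshold: float, top_k: int):
    """Decide whether to upload the SLM's token distribution to the LLM.

    probs: the SLM's full vocabulary distribution for the current token.
    Returns (token_id, payload): payload is None when the draft token is
    accepted locally (transmission skipped), or a truncated (id, prob)
    list to upload otherwise.
    """
    token_id = int(np.argmax(probs))                   # greedy draft token
    entropy = -np.sum(probs * np.log(probs + 1e-12))   # uncertainty proxy

    if entropy <= uncertainty_threshold:
        # Low uncertainty: skip transmission; the LLM is likely to accept.
        return token_id, None

    # High uncertainty: upload only the top-k entries (vocabulary
    # compression) instead of the full distribution.
    top_ids = np.argsort(probs)[-top_k:][::-1]
    truncated = probs[top_ids]
    truncated /= truncated.sum()                       # renormalize after truncation
    return token_id, list(zip(top_ids.tolist(), truncated.tolist()))

# Example: a peaked distribution is kept on-device; a flat one is uploaded.
vocab = 8
peaked = np.full(vocab, 0.01); peaked[3] = 0.93
flat = np.full(vocab, 1.0 / vocab)
print(slm_step(peaked, uncertainty_threshold=1.0, top_k=3))  # (3, None)
print(slm_step(flat, uncertainty_threshold=1.0, top_k=3))    # (0, [(...), ...])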

@article{oh2025_2505.11788,
  title={Communication-Efficient Hybrid Language Model via Uncertainty-Aware Opportunistic and Compressed Transmission},
  author={Seungeun Oh and Jinhyuk Kim and Jihong Park and Seung-Woo Ko and Jinho Choi and Tony Q. S. Quek and Seong-Lyun Kim},
  journal={arXiv preprint arXiv:2505.11788},
  year={2025}
}