With the advances in artificial intelligence, Mixture-of-Experts (MoE) has become a mainstream architecture for Large Language Models (LLMs), and its demand for model compression keeps growing. Quantization is an effective method that not only compresses models but also significantly accelerates inference. Existing quantization methods have gradually shifted their focus from parameter scaling to the analysis of data distributions. However, these analyses are designed for dense LLMs and are suboptimal for MoE quantization because of MoEs' complex data-model distribution. To address this problem, we decouple the complexity of MoEs' data-model distribution into a multi-stage analysis and reveal MoEs' inherent dynamics. The analysis shows that the performance of MoE experts varies dynamically both within and across data distributions. Based on these findings, we design two quantization strategies with data-model distribution awareness and integrate them into an end-to-end framework for MoE quantization, named MoQa. MoQa applies an expert-level mixed-precision base quantization with distribution awareness. Moreover, MoQa applies a channel-level quantization adjustment that dynamically tunes expert performance to adapt to novel distributions. Experiments show that MoQa's base quantization achieves a 0.49~8.51 PPL decrease on known distributions. With the adjustments, MoQa achieves a 2.74~6.44 PPL decrease and 1.85%~3.77% average accuracy improvements on novel distributions. We believe MoQa will play a role in future MoE construction, optimization, and compression.
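To make the idea of expert-level mixed-precision quantization concrete, the following is a minimal sketch, not the authors' implementation: it assumes a hypothetical calibration statistic (how often the router activates each expert) to assign per-expert bit-widths, and then applies simple symmetric per-channel quantization to each expert's weight matrix. All function and variable names (assign_bit_widths, quantize_per_channel, activation_freqs) are illustrative assumptions.

```python
# Hypothetical sketch of expert-level mixed-precision quantization for an MoE layer.
# Assumption: experts activated more often on the calibration distribution get more bits.
import numpy as np

def assign_bit_widths(activation_freqs, budgets=(2, 4, 8)):
    """Map each expert's activation frequency to a candidate bit-width."""
    order = np.argsort(activation_freqs)           # least- to most-activated experts
    bits = np.empty(len(activation_freqs), dtype=int)
    # Split experts into as many groups as there are candidate bit-widths.
    for group, expert_ids in enumerate(np.array_split(order, len(budgets))):
        bits[expert_ids] = budgets[group]
    return bits

def quantize_per_channel(weight, n_bits):
    """Symmetric per-output-channel quantization of a 2-D weight matrix."""
    qmax = 2 ** (n_bits - 1) - 1
    scale = np.abs(weight).max(axis=1, keepdims=True) / qmax   # one scale per row
    q = np.clip(np.round(weight / scale), -qmax - 1, qmax)
    return q * scale    # dequantized weights, for inspecting the quantization error

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    num_experts, d_in, d_out = 8, 64, 64
    experts = [rng.normal(size=(d_out, d_in)) for _ in range(num_experts)]
    # Hypothetical calibration statistic: router selection frequency per expert.
    activation_freqs = rng.dirichlet(np.ones(num_experts))

    bits = assign_bit_widths(activation_freqs)
    for e, (w, b) in enumerate(zip(experts, bits)):
        w_hat = quantize_per_channel(w, b)
        print(f"expert {e}: {b}-bit, mean abs error {np.abs(w - w_hat).mean():.4f}")
```

The per-channel scales in this sketch are also the natural place for the channel-level adjustment the abstract mentions: rescaling individual channels would let a quantized expert be re-tuned for a new data distribution without re-running the full base quantization, though the actual adjustment rule used by MoQa is described in the paper itself.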
@article{zheng2025_2503.21135,
  title   = {MoQa: Rethinking MoE Quantization with Multi-stage Data-model Distribution Awareness},
  author  = {Zihao Zheng and Xiuping Cui and Size Zheng and Maoliang Li and Jiayu Chen and Yun Liang and Xiang Chen},
  journal = {arXiv preprint arXiv:2503.21135},
  year    = {2025}
}