MegaScale-MoE: Large-Scale Communication-Efficient Training of Mixture-of-Experts Models in Production

We present MegaScale-MoE, a production system tailored for the efficient training of large-scale mixture-of-experts (MoE) models. MoE has emerged as a promising architecture for scaling large language models (LLMs) to unprecedented sizes, thereby enhancing model performance. However, existing MoE training systems suffer degraded training efficiency, a problem exacerbated by the escalating scale of MoE models and the continuous evolution of hardware. Recognizing the pivotal role of efficient communication in MoE training, MegaScale-MoE customizes communication-efficient parallelism strategies for attention and FFNs in each MoE layer and adopts a holistic approach to overlap communication with computation at both the inter- and intra-operator levels. Additionally, MegaScale-MoE compresses communication to lower precision with adjusted communication patterns, further improving training efficiency. When training a 352B MoE model on 1,440 NVIDIA Hopper GPUs, MegaScale-MoE achieves a training throughput of 1.41M tokens/s, improving efficiency by 1.88x compared to Megatron-LM. We share our operational experience in accelerating MoE training and hope that, by offering our insights into system design, this work will motivate future research on MoE systems.
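To make two of the ideas mentioned above concrete, the following is a minimal sketch (not MegaScale-MoE's actual implementation) of compressing a token-dispatch collective to lower precision and overlapping it with independent computation, assuming PyTorch with an initialized torch.distributed process group; the function name, tensor shapes, and the equal-split all-to-all are illustrative assumptions only.

import torch
import torch.distributed as dist

def dispatch_tokens_overlapped(tokens_fp32: torch.Tensor,
                               independent_input: torch.Tensor):
    """Illustrative only: send tokens to expert ranks in BF16 while overlapping
    the collective with unrelated local computation. Assumes dist.init_process_group
    has been called and tokens_fp32 splits evenly across ranks (a real MoE dispatch
    would use router-determined, variable split sizes)."""
    # Communication compression: cast the payload to lower precision before the
    # all-to-all to reduce the bytes on the wire.
    payload = tokens_fp32.to(torch.bfloat16)
    recv_buf = torch.empty_like(payload)

    # Launch the all-to-all asynchronously so the GPU can keep computing.
    handle = dist.all_to_all_single(recv_buf, payload, async_op=True)

    # Independent computation proceeds while the collective is in flight,
    # hiding communication latency behind useful work.
    overlapped_out = independent_input @ independent_input.T

    # Wait for the communication to complete, then upcast for the expert FFN.
    handle.wait()
    return recv_buf.to(torch.float32), overlapped_out

In practice, the paper describes overlap at both the inter- and intra-operator levels and with customized communication patterns; this sketch only shows the simplest inter-operator form of the idea.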
@article{jin2025_2505.11432,
  title   = {MegaScale-MoE: Large-Scale Communication-Efficient Training of Mixture-of-Experts Models in Production},
  author  = {Chao Jin and Ziheng Jiang and Zhihao Bai and Zheng Zhong and Juncai Liu and Xiang Li and Ningxin Zheng and Xi Wang and Cong Xie and Qi Huang and Wen Heng and Yiyuan Ma and Wenlei Bao and Size Zheng and Yanghua Peng and Haibin Lin and Xuanzhe Liu and Xin Jin and Xin Liu},
  journal = {arXiv preprint arXiv:2505.11432},
  year    = {2025}
}