Towards More Effective and Economic Sparsely-Activated Model

14 October 2021
Hao Jiang
Ke Zhan
Jianwei Qu
Yongkang Wu
Zhaoye Fei
Xinyu Zhang
Lei Chen
Zhicheng Dou
Xipeng Qiu
Zi-Han Guo
Ruofei Lai
Jiawen Wu
Enrui Hu
Yinxia Zhang
Yantao Jia
Fan Yu
Zhao Cao
MoE
Abstract

Sparsely-activated models have achieved great success in natural language processing through large-scale parameters and relatively low computational cost, and have gradually become a feasible technique for training and deploying extremely large models. Because of communication costs, activating multiple experts is hardly affordable during training and inference, so previous work usually activates just one expert at a time to avoid the additional communication cost. Such a routing mechanism limits the upper bound of model performance. In this paper, we first investigate the phenomenon that increasing the number of activated experts can boost model performance at a higher sparse ratio. To increase the number of activated experts without increasing computational cost, we propose SAM (Switch and Mixture) routing, an efficient hierarchical routing mechanism that activates multiple experts on the same device (GPU). Our method sheds light on the training of extremely large sparse models, and experiments show that our models achieve significant performance gains with great efficiency improvements.
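As a concrete picture of the two-level routing the abstract describes, below is a minimal PyTorch sketch, assuming a top-1 "switch" gate over devices followed by a softmax "mixture" gate over the experts co-located on the chosen device. The module name SAMRouter and the parameters num_devices and experts_per_device are illustrative assumptions, not the paper's released code.

import torch
import torch.nn as nn
import torch.nn.functional as F

class SAMRouter(nn.Module):
    """Hypothetical two-level router: switch over devices, mixture within a device."""

    def __init__(self, d_model: int, num_devices: int, experts_per_device: int):
        super().__init__()
        self.num_devices = num_devices
        self.experts_per_device = experts_per_device
        # Level 1: a Switch-style gate picks exactly one device per token.
        self.device_gate = nn.Linear(d_model, num_devices)
        # Level 2: a mixture gate scores every expert; only the chosen
        # device's slice is used, so no cross-device expert is activated.
        self.expert_gate = nn.Linear(d_model, num_devices * experts_per_device)

    def forward(self, x: torch.Tensor):
        # x: (num_tokens, d_model)
        device_probs = F.softmax(self.device_gate(x), dim=-1)   # (T, D)
        device_idx = device_probs.argmax(dim=-1)                # (T,)

        # Mixture weights over the experts co-located on the selected device.
        logits = self.expert_gate(x).view(-1, self.num_devices, self.experts_per_device)
        local = logits[torch.arange(x.size(0)), device_idx]     # (T, E)
        expert_weights = F.softmax(local, dim=-1)

        # Scale by the device gate value so the device choice stays trainable,
        # as in Switch-style top-1 routing.
        gate = device_probs.gather(1, device_idx.unsqueeze(1))  # (T, 1)
        return device_idx, expert_weights * gate

# Example: route 16 tokens among 8 devices holding 4 experts each.
router = SAMRouter(d_model=512, num_devices=8, experts_per_device=4)
device_idx, weights = router(torch.randn(16, 512))  # weights: (16, 4)

Because every activated expert lives on the single device chosen by the switch gate, the all-to-all communication volume stays at the level of one-expert routing, while each token still receives a mixture of several experts.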
