MaskMoE: Boosting Token-Level Learning via Routing Mask in Mixture-of-Experts
arXiv: 2407.09816 · 13 July 2024 · MoE
Zhenpeng Su, Zijia Lin, Xue Bai, Xing Wu, Yizhe Xiong, Haoran Lian, Guangyuan Ma, Hui Chen, Guiguang Ding, Wei Zhou, Songlin Hu

Papers citing "MaskMoE: Boosting Token-Level Learning via Routing Mask in Mixture-of-Experts" (16 of 16 papers shown)

A Comprehensive Survey of Mixture-of-Experts: Algorithms, Theory, and Applications
Siyuan Mu, Sen Lin · 10 Mar 2025 · MoE

CartesianMoE: Boosting Knowledge Sharing among Experts via Cartesian Product Routing in Mixture-of-Experts
Zhenpeng Su, Xing Wu, Zijia Lin, Yizhe Xiong, Minxuan Lv, Guangyuan Ma, Hui Chen, Songlin Hu, Guiguang Ding · 21 Oct 2024 · MoE

Temporal Scaling Law for Large Language Models
Yizhe Xiong, Xiansheng Chen, Xin Ye, Hui Chen, Zijia Lin, ..., Zhenpeng Su, Wei Huang, Jianwei Niu, Jiawei Han, Guiguang Ding · 27 Apr 2024

ST-MoE: Designing Stable and Transferable Sparse Expert Models
Barret Zoph, Irwan Bello, Sameer Kumar, Nan Du, Yanping Huang, J. Dean, Noam M. Shazeer, W. Fedus · 17 Feb 2022 · MoE

DeepSpeed-MoE: Advancing Mixture-of-Experts Inference and Training to Power Next-Generation AI Scale
Samyam Rajbhandari, Conglong Li, Z. Yao, Minjia Zhang, Reza Yazdani Aminabadi, A. A. Awan, Jeff Rasley, Yuxiong He · 14 Jan 2022

GLaM: Efficient Scaling of Language Models with Mixture-of-Experts
Nan Du, Yanping Huang, Andrew M. Dai, Simon Tong, Dmitry Lepikhin, ..., Kun Zhang, Quoc V. Le, Yonghui Wu, Zhiwen Chen, Claire Cui · 13 Dec 2021 · ALM, MoE

Hash Layers For Large Sparse Models
Stephen Roller, Sainbayar Sukhbaatar, Arthur Szlam, Jason Weston · 08 Jun 2021 · MoE

BASE Layers: Simplifying Training of Large, Sparse Models
M. Lewis, Shruti Bhosale, Tim Dettmers, Naman Goyal, Luke Zettlemoyer · 30 Mar 2021 · MoE

GLM: General Language Model Pretraining with Autoregressive Blank Infilling
Zhengxiao Du, Yujie Qian, Xiao Liu, Ming Ding, J. Qiu, Zhilin Yang, Jie Tang · 18 Mar 2021 · BDL, AI4CE

The Pile: An 800GB Dataset of Diverse Text for Language Modeling
Leo Gao, Stella Biderman, Sid Black, Laurence Golding, Travis Hoppe, ..., Horace He, Anish Thite, Noa Nabeshima, Shawn Presser, Connor Leahy · 31 Dec 2020 · AIMat

Scaling Laws for Neural Language Models
Jared Kaplan, Sam McCandlish, T. Henighan, Tom B. Brown, B. Chess, R. Child, Scott Gray, Alec Radford, Jeff Wu, Dario Amodei · 23 Jan 2020

RoBERTa: A Robustly Optimized BERT Pretraining Approach
Yinhan Liu, Myle Ott, Naman Goyal, Jingfei Du, Mandar Joshi, Danqi Chen, Omer Levy, M. Lewis, Luke Zettlemoyer, Veselin Stoyanov · 26 Jul 2019 · AIMat

BoolQ: Exploring the Surprising Difficulty of Natural Yes/No Questions
Christopher Clark, Kenton Lee, Ming-Wei Chang, Tom Kwiatkowski, Michael Collins, Kristina Toutanova · 24 May 2019

Unified Language Model Pre-training for Natural Language Understanding and Generation
Li Dong, Nan Yang, Wenhui Wang, Furu Wei, Xiaodong Liu, Yu Wang, Jianfeng Gao, M. Zhou, H. Hon · 08 May 2019 · ELM, AI4CE

Deep contextualized word representations
Matthew E. Peters, Mark Neumann, Mohit Iyyer, Matt Gardner, Christopher Clark, Kenton Lee, Luke Zettlemoyer · 15 Feb 2018 · NAI

TriviaQA: A Large Scale Distantly Supervised Challenge Dataset for Reading Comprehension
Mandar Joshi, Eunsol Choi, Daniel S. Weld, Luke Zettlemoyer · 09 May 2017 · RALM