DeepSpeed-MoE: Advancing Mixture-of-Experts Inference and Training to Power Next-Generation AI Scale
Samyam Rajbhandari, Conglong Li, Z. Yao, Minjia Zhang, Reza Yazdani Aminabadi, A. A. Awan, Jeff Rasley, Yuxiong He
14 January 2022 · arXiv: 2201.05596
Papers citing "DeepSpeed-MoE: Advancing Mixture-of-Experts Inference and Training to Power Next-Generation AI Scale" (39 of 189 papers shown)
Computation vs. Communication Scaling for Future Transformers on Future Hardware. Suchita Pati, Shaizeen Aga, Mahzabeen Islam, Nuwan Jayasena, Matthew D. Sinclair. 06 Feb 2023.
Alternating Updates for Efficient Transformers. Cenk Baykal, D. Cutler, Nishanth Dikkala, Nikhil Ghosh, Rina Panigrahy, Xin Wang. 30 Jan 2023. (MoE)
The Effects of In-domain Corpus Size on pre-training BERT. Chris Sanchez, Zheyu Zhang. 15 Dec 2022. (AI4CE)
Fixing MoE Over-Fitting on Low-Resource Languages in Multilingual Machine Translation. Maha Elbayad, Anna Y. Sun, Shruti Bhosale. 15 Dec 2022. (MoE)
Elixir: Train a Large Language Model on a Small GPU Cluster. Haichen Huang, Jiarui Fang, Hongxin Liu, Shenggui Li, Yang You. 10 Dec 2022. (VLM)
SMILE: Scaling Mixture-of-Experts with Efficient Bi-level Routing. Chaoyang He, Shuai Zheng, Aston Zhang, George Karypis, Trishul Chilimbi, Mahdi Soltanolkotabi, Salman Avestimehr. 10 Dec 2022. (MoE)
DeepSpeed Data Efficiency: Improving Deep Learning Model Quality and Training Efficiency via Efficient Data Sampling and Routing. Conglong Li, Z. Yao, Xiaoxia Wu, Minjia Zhang, Connor Holmes, Cheng Li, Yuxiong He. 07 Dec 2022.
Who Says Elephants Can't Run: Bringing Large Scale MoE Models into Cloud Scale Production. Young Jin Kim, Rawn Henry, Raffy Fahim, Hany Awadalla. 18 Nov 2022. (MoE)
Accelerating Distributed MoE Training and Inference with Lina. Jiamin Li, Yimin Jiang, Yibo Zhu, Cong Wang, Hong-Yu Xu. 31 Oct 2022. (MoE)
AutoMoE: Heterogeneous Mixture-of-Experts with Adaptive Computation for Efficient Neural Machine Translation. Ganesh Jawahar, Subhabrata Mukherjee, Xiaodong Liu, Young Jin Kim, Muhammad Abdul-Mageed, L. Lakshmanan, Ahmed Hassan Awadallah, Sébastien Bubeck, Jianfeng Gao. 14 Oct 2022. (MoE)
Compute-Efficient Deep Learning: Algorithmic Trends and Opportunities. Brian Bartoldson, B. Kailkhura, Davis W. Blalock. 13 Oct 2022.
The Lazy Neuron Phenomenon: On Emergence of Activation Sparsity in Transformers. Zong-xiao Li, Chong You, Srinadh Bhojanapalli, Daliang Li, A. S. Rawat, ..., Kenneth Q Ye, Felix Chern, Felix X. Yu, Ruiqi Guo, Surinder Kumar. 12 Oct 2022. (MoE)
Mixture of Attention Heads: Selecting Attention Heads Per Token. Xiaofeng Zhang, Songlin Yang, Zeyu Huang, Jie Zhou, Wenge Rong, Zhang Xiong. 11 Oct 2022. (MoE)
Social and environmental impact of recent developments in machine learning on biology and chemistry research. Daniel Probst. 01 Oct 2022.
A Review of Sparse Expert Models in Deep Learning. W. Fedus, J. Dean, Barret Zoph. 04 Sep 2022. (MoE)
Efficient Methods for Natural Language Processing: A Survey. Marcos Vinícius Treviso, Ji-Ung Lee, Tianchu Ji, Betty van Aken, Qingqing Cao, ..., Emma Strubell, Niranjan Balasubramanian, Leon Derczynski, Iryna Gurevych, Roy Schwartz. 31 Aug 2022.
DeepSpeed Inference: Enabling Efficient Inference of Transformer Models at Unprecedented Scale. Reza Yazdani Aminabadi, Samyam Rajbhandari, Minjia Zhang, A. A. Awan, Cheng-rong Li, ..., Elton Zheng, Jeff Rasley, Shaden Smith, Olatunji Ruwase, Yuxiong He. 30 Jun 2022.
Alexa Teacher Model: Pretraining and Distilling Multi-Billion-Parameter Encoders for Natural Language Understanding Systems. Jack G. M. FitzGerald, Shankar Ananthakrishnan, Konstantine Arkoudas, Davide Bernardi, Abhishek Bhagia, ..., Pan Wei, Haiyang Yu, Shuai Zheng, Gokhan Tur, Premkumar Natarajan. 15 Jun 2022. (ELM)
Tutel: Adaptive Mixture-of-Experts at Scale. Changho Hwang, Wei Cui, Yifan Xiong, Ziyue Yang, Ze Liu, ..., Joe Chau, Peng Cheng, Fan Yang, Mao Yang, Y. Xiong. 07 Jun 2022. (MoE)
Task-Specific Expert Pruning for Sparse Mixture-of-Experts. Tianyu Chen, Shaohan Huang, Yuan Xie, Binxing Jiao, Daxin Jiang, Haoyi Zhou, Jianxin Li, Furu Wei. 01 Jun 2022. (MoE)
Lossless Acceleration for Seq2seq Generation with Aggressive Decoding. Tao Ge, Heming Xia, Xin Sun, Si-Qing Chen, Furu Wei. 20 May 2022.
Sparsely-gated Mixture-of-Expert Layers for CNN Interpretability. Svetlana Pavlitska, Christian Hubschneider, Lukas Struppek, J. Marius Zöllner. 22 Apr 2022. (MoE)
HetuMoE: An Efficient Trillion-scale Mixture-of-Expert Distributed Training System. Xiaonan Nie, Pinxue Zhao, Xupeng Miao, Tong Zhao, Bin Cui. 28 Mar 2022. (MoE)
Parameter-Efficient Mixture-of-Experts Architecture for Pre-trained Language Models. Ze-Feng Gao, Peiyu Liu, Wayne Xin Zhao, Zhong-Yi Lu, Ji-Rong Wen. 02 Mar 2022. (MoE)
FastFold: Reducing AlphaFold Training Time from 11 Days to 67 Hours. Shenggan Cheng, Xuanlei Zhao, Guangyang Lu, Bin-Rui Li, Zhongming Yu, Tian Zheng, R. Wu, Xiwen Zhang, Jian Peng, Yang You. 02 Mar 2022. (AI4CE)
A Survey on Dynamic Neural Networks for Natural Language Processing. Canwen Xu, Julian McAuley. 15 Feb 2022. (AI4CE)
Efficient Direct-Connect Topologies for Collective Communications. Liangyu Zhao, Siddharth Pal, Tapan Chugh, Weiyang Wang, Jason Fantl, P. Basu, J. Khoury, Arvind Krishnamurthy. 07 Feb 2022.
Unified Scaling Laws for Routed Language Models. Aidan Clark, Diego de Las Casas, Aurelia Guy, A. Mensch, Michela Paganini, ..., Oriol Vinyals, Jack W. Rae, Erich Elsen, Koray Kavukcuoglu, Karen Simonyan. 02 Feb 2022. (MoE)
TopoOpt: Co-optimizing Network Topology and Parallelization Strategy for Distributed Training Jobs. Weiyang Wang, Moein Khazraee, Zhizhen Zhong, M. Ghobadi, Zhihao Jia, Dheevatsa Mudigere, Ying Zhang, A. Kewitsch. 01 Feb 2022.
One Student Knows All Experts Know: From Sparse to Dense. Fuzhao Xue, Xiaoxin He, Xiaozhe Ren, Yuxuan Lou, Yang You. 26 Jan 2022. (MoMe, MoE)
Efficient Large Scale Language Modeling with Mixtures of Experts. Mikel Artetxe, Shruti Bhosale, Naman Goyal, Todor Mihaylov, Myle Ott, ..., Jeff Wang, Luke Zettlemoyer, Mona T. Diab, Zornitsa Kozareva, Ves Stoyanov. 20 Dec 2021. (MoE)
M6-10T: A Sharing-Delinking Paradigm for Efficient Multi-Trillion Parameter Pretraining. Junyang Lin, An Yang, Jinze Bai, Chang Zhou, Le Jiang, ..., Jie Zhang, Yong Li, Wei Lin, Jingren Zhou, Hongxia Yang. 08 Oct 2021. (MoE)
MoEfication: Transformer Feed-forward Layers are Mixtures of Experts. Zhengyan Zhang, Yankai Lin, Zhiyuan Liu, Peng Li, Maosong Sun, Jie Zhou. 05 Oct 2021. (MoE)
LIBRA: Enabling Workload-aware Multi-dimensional Network Topology Optimization for Distributed Training of Large AI Models. William Won, Saeed Rashidi, Sudarshan Srinivasan, T. Krishna. 24 Sep 2021. (AI4CE)
Scalable and Efficient MoE Training for Multitask Multilingual Models. Young Jin Kim, A. A. Awan, Alexandre Muzio, Andres Felipe Cruz Salinas, Liyang Lu, Amr Hendy, Samyam Rajbhandari, Yuxiong He, Hany Awadalla. 22 Sep 2021. (MoE)
Scaling Laws for Neural Language Models. Jared Kaplan, Sam McCandlish, T. Henighan, Tom B. Brown, B. Chess, R. Child, Scott Gray, Alec Radford, Jeff Wu, Dario Amodei. 23 Jan 2020.
Megatron-LM: Training Multi-Billion Parameter Language Models Using Model Parallelism. M. Shoeybi, M. Patwary, Raul Puri, P. LeGresley, Jared Casper, Bryan Catanzaro. 17 Sep 2019. (MoE)
Q-BERT: Hessian Based Ultra Low Precision Quantization of BERT. Sheng Shen, Zhen Dong, Jiayu Ye, Linjian Ma, Z. Yao, A. Gholami, Michael W. Mahoney, Kurt Keutzer. 12 Sep 2019. (MQ)
GLUE: A Multi-Task Benchmark and Analysis Platform for Natural Language Understanding. Alex Jinpeng Wang, Amanpreet Singh, Julian Michael, Felix Hill, Omer Levy, Samuel R. Bowman. 20 Apr 2018. (ELM)