GShard: Scaling Giant Models with Conditional Computation and Automatic Sharding
arXiv:2006.16668 · 30 June 2020
Dmitry Lepikhin, HyoukJoong Lee, Yuanzhong Xu, Dehao Chen, Orhan Firat, Yanping Huang, M. Krikun, Noam M. Shazeer, Z. Chen
[MoE]
Papers citing "GShard: Scaling Giant Models with Conditional Computation and Automatic Sharding" (50 of 260 papers shown):
Spatial Mixture-of-Experts. Nikoli Dryden, Torsten Hoefler. [MoE] 24 Nov 2022.
Intelligent Computing: The Latest Advances, Challenges and Future. Shiqiang Zhu, Ting Yu, Tao Xu, Hongyang Chen, Schahram Dustdar, ..., Tariq S. Durrani, Huaimin Wang, Jiangxing Wu, Tongyi Zhang, Yunhe Pan. [AI4CE] 21 Nov 2022.
HMOE: Hypernetwork-based Mixture of Experts for Domain Generalization. Jingang Qu, T. Faney, Zehao Wang, Patrick Gallinari, Soleiman Yousef, J. D. Hemptinne. [OOD] 15 Nov 2022.
Efficiently Scaling Transformer Inference. Reiner Pope, Sholto Douglas, Aakanksha Chowdhery, Jacob Devlin, James Bradbury, Anselm Levskaya, Jonathan Heek, Kefan Xiao, Shivani Agrawal, J. Dean. 09 Nov 2022.
QuantPipe: Applying Adaptive Post-Training Quantization for Distributed Transformer Pipelines in Dynamic Edge Environments. Hong Wang, Connor Imes, Souvik Kundu, P. Beerel, S. Crago, J. Walters. [MQ] 08 Nov 2022.
SpeechMatrix: A Large-Scale Mined Corpus of Multilingual Speech-to-Speech Translations. Paul-Ambroise Duquenne, Hongyu Gong, Ning Dong, Jingfei Du, Ann Lee, Vedanuj Goswani, Changhan Wang, J. Pino, Benoît Sagot, Holger Schwenk. 08 Nov 2022.
AdaMix: Mixture-of-Adaptations for Parameter-efficient Model Tuning. Yaqing Wang, Sahaj Agarwal, Subhabrata Mukherjee, Xiaodong Liu, Jing Gao, Ahmed Hassan Awadallah, Jianfeng Gao. [MoE] 31 Oct 2022.
M³ViT: Mixture-of-Experts Vision Transformer for Efficient Multi-task Learning with Model-Accelerator Co-design. Hanxue Liang, Zhiwen Fan, Rishov Sarkar, Ziyu Jiang, Tianlong Chen, Kai Zou, Yu Cheng, Cong Hao, Zhangyang Wang. [MoE] 26 Oct 2022.
Disentangling Reasoning Capabilities from Language Models with Compositional Reasoning Transformers. Wanjun Zhong, Tingting Ma, Jiahai Wang, Jian Yin, T. Zhao, Chin-Yew Lin, Nan Duan. [LRM, CoGe] 20 Oct 2022.
On the Adversarial Robustness of Mixture of Experts. J. Puigcerver, Rodolphe Jenatton, C. Riquelme, Pranjal Awasthi, Srinadh Bhojanapalli. [OOD, AAML, MoE] 19 Oct 2022.
AMP: Automatically Finding Model Parallel Strategies with Heterogeneity Awareness. Dacheng Li, Hongyi Wang, Eric P. Xing, Haotong Zhang. [MoE] 13 Oct 2022.
Meta-DMoE: Adapting to Domain Shift by Meta-Distillation from Mixture-of-Experts. Tao Zhong, Zhixiang Chi, Li Gu, Yang Wang, Yuanhao Yu, Jingshan Tang. [OOD] 08 Oct 2022.
Optimizing DNN Compilation for Distributed Training with Joint OP and Tensor Fusion. Xiaodong Yi, Shiwei Zhang, Lansong Diao, Chuan Wu, Zhen Zheng, Shiqing Fan, Siyu Wang, Jun Yang, W. Lin. 26 Sep 2022.
HammingMesh: A Network Topology for Large-Scale Deep Learning. Torsten Hoefler, Tommaso Bonato, Daniele De Sensi, Salvatore Di Girolamo, Shigang Li, Marco Heddes, Jon Belk, Deepak Goel, Miguel Castro, Steve Scott. [3DH, GNN, AI4CE] 03 Sep 2022.
Efficient Methods for Natural Language Processing: A Survey. Marcos Vinícius Treviso, Ji-Ung Lee, Tianchu Ji, Betty van Aken, Qingqing Cao, ..., Emma Strubell, Niranjan Balasubramanian, Leon Derczynski, Iryna Gurevych, Roy Schwartz. 31 Aug 2022.
Efficient Sparsely Activated Transformers. Salar Latifi, Saurav Muralidharan, M. Garland. [MoE] 31 Aug 2022.
The Neural Race Reduction: Dynamics of Abstraction in Gated Networks. Andrew M. Saxe, Shagun Sodhani, Sam Lewallen. [AI4CE] 21 Jul 2022.
MoEC: Mixture of Expert Clusters. Yuan Xie, Shaohan Huang, Tianyu Chen, Furu Wei. [MoE] 19 Jul 2022.
Neural Implicit Dictionary via Mixture-of-Expert Training. Peihao Wang, Zhiwen Fan, Tianlong Chen, Zhangyang Wang. 08 Jul 2022.
Machine Learning Model Sizes and the Parameter Gap. Pablo Villalobos, J. Sevilla, T. Besiroglu, Lennart Heim, A. Ho, Marius Hobbhahn. [ALM, ELM, AI4CE] 05 Jul 2022.
DeepSpeed Inference: Enabling Efficient Inference of Transformer Models at Unprecedented Scale. Reza Yazdani Aminabadi, Samyam Rajbhandari, Minjia Zhang, A. A. Awan, Cheng-rong Li, ..., Elton Zheng, Jeff Rasley, Shaden Smith, Olatunji Ruwase, Yuxiong He. 30 Jun 2022.
Emergent Abilities of Large Language Models. Jason W. Wei, Yi Tay, Rishi Bommasani, Colin Raffel, Barret Zoph, ..., Tatsunori Hashimoto, Oriol Vinyals, Percy Liang, J. Dean, W. Fedus. [ELM, ReLM, LRM] 15 Jun 2022.
Tutel: Adaptive Mixture-of-Experts at Scale. Changho Hwang, Wei Cui, Yifan Xiong, Ziyue Yang, Ze Liu, ..., Joe Chau, Peng Cheng, Fan Yang, Mao Yang, Y. Xiong. [MoE] 07 Jun 2022.
Gating Dropout: Communication-efficient Regularization for Sparsely Activated Transformers. R. Liu, Young Jin Kim, Alexandre Muzio, Hany Awadalla. [MoE] 28 May 2022.
Analyzing Tree Architectures in Ensembles via Neural Tangent Kernel. Ryuichi Kanoh, M. Sugiyama. 25 May 2022.
Sparse Mixers: Combining MoE and Mixing to build a more efficient BERT. James Lee-Thorp, Joshua Ainslie. [MoE] 24 May 2022.
TransTab: Learning Transferable Tabular Transformers Across Tables. Zifeng Wang, Jimeng Sun. [LMTD] 19 May 2022.
UL2: Unifying Language Learning Paradigms. Yi Tay, Mostafa Dehghani, Vinh Q. Tran, Xavier Garcia, Jason W. Wei, ..., Tal Schuster, H. Zheng, Denny Zhou, N. Houlsby, Donald Metzler. [AI4CE] 10 May 2022.
CoCa: Contrastive Captioners are Image-Text Foundation Models. Jiahui Yu, Zirui Wang, Vijay Vasudevan, Legg Yeung, Mojtaba Seyedhosseini, Yonghui Wu. [VLM, CLIP, OffRL] 04 May 2022.
MiCS: Near-linear Scaling for Training Gigantic Model on Public Cloud. Zhen Zhang, Shuai Zheng, Yida Wang, Justin Chiu, George Karypis, Trishul Chilimbi, Mu Li, Xin Jin. 30 Apr 2022.
Residual Mixture of Experts. Lemeng Wu, Mengchen Liu, Yinpeng Chen, Dongdong Chen, Xiyang Dai, Lu Yuan. [MoE] 20 Apr 2022.
MoEBERT: from BERT to Mixture-of-Experts via Importance-Guided Adaptation. Simiao Zuo, Qingru Zhang, Chen Liang, Pengcheng He, T. Zhao, Weizhu Chen. [MoE] 15 Apr 2022.
PICASSO: Unleashing the Potential of GPU-centric Training for Wide-and-deep Recommender Systems. Yuanxing Zhang, Langshi Chen, Siran Yang, Man Yuan, Hui-juan Yi, ..., Yong Li, Dingyang Zhang, Wei Lin, Lin Qu, Bo Zheng. 11 Apr 2022.
3M: Multi-loss, Multi-path and Multi-level Neural Networks for speech recognition. Zhao You, Shulin Feng, Dan Su, Dong Yu. 07 Apr 2022.
Pathways: Asynchronous Distributed Dataflow for ML. P. Barham, Aakanksha Chowdhery, J. Dean, Sanjay Ghemawat, Steven Hand, ..., Parker Schuh, Ryan Sepassi, Laurent El Shafey, C. A. Thekkath, Yonghui Wu. [GNN, MoE] 23 Mar 2022.
Efficient Language Modeling with Sparse all-MLP. Ping Yu, Mikel Artetxe, Myle Ott, Sam Shleifer, Hongyu Gong, Ves Stoyanov, Xian Li. [MoE] 14 Mar 2022.
SkillNet-NLU: A Sparsely Activated Model for General-Purpose Natural Language Understanding. Fan Zhang, Duyu Tang, Yong Dai, Cong Zhou, Shuangzhi Wu, Shuming Shi. [CLL, MoE] 07 Mar 2022.
Aggregated Pyramid Vision Transformer: Split-transform-merge Strategy for Image Recognition without Convolutions. Ruikang Ju, Ting-Yu Lin, Jen-Shiun Chiang, Jia-Hao Jian, Yu-Shian Lin, Liu-Rui-Yi Huang. [ViT] 02 Mar 2022.
DeepNet: Scaling Transformers to 1,000 Layers. Hongyu Wang, Shuming Ma, Li Dong, Shaohan Huang, Dongdong Zhang, Furu Wei. [MoE, AI4CE] 01 Mar 2022.
BagPipe: Accelerating Deep Recommendation Model Training. Saurabh Agarwal, Chengpo Yan, Ziyi Zhang, Shivaram Venkataraman. 24 Feb 2022.
Mixture-of-Experts with Expert Choice Routing. Yan-Quan Zhou, Tao Lei, Han-Chu Liu, Nan Du, Yanping Huang, Vincent Zhao, Andrew M. Dai, Zhifeng Chen, Quoc V. Le, James Laudon. [MoE] 18 Feb 2022.
ST-MoE: Designing Stable and Transferable Sparse Expert Models. Barret Zoph, Irwan Bello, Sameer Kumar, Nan Du, Yanping Huang, J. Dean, Noam M. Shazeer, W. Fedus. [MoE] 17 Feb 2022.
Compute Trends Across Three Eras of Machine Learning. J. Sevilla, Lennart Heim, A. Ho, T. Besiroglu, Marius Hobbhahn, Pablo Villalobos. 11 Feb 2022.
Efficient Direct-Connect Topologies for Collective Communications. Liangyu Zhao, Siddharth Pal, Tapan Chugh, Weiyang Wang, Jason Fantl, P. Basu, J. Khoury, Arvind Krishnamurthy. 07 Feb 2022.
The Ecological Footprint of Neural Machine Translation Systems. D. Shterionov, Eva Vanmassenhove. 04 Feb 2022.
Harmony: Overcoming the Hurdles of GPU Memory Capacity to Train Massive DNN Models on Commodity Servers. Youjie Li, Amar Phanishayee, D. Murray, Jakub Tarnawski, N. Kim. 02 Feb 2022.
Unified Scaling Laws for Routed Language Models. Aidan Clark, Diego de Las Casas, Aurelia Guy, A. Mensch, Michela Paganini, ..., Oriol Vinyals, Jack W. Rae, Erich Elsen, Koray Kavukcuoglu, Karen Simonyan. [MoE] 02 Feb 2022.
Using DeepSpeed and Megatron to Train Megatron-Turing NLG 530B, A Large-Scale Generative Language Model. Shaden Smith, M. Patwary, Brandon Norick, P. LeGresley, Samyam Rajbhandari, ..., M. Shoeybi, Yuxiong He, Michael Houston, Saurabh Tiwary, Bryan Catanzaro. [MoE] 28 Jan 2022.
One Student Knows All Experts Know: From Sparse to Dense. Fuzhao Xue, Xiaoxin He, Xiaozhe Ren, Yuxuan Lou, Yang You. [MoMe, MoE] 26 Jan 2022.
DeepSpeed-MoE: Advancing Mixture-of-Experts Inference and Training to Power Next-Generation AI Scale. Samyam Rajbhandari, Conglong Li, Z. Yao, Minjia Zhang, Reza Yazdani Aminabadi, A. A. Awan, Jeff Rasley, Yuxiong He. 14 Jan 2022.