BASE Layers: Simplifying Training of Large, Sparse Models

30 March 2021
M. Lewis
Shruti Bhosale
Tim Dettmers
Naman Goyal
Luke Zettlemoyer
MoE

Papers citing "BASE Layers: Simplifying Training of Large, Sparse Models"

50 / 208 papers shown
SiDA-MoE: Sparsity-Inspired Data-Aware Serving for Efficient and Scalable Large Mixture-of-Experts Models
Zhixu Du
Shiyu Li
Yuhao Wu
Xiangyu Jiang
Jingwei Sun
Qilin Zheng
Yongkai Wu
Ang Li
Hai Helen Li
Yiran Chen
MoE
43
13
0
29 Oct 2023
Mixture of Tokens: Efficient LLMs through Cross-Example Aggregation
Szymon Antoniak
Sebastian Jaszczur
Michal Krutul
Maciej Pióro
Jakub Krajewski
Jan Ludziejewski
Tomasz Odrzygóźdź
Marek Cygan
MoE
16
1
0
24 Oct 2023
Large Language Models are Visual Reasoning Coordinators
Liangyu Chen
Bo Li
Sheng Shen
Jingkang Yang
Chunyuan Li
Kurt Keutzer
Trevor Darrell
Ziwei Liu
VLM
LRM
41
51
0
23 Oct 2023
Approximating Two-Layer Feedforward Networks for Efficient Transformers
Róbert Csordás
Kazuki Irie
Jürgen Schmidhuber
MoE
27
18
0
16 Oct 2023
G-SPEED: General SParse Efficient Editing MoDel
Haoke Zhang
Yue Wang
Juntao Li
Xiabing Zhou
Min Zhang
SyDa
KELM
30
1
0
16 Oct 2023
Diversifying the Mixture-of-Experts Representation for Language Models with Orthogonal Optimizer
Boan Liu
Liang Ding
Li Shen
Keqin Peng
Yu Cao
Dazhao Cheng
Dacheng Tao
MoE
36
7
0
15 Oct 2023
Sparse Backpropagation for MoE Training
Liyuan Liu
Jianfeng Gao
Weizhu Chen
MoE
34
9
0
01 Oct 2023
ConPET: Continual Parameter-Efficient Tuning for Large Language Models
Chenyan Song
Xu Han
Zheni Zeng
Kuai Li
Chen Chen
Zhiyuan Liu
Maosong Sun
Taojiannan Yang
CLL
KELM
21
10
0
26 Sep 2023
Pushing Mixture of Experts to the Limit: Extremely Parameter Efficient MoE for Instruction Tuning
Ted Zadouri
Ahmet Üstün
Arash Ahmadian
Beyza Ermiş
Acyr Locatelli
Sara Hooker
MoE
42
89
0
11 Sep 2023
Pre-gated MoE: An Algorithm-System Co-Design for Fast and Scalable Mixture-of-Expert Inference
Ranggi Hwang
Jianyu Wei
Shijie Cao
Changho Hwang
Xiaohu Tang
Ting Cao
Mao Yang
MoE
58
41
0
23 Aug 2023
EVE: Efficient Vision-Language Pre-training with Masked Prediction and Modality-Aware MoE
Junyi Chen
Longteng Guo
Jianxiang Sun
Shuai Shao
Zehuan Yuan
Liang Lin
Dongyu Zhang
MLLM
VLM
MoE
60
9
0
23 Aug 2023
Enhancing NeRF akin to Enhancing LLMs: Generalizable NeRF Transformer with Mixture-of-View-Experts
Wenyan Cong
Hanxue Liang
Peihao Wang
Zhiwen Fan
Tianlong Chen
M. Varma
Yi Wang
Zhangyang Wang
MoE
37
21
0
22 Aug 2023
Robust Mixture-of-Expert Training for Convolutional Neural Networks
Yihua Zhang
Ruisi Cai
Tianlong Chen
Guanhua Zhang
Huan Zhang
Pin-Yu Chen
Shiyu Chang
Zhangyang Wang
Sijia Liu
MoE
AAML
OOD
41
16
0
19 Aug 2023
Experts Weights Averaging: A New General Training Scheme for Vision Transformers
Yongqian Huang
Peng Ye
Xiaoshui Huang
Sheng Li
Tao Chen
Tong He
Wanli Ouyang
MoMe
36
8
0
11 Aug 2023
From Sparse to Soft Mixtures of Experts
J. Puigcerver
C. Riquelme
Basil Mustafa
N. Houlsby
MoE
121
114
0
02 Aug 2023
Mixture-of-Domain-Adapters: Decoupling and Injecting Domain Knowledge to Pre-trained Language Models Memories
Shizhe Diao
Tianyang Xu
Ruijia Xu
Jiawei Wang
Tong Zhang
MoE
AI4CE
13
36
0
08 Jun 2023
Patch-level Routing in Mixture-of-Experts is Provably Sample-efficient for Convolutional Neural Networks
Mohammed Nowaz Rabbani Chowdhury
Shuai Zhang
Ming Wang
Sijia Liu
Pin-Yu Chen
MoE
29
18
0
07 Jun 2023
Soft Merging of Experts with Adaptive Routing
Mohammed Muqeeth
Haokun Liu
Colin Raffel
MoMe
MoE
37
46
0
06 Jun 2023
COMET: Learning Cardinality Constrained Mixture of Experts with Trees and Local Search
Shibal Ibrahim
Wenyu Chen
Hussein Hazimeh
Natalia Ponomareva
Zhe Zhao
Rahul Mazumder
MoE
32
3
0
05 Jun 2023
Brainformers: Trading Simplicity for Efficiency
Yan-Quan Zhou
Nan Du
Yanping Huang
Daiyi Peng
Chang Lan
...
Zhifeng Chen
Quoc V. Le
Claire Cui
J. Laudon
J. Dean
MoE
24
23
0
29 May 2023
Emergent Modularity in Pre-trained Transformers
Zhengyan Zhang
Zhiyuan Zeng
Yankai Lin
Chaojun Xiao
Xiaozhi Wang
Xu Han
Zhiyuan Liu
Ruobing Xie
Maosong Sun
Jie Zhou
MoE
47
24
0
28 May 2023
Mixture-of-Experts Meets Instruction Tuning: A Winning Combination for Large Language Models
Sheng Shen
Le Hou
Yan-Quan Zhou
Nan Du
Shayne Longpre
...
Vincent Zhao
Hongkun Yu
Kurt Keutzer
Trevor Darrell
Denny Zhou
ALM
MoE
40
54
0
24 May 2023
Getting MoRE out of Mixture of Language Model Reasoning Experts
Chenglei Si
Weijia Shi
Chen Zhao
Luke Zettlemoyer
Jordan L. Boyd-Graber
LRM
28
23
0
24 May 2023
Towards A Unified View of Sparse Feed-Forward Network in Pretraining Large Language Model
Leo Liu
Tim Dettmers
Xi Lin
Ves Stoyanov
Xian Li
MoE
26
9
0
23 May 2023
Knowledge Card: Filling LLMs' Knowledge Gaps with Plug-in Specialized Language Models
Shangbin Feng
Weijia Shi
Yuyang Bai
Vidhisha Balachandran
Tianxing He
Yulia Tsvetkov
KELM
52
31
0
17 May 2023
Towards Being Parameter-Efficient: A Stratified Sparsely Activated Transformer with Dynamic Capacity
Da Xu
Maha Elbayad
Kenton W. Murray
Jean Maillard
Vedanuj Goswami
MoE
47
3
0
03 May 2023
Conditional Adapters: Parameter-efficient Transfer Learning with Fast Inference
Tao Lei
Junwen Bai
Siddhartha Brahma
Joshua Ainslie
Kenton Lee
...
Vincent Zhao
Yuexin Wu
Bo-wen Li
Yu Zhang
Ming-Wei Chang
BDL
AI4CE
30
55
0
11 Apr 2023
FlexMoE: Scaling Large-scale Sparse Pre-trained Model Training via Dynamic Device Placement
Xiaonan Nie
Xupeng Miao
Zilong Wang
Zichao Yang
Jilong Xue
Lingxiao Ma
Gang-Ming Cao
Bin Cui
MoE
39
44
0
08 Apr 2023
Graph Mixture of Experts: Learning on Large-Scale Graphs with Explicit Diversity Modeling
Haotao Wang
Ziyu Jiang
Yuning You
Yan Han
Gaowen Liu
Jayanth Srinivasa
Ramana Rao Kompella
Zhangyang Wang
36
29
0
06 Apr 2023
Scaling Expert Language Models with Unsupervised Domain Discovery
Suchin Gururangan
Margaret Li
M. Lewis
Weijia Shi
Tim Althoff
Noah A. Smith
Luke Zettlemoyer
MoE
30
46
0
24 Mar 2023
Scaling Vision-Language Models with Sparse Mixture of Experts
Sheng Shen
Z. Yao
Chunyuan Li
Trevor Darrell
Kurt Keutzer
Yuxiong He
VLM
MoE
26
63
0
13 Mar 2023
Sparse MoE as the New Dropout: Scaling Dense and Self-Slimmable Transformers
Tianlong Chen
Zhenyu Zhang
Ajay Jaiswal
Shiwei Liu
Zhangyang Wang
MoE
43
46
0
02 Mar 2023
MixPHM: Redundancy-Aware Parameter-Efficient Tuning for Low-Resource Visual Question Answering
Jingjing Jiang
Nanning Zheng
MoE
40
6
0
02 Mar 2023
Improving Expert Specialization in Mixture of Experts
Yamuna Krishnamurthy
C. Watkins
Thomas Gaertner
MoE
21
7
0
28 Feb 2023
Modular Deep Learning
Jonas Pfeiffer
Sebastian Ruder
Ivan Vulić
Edoardo Ponti
MoMe
OOD
34
73
0
22 Feb 2023
TA-MoE: Topology-Aware Large Scale Mixture-of-Expert Training
Chang-Qin Chen
Min Li
Zhihua Wu
Dianhai Yu
Chao Yang
MoE
21
14
0
20 Feb 2023
Ten Lessons We Have Learned in the New "Sparseland": A Short Handbook for Sparse Neural Network Researchers
Shiwei Liu
Zhangyang Wang
37
30
0
06 Feb 2023
Fast, Differentiable and Sparse Top-k: a Convex Analysis Perspective
Michael E. Sander
J. Puigcerver
Josip Djolonga
Gabriel Peyré
Mathieu Blondel
21
19
0
02 Feb 2023
The Power of External Memory in Increasing Predictive Model Capacity
Cenk Baykal
D. Cutler
Nishanth Dikkala
Nikhil Ghosh
Rina Panigrahy
Xin Wang
KELM
18
0
0
31 Jan 2023
Alternating Updates for Efficient Transformers
Cenk Baykal
D. Cutler
Nishanth Dikkala
Nikhil Ghosh
Rina Panigrahy
Xin Wang
MoE
48
5
0
30 Jan 2023
Gated Self-supervised Learning For Improving Supervised Learning
Erland Hilman Fuadi
Aristo Renaldo Ruslim
Putu Wahyu Kusuma Wardhana
N. Yudistira
SSL
28
0
0
14 Jan 2023
AdaEnsemble: Learning Adaptively Sparse Structured Ensemble Network for Click-Through Rate Prediction
Yachen Yan
Liubo Li
22
3
0
06 Jan 2023
Mod-Squad: Designing Mixture of Experts As Modular Multi-Task Learners
Zitian Chen
Songlin Yang
Mingyu Ding
Zhenfang Chen
Hengshuang Zhao
E. Learned-Miller
Chuang Gan
MoE
26
14
0
15 Dec 2022
Fixing MoE Over-Fitting on Low-Resource Languages in Multilingual Machine Translation
Maha Elbayad
Anna Y. Sun
Shruti Bhosale
MoE
59
9
0
15 Dec 2022
SMILE: Scaling Mixture-of-Experts with Efficient Bi-level Routing
Chaoyang He
Shuai Zheng
Aston Zhang
George Karypis
Trishul Chilimbi
Mahdi Soltanolkotabi
Salman Avestimehr
MoE
28
1
0
10 Dec 2022
Sparse Upcycling: Training Mixture-of-Experts from Dense Checkpoints
Aran Komatsuzaki
J. Puigcerver
James Lee-Thorp
Carlos Riquelme Ruiz
Basil Mustafa
Joshua Ainslie
Yi Tay
Mostafa Dehghani
N. Houlsby
MoMe
MoE
29
109
0
09 Dec 2022
MegaBlocks: Efficient Sparse Training with Mixture-of-Experts
Trevor Gale
Deepak Narayanan
C. Young
Matei A. Zaharia
MoE
30
103
0
29 Nov 2022
Spatial Mixture-of-Experts
Nikoli Dryden
Torsten Hoefler
MoE
36
9
0
24 Nov 2022
SpeechMatrix: A Large-Scale Mined Corpus of Multilingual Speech-to-Speech Translations
Paul-Ambroise Duquenne
Hongyu Gong
Ning Dong
Jingfei Du
Ann Lee
Vedanuj Goswami
Changhan Wang
J. Pino
Benoît Sagot
Holger Schwenk
45
34
0
08 Nov 2022
AdaMix: Mixture-of-Adaptations for Parameter-efficient Model Tuning
Yaqing Wang
Sahaj Agarwal
Subhabrata Mukherjee
Xiaodong Liu
Jing Gao
Ahmed Hassan Awadallah
Jianfeng Gao
MoE
24
118
0
31 Oct 2022