Learning to Skip for Language Modeling
arXiv:2311.15436 · 26 November 2023
Dewen Zeng, Nan Du, Tao Wang, Yuanzhong Xu, Tao Lei, Zhifeng Chen, Claire Cui
Papers citing "Learning to Skip for Language Modeling" (20 of 20 papers shown)
Adaptive Layer-skipping in Pre-trained LLMs
Xuan Luo, Weizhi Wang, Xifeng Yan · 31 Mar 2025

Relaxed Recursive Transformers: Effective Parameter Sharing with Layer-wise LoRA
Sangmin Bae, Adam Fisch, Hrayr Harutyunyan, Ziwei Ji, Seungyeon Kim, Tal Schuster · 28 Oct 2024 · KELM

The Unreasonable Effectiveness of Fully-Connected Layers for Low-Data Regimes
Peter Kocsis, Peter Súkeník, Guillem Brasó, Matthias Nießner, Laura Leal-Taixé, Ismail Elezi · 11 Oct 2022

Mixture-of-Experts with Expert Choice Routing
Yan-Quan Zhou, Tao Lei, Han-Chu Liu, Nan Du, Yanping Huang, Vincent Zhao, Andrew M. Dai, Zhifeng Chen, Quoc V. Le, James Laudon · 18 Feb 2022 · MoE

DeepSpeed-MoE: Advancing Mixture-of-Experts Inference and Training to Power Next-Generation AI Scale
Samyam Rajbhandari, Conglong Li, Z. Yao, Minjia Zhang, Reza Yazdani Aminabadi, A. A. Awan, Jeff Rasley, Yuxiong He · 14 Jan 2022

Efficient Large Scale Language Modeling with Mixtures of Experts
Mikel Artetxe, Shruti Bhosale, Naman Goyal, Todor Mihaylov, Myle Ott, ..., Jeff Wang, Luke Zettlemoyer, Mona T. Diab, Zornitsa Kozareva, Ves Stoyanov · 20 Dec 2021 · MoE

GLaM: Efficient Scaling of Language Models with Mixture-of-Experts
Nan Du, Yanping Huang, Andrew M. Dai, Simon Tong, Dmitry Lepikhin, ..., Kun Zhang, Quoc V. Le, Yonghui Wu, Zhiwen Chen, Claire Cui · 13 Dec 2021 · ALM, MoE

Hash Layers For Large Sparse Models
Stephen Roller, Sainbayar Sukhbaatar, Arthur Szlam, Jason Weston · 08 Jun 2021 · MoE

Carbon Emissions and Large Neural Network Training
David A. Patterson, Joseph E. Gonzalez, Quoc V. Le, Chen Liang, Lluís-Miquel Munguía, D. Rothchild, David R. So, Maud Texier, J. Dean · 21 Apr 2021 · AI4CE

BASE Layers: Simplifying Training of Large, Sparse Models
M. Lewis, Shruti Bhosale, Tim Dettmers, Naman Goyal, Luke Zettlemoyer · 30 Mar 2021 · MoE

GShard: Scaling Giant Models with Conditional Computation and Automatic Sharding
Dmitry Lepikhin, HyoukJoong Lee, Yuanzhong Xu, Dehao Chen, Orhan Firat, Yanping Huang, M. Krikun, Noam M. Shazeer, Zhiwen Chen · 30 Jun 2020 · MoE

FastBERT: a Self-distilling BERT with Adaptive Inference Time
Weijie Liu, Peng Zhou, Zhe Zhao, Zhiruo Wang, Haotang Deng, Qi Ju · 05 Apr 2020

Reducing Transformer Depth on Demand with Structured Dropout
Angela Fan, Edouard Grave, Armand Joulin · 25 Sep 2019

Megatron-LM: Training Multi-Billion Parameter Language Models Using Model Parallelism
Mohammad Shoeybi, M. Patwary, Raul Puri, P. LeGresley, Jared Casper, Bryan Catanzaro · 17 Sep 2019 · MoE

Batch DropBlock Network for Person Re-identification and Beyond
Zuozhuo Dai, Mingqiang Chen, Xiaodong Gu, Siyu Zhu, Ping Tan · 17 Nov 2018 · OOD

DropBlock: A regularization method for convolutional networks
Golnaz Ghiasi, Nayeon Lee, Quoc V. Le · 30 Oct 2018

SentencePiece: A simple and language independent subword tokenizer and detokenizer for Neural Text Processing
Taku Kudo, John Richardson · 19 Aug 2018

SkipNet: Learning Dynamic Routing in Convolutional Networks
Xin Wang, Feng Yu, Zi-Yi Dou, Trevor Darrell, Joseph E. Gonzalez · 26 Nov 2017

In-Datacenter Performance Analysis of a Tensor Processing Unit
N. Jouppi, C. Young, Nishant Patil, David Patterson, Gaurav Agrawal, ..., Vijay Vasudevan, Richard Walter, Walter Wang, Eric Wilcox, Doe Hyun Yoon · 16 Apr 2017

Estimating or Propagating Gradients Through Stochastic Neurons for Conditional Computation
Yoshua Bengio, Nicholas Léonard, Aaron Courville · 15 Aug 2013