Learning to Skip for Language Modeling
arXiv:2311.15436 · 26 November 2023
Dewen Zeng, Nan Du, Tao Wang, Yuanzhong Xu, Tao Lei, Zhifeng Chen, Claire Cui
Papers citing "Learning to Skip for Language Modeling" (20 of 20 papers shown)
Adaptive Layer-skipping in Pre-trained LLMs
Xuan Luo, Weizhi Wang, Xifeng Yan · 31 Mar 2025

Relaxed Recursive Transformers: Effective Parameter Sharing with Layer-wise LoRA
Sangmin Bae, Adam Fisch, Hrayr Harutyunyan, Ziwei Ji, Seungyeon Kim, Tal Schuster · 28 Oct 2024 · KELM

The Unreasonable Effectiveness of Fully-Connected Layers for Low-Data Regimes
Peter Kocsis, Peter Súkeník, Guillem Brasó, Matthias Nießner, Laura Leal-Taixé, Ismail Elezi · 11 Oct 2022

Mixture-of-Experts with Expert Choice Routing
Yan-Quan Zhou, Tao Lei, Han-Chu Liu, Nan Du, Yanping Huang, Vincent Zhao, Andrew M. Dai, Zhifeng Chen, Quoc V. Le, James Laudon · 18 Feb 2022 · MoE

DeepSpeed-MoE: Advancing Mixture-of-Experts Inference and Training to Power Next-Generation AI Scale
Samyam Rajbhandari, Conglong Li, Z. Yao, Minjia Zhang, Reza Yazdani Aminabadi, A. A. Awan, Jeff Rasley, Yuxiong He · 14 Jan 2022

Efficient Large Scale Language Modeling with Mixtures of Experts
Mikel Artetxe, Shruti Bhosale, Naman Goyal, Todor Mihaylov, Myle Ott, ..., Jeff Wang, Luke Zettlemoyer, Mona T. Diab, Zornitsa Kozareva, Ves Stoyanov · 20 Dec 2021 · MoE

GLaM: Efficient Scaling of Language Models with Mixture-of-Experts
Nan Du, Yanping Huang, Andrew M. Dai, Simon Tong, Dmitry Lepikhin, ..., Kun Zhang, Quoc V. Le, Yonghui Wu, Zhiwen Chen, Claire Cui · 13 Dec 2021 · ALM, MoE

Hash Layers For Large Sparse Models
Stephen Roller, Sainbayar Sukhbaatar, Arthur Szlam, Jason Weston · 08 Jun 2021 · MoE

Carbon Emissions and Large Neural Network Training
David A. Patterson, Joseph E. Gonzalez, Quoc V. Le, Chen Liang, Lluís-Miquel Munguía, D. Rothchild, David R. So, Maud Texier, J. Dean · 21 Apr 2021 · AI4CE

BASE Layers: Simplifying Training of Large, Sparse Models
M. Lewis, Shruti Bhosale, Tim Dettmers, Naman Goyal, Luke Zettlemoyer · 30 Mar 2021 · MoE

GShard: Scaling Giant Models with Conditional Computation and Automatic Sharding
Dmitry Lepikhin, HyoukJoong Lee, Yuanzhong Xu, Dehao Chen, Orhan Firat, Yanping Huang, M. Krikun, Noam M. Shazeer, Zhiwen Chen · 30 Jun 2020 · MoE

FastBERT: a Self-distilling BERT with Adaptive Inference Time
Weijie Liu, Peng Zhou, Zhe Zhao, Zhiruo Wang, Haotang Deng, Qi Ju · 05 Apr 2020

Reducing Transformer Depth on Demand with Structured Dropout
Angela Fan, Edouard Grave, Armand Joulin · 25 Sep 2019

Megatron-LM: Training Multi-Billion Parameter Language Models Using Model Parallelism
Mohammad Shoeybi, M. Patwary, Raul Puri, P. LeGresley, Jared Casper, Bryan Catanzaro · 17 Sep 2019 · MoE

Batch DropBlock Network for Person Re-identification and Beyond
Zuozhuo Dai, Mingqiang Chen, Xiaodong Gu, Siyu Zhu, Ping Tan · 17 Nov 2018 · OOD

DropBlock: A regularization method for convolutional networks
Golnaz Ghiasi, Nayeon Lee, Quoc V. Le · 30 Oct 2018

SentencePiece: A simple and language independent subword tokenizer and detokenizer for Neural Text Processing
Taku Kudo, John Richardson · 19 Aug 2018

SkipNet: Learning Dynamic Routing in Convolutional Networks
Xin Wang, Feng Yu, Zi-Yi Dou, Trevor Darrell, Joseph E. Gonzalez · 26 Nov 2017

In-Datacenter Performance Analysis of a Tensor Processing Unit
N. Jouppi, C. Young, Nishant Patil, David Patterson, Gaurav Agrawal, ..., Vijay Vasudevan, Richard Walter, Walter Wang, Eric Wilcox, Doe Hyun Yoon · 16 Apr 2017

Estimating or Propagating Gradients Through Stochastic Neurons for Conditional Computation
Yoshua Bengio, Nicholas Léonard, Aaron Courville · 15 Aug 2013