arXiv: 2302.01107
A Survey on Efficient Training of Transformers
2 February 2023
Bohan Zhuang, Jing Liu, Zizheng Pan, Haoyu He, Yuetian Weng, Chunhua Shen
Papers citing "A Survey on Efficient Training of Transformers" (50 of 84 papers shown)
Does Self-Attention Need Separate Weights in Transformers?
Md. Kowsher, Nusrat Jahan Prottasha, Chun-Nam Yu, O. Garibay, Niloofar Yousefi
440 · 0 · 0 · 30 Nov 2024
Continuous Speech Tokenizer in Text To Speech
Yixing Li, Ruobing Xie, Xingwu Sun, Yu Cheng, Zhanhui Kang
AuLLM, CLL · 81 · 2 · 0 · 22 Oct 2024
Which Transformer to Favor: A Comparative Analysis of Efficiency in Vision Transformers
Tobias Christian Nauen, Sebastián M. Palacio, Federico Raue, Andreas Dengel
81 · 4 · 0 · 18 Aug 2023
LLaMA: Open and Efficient Foundation Language Models
Hugo Touvron, Thibaut Lavril, Gautier Izacard, Xavier Martinet, Marie-Anne Lachaux, ..., Faisal Azhar, Aurelien Rodriguez, Armand Joulin, Edouard Grave, Guillaume Lample
ALM, PILM · 995 · 12,840 · 0 · 27 Feb 2023
Symbolic Discovery of Optimization Algorithms
Xiangning Chen, Chen Liang, Da Huang, Esteban Real, Kaiyuan Wang, ..., Xuanyi Dong, Thang Luong, Cho-Jui Hsieh, Yifeng Lu, Quoc V. Le
125 · 367 · 0 · 13 Feb 2023
Scaling Language-Image Pre-training via Masking
Yanghao Li, Haoqi Fan, Ronghang Hu, Christoph Feichtenhofer, Kaiming He
CLIP, VLM · 63 · 321 · 0 · 01 Dec 2022
ViTCoD: Vision Transformer Acceleration via Dedicated Algorithm and Accelerator Co-Design
Haoran You, Zhanyi Sun, Huihong Shi, Zhongzhi Yu, Yang Zhao, Yongan Zhang, Chaojian Li, Baopu Li, Yingyan Lin
ViT · 60 · 82 · 0 · 18 Oct 2022
Stop Wasting My Time! Saving Days of ImageNet and BERT Training with Latest Weight Averaging
Jean Kaddour
MoMe, 3DH · 47 · 41 · 0 · 29 Sep 2022
Beyond neural scaling laws: beating power law scaling via data pruning
Ben Sorscher, Robert Geirhos, Shashank Shekhar, Surya Ganguli, Ari S. Morcos
71 · 436 · 0 · 29 Jun 2022
GACT: Activation Compressed Training for Generic Network Architectures
Xiaoxuan Liu, Lianmin Zheng, Dequan Wang, Yukuo Cen, Weize Chen, ..., Zhiyuan Liu, Jie Tang, Joey Gonzalez, Michael W. Mahoney, Alvin Cheung
VLM, GNN, MQ · 52 · 32 · 0 · 22 Jun 2022
FlashAttention: Fast and Memory-Efficient Exact Attention with IO-Awareness
Tri Dao, Daniel Y. Fu, Stefano Ermon, Atri Rudra, Christopher Ré
VLM · 183 · 2,131 · 0 · 27 May 2022
Sharpness-Aware Training for Free
Jiawei Du, Daquan Zhou, Jiashi Feng, Vincent Y. F. Tan, Qiufeng Wang
AAML · 61 · 94 · 0 · 27 May 2022
Training Compute-Optimal Large Language Models
Jordan Hoffmann, Sebastian Borgeaud, A. Mensch, Elena Buchatskaya, Trevor Cai, ..., Karen Simonyan, Erich Elsen, Jack W. Rae, Oriol Vinyals, Laurent Sifre
AI4TS · 146 · 1,915 · 0 · 29 Mar 2022
Automated Progressive Learning for Efficient Training of Vision Transformers
Changlin Li, Bohan Zhuang, Guangrun Wang, Xiaodan Liang, Xiaojun Chang, Yi Yang
67 · 46 · 0 · 28 Mar 2022
Visual Prompt Tuning
Menglin Jia, Luming Tang, Bor-Chun Chen, Claire Cardie, Serge Belongie, Bharath Hariharan, Ser-Nam Lim
VLM, VPVLM · 126 · 1,576 · 0 · 23 Mar 2022
Using DeepSpeed and Megatron to Train Megatron-Turing NLG 530B, A Large-Scale Generative Language Model
Shaden Smith, M. Patwary, Brandon Norick, P. LeGresley, Samyam Rajbhandari, ..., Mohammad Shoeybi, Yuxiong He, Michael Houston, Saurabh Tiwary, Bryan Catanzaro
MoE · 141 · 737 · 0 · 28 Jan 2022
Mesa: A Memory-saving Training Framework for Transformers
Zizheng Pan, Peng Chen, Haoyu He, Jing Liu, Jianfei Cai, Bohan Zhuang
54 · 20 · 0 · 22 Nov 2021
Masked Autoencoders Are Scalable Vision Learners
Kaiming He, Xinlei Chen, Saining Xie, Yanghao Li, Piotr Dollár, Ross B. Girshick
ViT, TPM · 388 · 7,600 · 0 · 11 Nov 2021
Colossal-AI: A Unified Deep Learning System For Large-Scale Parallel Training
Yongbin Li, Hongxin Liu, Zhengda Bian, Boxiang Wang, Haichen Huang, Fan Cui, Chuan-Qing Wang, Yang You
GNN · 53 · 144 · 0 · 28 Oct 2021
Efficient Sharpness-aware Minimization for Improved Training of Neural Networks
Jiawei Du, Hanshu Yan, Jiashi Feng, Qiufeng Wang, Liangli Zhen, Rick Siow Mong Goh, Vincent Y. F. Tan
AAML · 128 · 134 · 0 · 07 Oct 2021
Deep Learning on a Data Diet: Finding Important Examples Early in Training
Mansheej Paul, Surya Ganguli, Gintare Karolina Dziugaite
100 · 446 · 0 · 15 Jul 2021
AutoFormer: Searching Transformers for Visual Recognition
Minghao Chen, Houwen Peng, Jianlong Fu, Haibin Ling
ViT · 76 · 262 · 0 · 01 Jul 2021
BitFit: Simple Parameter-efficient Fine-tuning for Transformer-based Masked Language-models
Elad Ben-Zaken, Shauli Ravfogel, Yoav Goldberg
151 · 1,191 · 0 · 18 Jun 2021
LoRA: Low-Rank Adaptation of Large Language Models
J. E. Hu, Yelong Shen, Phillip Wallis, Zeyuan Allen-Zhu, Yuanzhi Li, Shean Wang, Lu Wang, Weizhu Chen
OffRL, AI4TS, AI4CE, ALM, AIMat · 312 · 10,099 · 0 · 17 Jun 2021
BEiT: BERT Pre-Training of Image Transformers
Hangbo Bao, Li Dong, Songhao Piao, Furu Wei
ViT · 195 · 2,790 · 0 · 15 Jun 2021
Chasing Sparsity in Vision Transformers: An End-to-End Exploration
Tianlong Chen, Yu Cheng, Zhe Gan, Lu Yuan, Lei Zhang, Zhangyang Wang
ViT · 46 · 216 · 0 · 08 Jun 2021
When Vision Transformers Outperform ResNets without Pre-training or Strong Data Augmentations
Xiangning Chen, Cho-Jui Hsieh, Boqing Gong
ViT · 79 · 324 · 0 · 03 Jun 2021
ActNN: Reducing Training Memory Footprint via 2-Bit Activation Compressed Training
Jianfei Chen, Lianmin Zheng, Z. Yao, Dequan Wang, Ion Stoica, Michael W. Mahoney, Joseph E. Gonzalez
MQ · 56 · 74 · 0 · 29 Apr 2021
The Power of Scale for Parameter-Efficient Prompt Tuning
Brian Lester, Rami Al-Rfou, Noah Constant
VPVLM · 480 · 3,952 · 0 · 18 Apr 2021
Accelerating Sparse Deep Neural Networks
Asit K. Mishra, J. Latorre, Jeff Pool, Darko Stosic, Dusan Stosic, Ganesh Venkatesh, Chong Yu, Paulius Micikevicius
115 · 228 · 0 · 16 Apr 2021
Efficient Large-Scale Language Model Training on GPU Clusters Using Megatron-LM
Deepak Narayanan, Mohammad Shoeybi, Jared Casper, P. LeGresley, M. Patwary, ..., Prethvi Kashinkunti, J. Bernauer, Bryan Catanzaro, Amar Phanishayee, Matei A. Zaharia
MoE · 74 · 667 · 0 · 09 Apr 2021
Going deeper with Image Transformers
Hugo Touvron, Matthieu Cord, Alexandre Sablayrolles, Gabriel Synnaeve, Hervé Jégou
ViT · 125 · 998 · 0 · 31 Mar 2021
ConViT: Improving Vision Transformers with Soft Convolutional Inductive Biases
Stéphane d'Ascoli, Hugo Touvron, Matthew L. Leavitt, Ari S. Morcos, Giulio Biroli, Levent Sagun
ViT · 104 · 818 · 0 · 19 Mar 2021
Learning Transferable Visual Models From Natural Language Supervision
Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya A. Ramesh, Gabriel Goh, ..., Amanda Askell, Pamela Mishkin, Jack Clark, Gretchen Krueger, Ilya Sutskever
CLIP, VLM · 767 · 28,659 · 0 · 26 Feb 2021
ZeRO-Offload: Democratizing Billion-Scale Model Training
Jie Ren, Samyam Rajbhandari, Reza Yazdani Aminabadi, Olatunji Ruwase, Shuangyang Yang, Minjia Zhang, Dong Li, Yuxiong He
MoE · 242 · 424 · 0 · 18 Jan 2021
Switch Transformers: Scaling to Trillion Parameter Models with Simple and Efficient Sparsity
W. Fedus, Barret Zoph, Noam M. Shazeer
MoE · 70 · 2,136 · 0 · 11 Jan 2021
EarlyBERT: Efficient BERT Training via Early-bird Lottery Tickets
Xiaohan Chen, Yu Cheng, Shuohang Wang, Zhe Gan, Zhangyang Wang, Jingjing Liu
73 · 100 · 0 · 31 Dec 2020
Training data-efficient image transformers & distillation through attention
Hugo Touvron, Matthieu Cord, Matthijs Douze, Francisco Massa, Alexandre Sablayrolles, Hervé Jégou
ViT · 333 · 6,657 · 0 · 23 Dec 2020
Intrinsic Dimensionality Explains the Effectiveness of Language Model Fine-Tuning
Armen Aghajanyan, Luke Zettlemoyer, Sonal Gupta
84 · 549 · 1 · 22 Dec 2020
Accelerating Training of Transformer-Based Language Models with Progressive Layer Dropping
Minjia Zhang, Yuxiong He
AI4CE · 39 · 101 · 0 · 26 Oct 2020
On the Transformer Growth for Progressive BERT Training
Xiaotao Gu, Liyuan Liu, Hongkun Yu, Jing Li, Chong Chen, Jiawei Han
VLM · 85 · 51 · 0 · 23 Oct 2020
Sharpness-Aware Minimization for Efficiently Improving Generalization
Pierre Foret, Ariel Kleiner, H. Mobahi, Behnam Neyshabur
AAML · 176 · 1,323 · 0 · 03 Oct 2020
MCUNet: Tiny Deep Learning on IoT Devices
Ji Lin, Wei-Ming Chen, Chengyue Wu, J. Cohn, Chuang Gan, Song Han
130 · 485 · 0 · 20 Jul 2020
Language Models are Few-Shot Learners
Tom B. Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared Kaplan, ..., Christopher Berner, Sam McCandlish, Alec Radford, Ilya Sutskever, Dario Amodei
BDL · 548 · 41,106 · 0 · 28 May 2020
torchgpipe: On-the-fly Pipeline Parallelism for Training Giant Models
Chiheon Kim, Heungsub Lee, Myungryong Jeong, Woonhyuk Baek, Boogeon Yoon, Ildoo Kim, Sungbin Lim, Sungwoong Kim
MoE, AI4CE · 42 · 53 · 0 · 21 Apr 2020
ReZero is All You Need: Fast Convergence at Large Depth
Thomas C. Bachlechner, Bodhisattwa Prasad Majumder, H. H. Mao, G. Cottrell, Julian McAuley
AI4CE · 61 · 277 · 0 · 10 Mar 2020
Train Large, Then Compress: Rethinking Model Size for Efficient Training and Inference of Transformers
Zhuohan Li, Eric Wallace, Sheng Shen, Kevin Lin, Kurt Keutzer, Dan Klein, Joseph E. Gonzalez
85 · 150 · 0 · 26 Feb 2020
A³: Accelerating Attention Mechanisms in Neural Networks with Approximation
Tae Jun Ham, Sungjun Jung, Seonghak Kim, Young H. Oh, Yeonhong Park, ..., Jung-Hun Park, Sanghee Lee, Kyoung Park, Jae W. Lee, D. Jeong
50 · 217 · 0 · 22 Feb 2020
Shifted and Squeezed 8-bit Floating Point format for Low-Precision Training of Deep Neural Networks
Léopold Cambier, Anahita Bhiwandiwalla, Ting Gong, M. Nekuii, Oguz H. Elibol, Hanlin Tang
MQ · 85 · 48 · 0 · 16 Jan 2020
Why are Adaptive Methods Good for Attention Models?
J.N. Zhang, Sai Praneeth Karimireddy, Andreas Veit, Seungyeon Kim, Sashank J. Reddi, Sanjiv Kumar, S. Sra
79 · 80 · 0 · 06 Dec 2019