2004.08249
Understanding the Difficulty of Training Transformers
17 April 2020
Liyuan Liu
Xiaodong Liu
Jianfeng Gao
Weizhu Chen
Jiawei Han
AI4CE
Papers citing
"Understanding the Difficulty of Training Transformers"
50 / 63 papers shown
Title
A multilevel approach to accelerate the training of Transformers
Guillaume Lauga
Maël Chaumette
Edgar Desainte-Maréville
Étienne Lasalle
Arthur Lebeurrier
AI4CE
45
0
0
24 Apr 2025
DERD-Net: Learning Depth from Event-based Ray Densities
Diego de Oliveira Hitzges
Suman Ghosh
Guillermo Gallego
3DV
MDE
33
1
0
22 Apr 2025
HybridNorm: Towards Stable and Efficient Transformer Training via Hybrid Normalization
Zhijian Zhuo
Yutao Zeng
Ya Wang
Sijun Zhang
Jian Yang
Xiaoqing Li
Xun Zhou
Jinwen Ma
51
0
0
06 Mar 2025
MUDDFormer: Breaking Residual Bottlenecks in Transformers via Multiway Dynamic Dense Connections
Da Xiao
Qingye Meng
Shengping Li
Xingyuan Yuan
MoE
AI4CE
71
1
0
13 Feb 2025
The Curse of Depth in Large Language Models
Wenfang Sun
Xinyuan Song
Pengxiang Li
Lu Yin
Yefeng Zheng
Shiwei Liu
75
5
0
09 Feb 2025
Identify Critical KV Cache in LLM Inference from an Output Perturbation Perspective
Yuan Feng
Junlin Lv
Yuhang Cao
Xike Xie
S. Kevin Zhou
84
2
0
06 Feb 2025
Efficient Language Modeling for Low-Resource Settings with Hybrid RNN-Transformer Architectures
Gabriel Lindenmaier
Sean Papay
Sebastian Padó
67
0
0
02 Feb 2025
More Expressive Attention with Negative Weights
Ang Lv
Ruobing Xie
Shuaipeng Li
Jiayi Liao
Xingchen Sun
Zhanhui Kang
Di Wang
Rui Yan
42
0
0
11 Nov 2024
What Does It Mean to Be a Transformer? Insights from a Theoretical Hessian Analysis
Weronika Ormaniec
Felix Dangel
Sidak Pal Singh
41
7
0
14 Oct 2024
Hyper-Connections
Defa Zhu
Hongzhi Huang
Zihao Huang
Yutao Zeng
Yunyao Mao
Banggu Wu
Qiyang Min
Xun Zhou
41
4
0
29 Sep 2024
Sampling Foundational Transformer: A Theoretical Perspective
Viet Anh Nguyen
Minh Lenhat
Khoa Nguyen
Duong Duc Hieu
Dao Huu Hung
Truong-Son Hy
48
0
0
11 Aug 2024
Advancing Neural Network Performance through Emergence-Promoting Initialization Scheme
Johnny Jingze Li
V. George
Gabriel A. Silva
ODL
44
0
0
26 Jul 2024
Dynamic Anisotropic Smoothing for Noisy Derivative-Free Optimization
S. Reifenstein
T. Leleu
Yoshihisa Yamamoto
52
1
0
02 May 2024
Heavy-Tailed Class Imbalance and Why Adam Outperforms Gradient Descent on Language Models
Frederik Kunstner
Robin Yadav
Alan Milligan
Mark Schmidt
Alberto Bietti
49
26
0
29 Feb 2024
Principled Architecture-aware Scaling of Hyperparameters
Wuyang Chen
Junru Wu
Zhangyang Wang
Boris Hanin
AI4CE
49
0
0
27 Feb 2024
Zero-Shot Reinforcement Learning via Function Encoders
Tyler Ingebrand
Amy Zhang
Ufuk Topcu
OffRL
45
3
0
30 Jan 2024
Right, No Matter Why: AI Fact-checking and AI Authority in Health-related Inquiry Settings
Elena Sergeeva
Anastasia Sergeeva
Huiyun Tang
Kerstin Bongard-Blanchy
Peter Szolovits
27
1
0
22 Oct 2023
Transformers in Reinforcement Learning: A Survey
Pranav Agarwal
A. Rahman
P. St-Charles
Simon J. D. Prince
Samira Ebrahimi Kahou
OffRL
35
19
0
12 Jul 2023
Centered Self-Attention Layers
Ameen Ali
Tomer Galanti
Lior Wolf
51
6
0
02 Jun 2023
Fine-Tuning Language Models with Just Forward Passes
Sadhika Malladi
Tianyu Gao
Eshaan Nichani
Alexandru Damian
Jason D. Lee
Danqi Chen
Sanjeev Arora
43
180
0
27 May 2023
Sophia: A Scalable Stochastic Second-order Optimizer for Language Model Pre-training
Hong Liu
Zhiyuan Li
David Leo Wright Hall
Percy Liang
Tengyu Ma
VLM
57
132
0
23 May 2023
Two Sides of One Coin: the Limits of Untuned SGD and the Power of Adaptive Methods
Junchi Yang
Xiang Li
Ilyas Fatkhullin
Niao He
47
15
0
21 May 2023
Multi-Path Transformer is Better: A Case Study on Neural Machine Translation
Ye Lin
Shuhan Zhou
Yanyang Li
Anxiang Ma
Tong Xiao
Jingbo Zhu
38
0
0
10 May 2023
Convex Dual Theory Analysis of Two-Layer Convolutional Neural Networks with Soft-Thresholding
Chunyan Xiong
Meng Lu
Xiaotong Yu
JIAN-PENG Cao
Zhong Chen
D. Guo
X. Qu
MLT
43
0
0
14 Apr 2023
Transformers in Speech Processing: A Survey
S. Latif
Aun Zaidi
Heriberto Cuayáhuitl
Fahad Shamshad
Moazzam Shoukat
Junaid Qadir
46
47
0
21 Mar 2023
Bayesian Networks for Named Entity Prediction in Programming Community Question Answering
Alexey Gorbatovski
Sergey Kovalchuk
19
2
0
26 Feb 2023
Mnemosyne: Learning to Train Transformers with Transformers
Deepali Jain
K. Choromanski
Kumar Avinava Dubey
Sumeet Singh
Vikas Sindhwani
Tingnan Zhang
Jie Tan
OffRL
44
9
0
02 Feb 2023
Efficient Long Sequence Modeling via State Space Augmented Transformer
Simiao Zuo
Xiaodong Liu
Jian Jiao
Denis Xavier Charles
Eren Manavoglu
Tuo Zhao
Jianfeng Gao
130
36
0
15 Dec 2022
ToDD: Topological Compound Fingerprinting in Computer-Aided Drug Discovery
Andac Demir
Baris Coskunuzer
I. Segovia-Dominguez
Yuzhou Chen
Yulia R. Gel
B. Kiziltan
20
16
0
07 Nov 2022
MetaFormer Baselines for Vision
Weihao Yu
Chenyang Si
Pan Zhou
Mi Luo
Yichen Zhou
Jiashi Feng
Shuicheng Yan
Xinchao Wang
MoE
40
158
0
24 Oct 2022
A Kernel-Based View of Language Model Fine-Tuning
Sadhika Malladi
Alexander Wettig
Dingli Yu
Danqi Chen
Sanjeev Arora
VLM
78
63
0
11 Oct 2022
Born for Auto-Tagging: Faster and better with new objective functions
Chiung-ju Liu
Huang-Ting Shieh
30
1
0
15 Jun 2022
METRO: Efficient Denoising Pretraining of Large Scale Autoencoding Language Models with Model Generated Signals
Payal Bajaj
Chenyan Xiong
Guolin Ke
Xiaodong Liu
Di He
Saurabh Tiwary
Tie-Yan Liu
Paul N. Bennett
Xia Song
Jianfeng Gao
52
32
0
13 Apr 2022
Solving ImageNet: a Unified Scheme for Training any Backbone to Top Results
T. Ridnik
Hussam Lawen
Emanuel Ben-Baruch
Asaf Noy
45
11
0
07 Apr 2022
Scaling Up Your Kernels to 31x31: Revisiting Large Kernel Design in CNNs
Xiaohan Ding
Xinming Zhang
Yi Zhou
Jungong Han
Guiguang Ding
Jian Sun
VLM
49
529
0
13 Mar 2022
Tensor Programs V: Tuning Large Neural Networks via Zero-Shot Hyperparameter Transfer
Greg Yang
J. E. Hu
Igor Babuschkin
Szymon Sidor
Xiaodong Liu
David Farhi
Nick Ryder
J. Pachocki
Weizhu Chen
Jianfeng Gao
31
149
0
07 Mar 2022
DeepNet: Scaling Transformers to 1,000 Layers
Hongyu Wang
Shuming Ma
Li Dong
Shaohan Huang
Dongdong Zhang
Furu Wei
MoE
AI4CE
36
157
0
01 Mar 2022
ActionFormer: Localizing Moments of Actions with Transformers
Chenlin Zhang
Jianxin Wu
Yin Li
ViT
31
333
0
16 Feb 2022
Evaluating natural language processing models with generalization metrics that do not need access to any training or testing data
Yaoqing Yang
Ryan Theisen
Liam Hodgkinson
Joseph E. Gonzalez
Kannan Ramchandran
Charles H. Martin
Michael W. Mahoney
94
17
0
06 Feb 2022
Robust Training of Neural Networks Using Scale Invariant Architectures
Zhiyuan Li
Srinadh Bhojanapalli
Manzil Zaheer
Sashank J. Reddi
Sanjiv Kumar
29
27
0
02 Feb 2022
Are Transformers More Robust Than CNNs?
Yutong Bai
Jieru Mei
Alan Yuille
Cihang Xie
ViT
AAML
195
258
0
10 Nov 2021
NormFormer: Improved Transformer Pretraining with Extra Normalization
Sam Shleifer
Jason Weston
Myle Ott
AI4CE
33
74
0
18 Oct 2021
A Loss Curvature Perspective on Training Instability in Deep Learning
Justin Gilmer
Behrooz Ghorbani
Ankush Garg
Sneha Kudugunta
Behnam Neyshabur
David E. Cardoze
George E. Dahl
Zachary Nado
Orhan Firat
ODL
36
35
0
08 Oct 2021
Taming Sparsely Activated Transformer with Stochastic Experts
Simiao Zuo
Xiaodong Liu
Jian Jiao
Young Jin Kim
Hany Hassan
Ruofei Zhang
T. Zhao
Jianfeng Gao
MoE
44
109
0
08 Oct 2021
Puzzle Solving without Search or Human Knowledge: An Unnatural Language Approach
David Noever
Ryerson Burdick
ReLM
176
7
0
07 Sep 2021
Is attention to bounding boxes all you need for pedestrian action prediction?
Lina Achaji
Julien Moreau
Thibault Fouqueray
François Aioun
François Charpillet
23
30
0
16 Jul 2021
Stabilizing Equilibrium Models by Jacobian Regularization
Shaojie Bai
V. Koltun
J. Zico Kolter
33
57
0
28 Jun 2021
Revisiting Deep Learning Models for Tabular Data
Yu. V. Gorishniy
Ivan Rubachev
Valentin Khrulkov
Artem Babenko
LMTD
48
703
0
22 Jun 2021
Multi-head or Single-head? An Empirical Comparison for Transformer Training
Liyuan Liu
Jialu Liu
Jiawei Han
23
32
0
17 Jun 2021
A Survey of Transformers
Tianyang Lin
Yuxin Wang
Xiangyang Liu
Xipeng Qiu
ViT
53
1,089
0
08 Jun 2021