Deep Transformers without Shortcuts: Modifying Self-attention for Faithful Signal Propagation
Bobby He, James Martens, Guodong Zhang, Aleksandar Botev, Andy Brock, Samuel L. Smith, Yee Whye Teh
arXiv:2302.10322, 20 February 2023

Papers citing "Deep Transformers without Shortcuts: Modifying Self-attention for Faithful Signal Propagation" (17 papers)

Always Skip Attention
Yiping Ji, Hemanth Saratchandran, Peyman Moghaddam, Simon Lucey
04 May 2025

Revisiting Transformers through the Lens of Low Entropy and Dynamic Sparsity
Ruifeng Ren, Yong Liu
26 Apr 2025

The Geometry of Tokens in Internal Representations of Large Language Models [AIFin]
Karthik Viswanathan, Yuri Gardinazzi, Giada Panerai, Alberto Cazzaniga, Matteo Biagetti
17 Jan 2025

Lambda-Skip Connections: the architectural component that prevents Rank Collapse
Federico Arangath Joseph, Jerome Sieber, M. Zeilinger, Carmen Amo Alonso
14 Oct 2024

ReLU's Revival: On the Entropic Overload in Normalization-Free Large Language Models [OffRL, AI4CE]
N. Jha, Brandon Reagen
12 Oct 2024

Theoretical Insights into Fine-Tuning Attention Mechanism: Generalization and Optimization
Xinhao Yao, Hongjin Qian, Xiaolin Hu, Gengze Xu, Wei Liu, Jian Luan, Bin Wang, Yong-Jin Liu
03 Oct 2024

Attention layers provably solve single-location regression
P. Marion, Raphael Berthier, Gérard Biau, Claire Boyer
02 Oct 2024

The Impact of Initialization on LoRA Finetuning Dynamics [AI4CE]
Soufiane Hayou, Nikhil Ghosh, Bin Yu
12 Jun 2024

Understanding and Minimising Outlier Features in Neural Network Training
Bobby He, Lorenzo Noci, Daniele Paliotta, Imanol Schlag, Thomas Hofmann
29 May 2024

Transformer tricks: Removing weights for skipless transformers
Nils Graef
18 Apr 2024

Heavy-Tailed Class Imbalance and Why Adam Outperforms Gradient Descent on Language Models
Frederik Kunstner, Robin Yadav, Alan Milligan, Mark Schmidt, Alberto Bietti
29 Feb 2024

LoRA+: Efficient Low Rank Adaptation of Large Models [AI4CE]
Soufiane Hayou, Nikhil Ghosh, Bin Yu
19 Feb 2024

Mimetic Initialization of Self-Attention Layers
Asher Trockman, J. Zico Kolter
16 May 2023

Rapid training of deep neural networks without skip connections or normalization layers using Deep Kernel Shaping
James Martens, Andy Ballard, Guillaume Desjardins, G. Swirszcz, Valentin Dalibard, Jascha Narain Sohl-Dickstein, S. Schoenholz
05 Oct 2021

Train Short, Test Long: Attention with Linear Biases Enables Input Length Extrapolation
Ofir Press, Noah A. Smith, M. Lewis
27 Aug 2021

Stable ResNet [ODL, SSeg]
Soufiane Hayou, Eugenio Clerico, Bo He, George Deligiannidis, Arnaud Doucet, Judith Rousseau
24 Oct 2020

Dynamical Isometry and a Mean Field Theory of CNNs: How to Train 10,000-Layer Vanilla Convolutional Neural Networks
Lechao Xiao, Yasaman Bahri, Jascha Narain Sohl-Dickstein, S. Schoenholz, Jeffrey Pennington
14 Jun 2018