Deep Transformers without Shortcuts: Modifying Self-attention for Faithful Signal Propagation
Bobby He, James Martens, Guodong Zhang, Aleksandar Botev, Andy Brock, Samuel L. Smith, Yee Whye Teh
arXiv:2302.10322, 20 February 2023

Papers citing "Deep Transformers without Shortcuts: Modifying Self-attention for Faithful Signal Propagation" (17 papers)

Always Skip Attention
Yiping Ji, Hemanth Saratchandran, Peyman Moghaddam, Simon Lucey
04 May 2025

Revisiting Transformers through the Lens of Low Entropy and Dynamic Sparsity
Ruifeng Ren, Yong Liu
26 Apr 2025

The Geometry of Tokens in Internal Representations of Large Language Models [AIFin]
Karthik Viswanathan, Yuri Gardinazzi, Giada Panerai, Alberto Cazzaniga, Matteo Biagetti
17 Jan 2025

Lambda-Skip Connections: the architectural component that prevents Rank Collapse
Federico Arangath Joseph, Jerome Sieber, M. Zeilinger, Carmen Amo Alonso
14 Oct 2024

ReLU's Revival: On the Entropic Overload in Normalization-Free Large Language Models [OffRL, AI4CE]
N. Jha, Brandon Reagen
12 Oct 2024

Theoretical Insights into Fine-Tuning Attention Mechanism: Generalization and Optimization
Xinhao Yao, Hongjin Qian, Xiaolin Hu, Gengze Xu, Wei Liu, Jian Luan, Bin Wang, Yong-Jin Liu
03 Oct 2024

Attention layers provably solve single-location regression
P. Marion, Raphael Berthier, Gérard Biau, Claire Boyer
02 Oct 2024

The Impact of Initialization on LoRA Finetuning Dynamics [AI4CE]
Soufiane Hayou, Nikhil Ghosh, Bin Yu
12 Jun 2024

Understanding and Minimising Outlier Features in Neural Network Training
Bobby He, Lorenzo Noci, Daniele Paliotta, Imanol Schlag, Thomas Hofmann
29 May 2024

Transformer tricks: Removing weights for skipless transformers
Nils Graef
18 Apr 2024

Heavy-Tailed Class Imbalance and Why Adam Outperforms Gradient Descent on Language Models
Frederik Kunstner, Robin Yadav, Alan Milligan, Mark Schmidt, Alberto Bietti
29 Feb 2024

LoRA+: Efficient Low Rank Adaptation of Large Models [AI4CE]
Soufiane Hayou, Nikhil Ghosh, Bin Yu
19 Feb 2024

Mimetic Initialization of Self-Attention Layers
Asher Trockman, J. Zico Kolter
16 May 2023

Rapid training of deep neural networks without skip connections or normalization layers using Deep Kernel Shaping
James Martens, Andy Ballard, Guillaume Desjardins, G. Swirszcz, Valentin Dalibard, Jascha Narain Sohl-Dickstein, S. Schoenholz
05 Oct 2021

Train Short, Test Long: Attention with Linear Biases Enables Input Length Extrapolation
Ofir Press, Noah A. Smith, M. Lewis
27 Aug 2021

Stable ResNet [ODL, SSeg]
Soufiane Hayou, Eugenio Clerico, Bo He, George Deligiannidis, Arnaud Doucet, Judith Rousseau
24 Oct 2020

Dynamical Isometry and a Mean Field Theory of CNNs: How to Train 10,000-Layer Vanilla Convolutional Neural Networks
Lechao Xiao, Yasaman Bahri, Jascha Narain Sohl-Dickstein, S. Schoenholz, Jeffrey Pennington
14 Jun 2018