Simplifying Transformer Blocks
Bobby He, Thomas Hofmann
arXiv:2311.01906, 3 November 2023
Papers citing "Simplifying Transformer Blocks" (32 of 32 papers shown)

Do Large Language Models (Really) Need Statistical Foundations?
Weijie Su. 25 May 2025.

Attention layers provably solve single-location regression
Pierre Marion, Raphael Berthier, Gérard Biau, Claire Boyer. 02 Oct 2024.

Spiking-PhysFormer: Camera-Based Remote Photoplethysmography with Parallel Spike-driven Transformer
Mingxuan Liu, Jiankai Tang, Haoxiang Li, Jiahao Qi, Siwei Li, Kegang Wang, Yuntao Wang, Hong Chen. 07 Feb 2024.

The Shaped Transformer: Attention Models in the Infinite Depth-and-Width Limit
Lorenzo Noci, Chuning Li, Mufan Li, Bobby He, Thomas Hofmann, Chris J. Maddison, Daniel M. Roy. 30 Jun 2023.

Deep Transformers without Shortcuts: Modifying Self-attention for Faithful Signal Propagation
Bobby He, James Martens, Guodong Zhang, Aleksandar Botev, Andy Brock, Samuel L. Smith, Yee Whye Teh. 20 Feb 2023.

Width and Depth Limits Commute in Residual Networks
Soufiane Hayou, Greg Yang. 01 Feb 2023.

Pre-training via Denoising for Molecular Property Prediction
Sheheryar Zaidi, Michael Schaarschmidt, James Martens, Hyunjik Kim, Yee Whye Teh, Alvaro Sanchez-Gonzalez, Peter W. Battaglia, Razvan Pascanu, Jonathan Godwin. 31 May 2022.

PaLM: Scaling Language Modeling with Pathways
Aakanksha Chowdhery, Sharan Narang, Jacob Devlin, Maarten Bosma, Gaurav Mishra, ..., Kathy Meier-Hellstern, Douglas Eck, J. Dean, Slav Petrov, Noah Fiedel. 05 Apr 2022.

Training Compute-Optimal Large Language Models
Jordan Hoffmann, Sebastian Borgeaud, A. Mensch, Elena Buchatskaya, Trevor Cai, ..., Karen Simonyan, Erich Elsen, Jack W. Rae, Oriol Vinyals, Laurent Sifre. 29 Mar 2022.

Deep Learning without Shortcuts: Shaping the Kernel with Tailored Rectifiers
Guodong Zhang, Aleksandar Botev, James Martens. 15 Mar 2022.

Anti-Oversmoothing in Deep Vision Transformers via the Fourier Domain Analysis: From Theory to Practice
Peihao Wang, Wenqing Zheng, Tianlong Chen, Zhangyang Wang. 09 Mar 2022.

DeepNet: Scaling Transformers to 1,000 Layers
Hongyu Wang, Shuming Ma, Li Dong, Shaohan Huang, Dongdong Zhang, Furu Wei. 01 Mar 2022.

TrimBERT: Tailoring BERT for Trade-offs
S. N. Sridhar, Anthony Sarah, Sairam Sundaresan. 24 Feb 2022.

Going deeper with Image Transformers
Hugo Touvron, Matthieu Cord, Alexandre Sablayrolles, Gabriel Synnaeve, Hervé Jégou. 31 Mar 2021.

Attention is Not All You Need: Pure Attention Loses Rank Doubly Exponentially with Depth
Yihe Dong, Jean-Baptiste Cordonnier, Andreas Loukas. 05 Mar 2021.

Linear Transformers Are Secretly Fast Weight Programmers
Imanol Schlag, Kazuki Irie, Jürgen Schmidhuber. 22 Feb 2021.

High-Performance Large-Scale Image Recognition Without Normalization
Andrew Brock, Soham De, Samuel L. Smith, Karen Simonyan. 11 Feb 2021.

RepVGG: Making VGG-style ConvNets Great Again
Xiaohan Ding, Xinming Zhang, Ningning Ma, Jungong Han, Guiguang Ding, Jian Sun. 11 Jan 2021.

The Pile: An 800GB Dataset of Diverse Text for Language Modeling
Leo Gao, Stella Biderman, Sid Black, Laurence Golding, Travis Hoppe, ..., Horace He, Anish Thite, Noa Nabeshima, Shawn Presser, Connor Leahy. 31 Dec 2020.

Transformers are RNNs: Fast Autoregressive Transformers with Linear Attention
Angelos Katharopoulos, Apoorv Vyas, Nikolaos Pappas, François Fleuret. 29 Jun 2020.

ReZero is All You Need: Fast Convergence at Large Depth
Thomas C. Bachlechner, Bodhisattwa Prasad Majumder, H. H. Mao, G. Cottrell, Julian McAuley. 10 Mar 2020.

On Layer Normalization in the Transformer Architecture
Ruibin Xiong, Yunchang Yang, Di He, Kai Zheng, Shuxin Zheng, Chen Xing, Huishuai Zhang, Yanyan Lan, Liwei Wang, Tie-Yan Liu. 12 Feb 2020.

Augmenting Self-attention with Persistent Memory
Sainbayar Sukhbaatar, Edouard Grave, Guillaume Lample, Hervé Jégou, Armand Joulin. 02 Jul 2019.

How to Initialize your Network? Robust Initialization for WeightNorm & ResNets
Devansh Arpit, Victor Campos, Yoshua Bengio. 05 Jun 2019.

On the Impact of the Activation Function on Deep Neural Networks Training
Soufiane Hayou, Arnaud Doucet, Judith Rousseau. 19 Feb 2019.
19 Feb 2019
GLUE: A Multi-Task Benchmark and Analysis Platform for Natural Language Understanding
Alex Jinpeng Wang
Amanpreet Singh
Julian Michael
Felix Hill
Omer Levy
Samuel R. Bowman
ELM
1.1K
7,182
0
20 Apr 2018

Deep Neural Networks as Gaussian Processes
Jaehoon Lee, Yasaman Bahri, Roman Novak, S. Schoenholz, Jeffrey Pennington, Jascha Narain Sohl-Dickstein. 01 Nov 2017.

The Shattered Gradients Problem: If resnets are the answer, then what is the question?
David Balduzzi, Marcus Frean, Lennox Leary, J. P. Lewis, Kurt Wan-Duo Ma, Brian McWilliams. 28 Feb 2017.

Language Modeling with Gated Convolutional Networks
Yann N. Dauphin, Angela Fan, Michael Auli, David Grangier. 23 Dec 2016.

Layer Normalization
Jimmy Lei Ba, J. Kiros, Geoffrey E. Hinton. 21 Jul 2016.

Exponential expressivity in deep neural networks through transient chaos
Ben Poole, Subhaneil Lahiri, M. Raghu, Jascha Narain Sohl-Dickstein, Surya Ganguli. 16 Jun 2016.

Exact solutions to the nonlinear dynamics of learning in deep linear neural networks
Andrew M. Saxe, James L. McClelland, Surya Ganguli. 20 Dec 2013.