Simplifying Transformer Blocks

Bobby He, Thomas Hofmann (3 November 2023)
arXiv: 2311.01906

Papers citing "Simplifying Transformer Blocks"

32 papers shown
Do Large Language Models (Really) Need Statistical Foundations?
Weijie Su (25 May 2025)
Attention layers provably solve single-location regression
Pierre Marion, Raphael Berthier, Gérard Biau, Claire Boyer (02 Oct 2024)
Spiking-PhysFormer: Camera-Based Remote Photoplethysmography with Parallel Spike-driven Transformer
Mingxuan Liu, Jiankai Tang, Haoxiang Li, Jiahao Qi, Siwei Li, Kegang Wang, Yuntao Wang, Hong Chen (07 Feb 2024)
The Shaped Transformer: Attention Models in the Infinite Depth-and-Width Limit
Lorenzo Noci, Chuning Li, Mufan Li, Bobby He, Thomas Hofmann, Chris J. Maddison, Daniel M. Roy (30 Jun 2023)
Deep Transformers without Shortcuts: Modifying Self-attention for Faithful Signal Propagation
Bobby He, James Martens, Guodong Zhang, Aleksandar Botev, Andy Brock, Samuel L. Smith, Yee Whye Teh (20 Feb 2023)
Width and Depth Limits Commute in Residual Networks
Soufiane Hayou, Greg Yang (01 Feb 2023)
Pre-training via Denoising for Molecular Property Prediction
Sheheryar Zaidi, Michael Schaarschmidt, James Martens, Hyunjik Kim, Yee Whye Teh, Alvaro Sanchez-Gonzalez, Peter W. Battaglia, Razvan Pascanu, Jonathan Godwin (31 May 2022)
PaLM: Scaling Language Modeling with Pathways
Aakanksha Chowdhery, Sharan Narang, Jacob Devlin, Maarten Bosma, Gaurav Mishra, ..., Kathy Meier-Hellstern, Douglas Eck, J. Dean, Slav Petrov, Noah Fiedel (05 Apr 2022)
Training Compute-Optimal Large Language Models
Jordan Hoffmann, Sebastian Borgeaud, A. Mensch, Elena Buchatskaya, Trevor Cai, ..., Karen Simonyan, Erich Elsen, Jack W. Rae, Oriol Vinyals, Laurent Sifre (29 Mar 2022)
Deep Learning without Shortcuts: Shaping the Kernel with Tailored Rectifiers
Guodong Zhang, Aleksandar Botev, James Martens (15 Mar 2022)
Anti-Oversmoothing in Deep Vision Transformers via the Fourier Domain Analysis: From Theory to Practice
Peihao Wang, Wenqing Zheng, Tianlong Chen, Zhangyang Wang (09 Mar 2022)
DeepNet: Scaling Transformers to 1,000 Layers
Hongyu Wang, Shuming Ma, Li Dong, Shaohan Huang, Dongdong Zhang, Furu Wei (01 Mar 2022)
TrimBERT: Tailoring BERT for Trade-offs
S. N. Sridhar, Anthony Sarah, Sairam Sundaresan (24 Feb 2022)
Going deeper with Image Transformers
Hugo Touvron, Matthieu Cord, Alexandre Sablayrolles, Gabriel Synnaeve, Hervé Jégou (31 Mar 2021)
Attention is Not All You Need: Pure Attention Loses Rank Doubly Exponentially with Depth
Yihe Dong, Jean-Baptiste Cordonnier, Andreas Loukas (05 Mar 2021)
Linear Transformers Are Secretly Fast Weight Programmers
Imanol Schlag, Kazuki Irie, Jürgen Schmidhuber (22 Feb 2021)
High-Performance Large-Scale Image Recognition Without Normalization
Andrew Brock, Soham De, Samuel L. Smith, Karen Simonyan (11 Feb 2021)
RepVGG: Making VGG-style ConvNets Great Again
Xiaohan Ding, Xinming Zhang, Ningning Ma, Jungong Han, Guiguang Ding, Jian Sun (11 Jan 2021)
The Pile: An 800GB Dataset of Diverse Text for Language Modeling
Leo Gao, Stella Biderman, Sid Black, Laurence Golding, Travis Hoppe, ..., Horace He, Anish Thite, Noa Nabeshima, Shawn Presser, Connor Leahy (31 Dec 2020)
Transformers are RNNs: Fast Autoregressive Transformers with Linear Attention
Angelos Katharopoulos, Apoorv Vyas, Nikolaos Pappas, François Fleuret (29 Jun 2020)
ReZero is All You Need: Fast Convergence at Large Depth
Thomas C. Bachlechner, Bodhisattwa Prasad Majumder, H. H. Mao, G. Cottrell, Julian McAuley (10 Mar 2020)
On Layer Normalization in the Transformer Architecture
Ruibin Xiong, Yunchang Yang, Di He, Kai Zheng, Shuxin Zheng, Chen Xing, Huishuai Zhang, Yanyan Lan, Liwei Wang, Tie-Yan Liu (12 Feb 2020)
Augmenting Self-attention with Persistent Memory
Sainbayar Sukhbaatar, Edouard Grave, Guillaume Lample, Hervé Jégou, Armand Joulin (02 Jul 2019)
How to Initialize your Network? Robust Initialization for WeightNorm & ResNets
Devansh Arpit, Victor Campos, Yoshua Bengio (05 Jun 2019)
On the Impact of the Activation Function on Deep Neural Networks Training
Soufiane Hayou, Arnaud Doucet, Judith Rousseau (19 Feb 2019)
GLUE: A Multi-Task Benchmark and Analysis Platform for Natural Language Understanding
Alex Wang, Amanpreet Singh, Julian Michael, Felix Hill, Omer Levy, Samuel R. Bowman (20 Apr 2018)
Deep Neural Networks as Gaussian Processes
Jaehoon Lee, Yasaman Bahri, Roman Novak, S. Schoenholz, Jeffrey Pennington, Jascha Narain Sohl-Dickstein (01 Nov 2017)
The Shattered Gradients Problem: If resnets are the answer, then what is the question?
David Balduzzi, Marcus Frean, Lennox Leary, J. P. Lewis, Kurt Wan-Duo Ma, Brian McWilliams (28 Feb 2017)
Language Modeling with Gated Convolutional Networks
Yann N. Dauphin, Angela Fan, Michael Auli, David Grangier (23 Dec 2016)
Layer Normalization
Jimmy Lei Ba, J. Kiros, Geoffrey E. Hinton (21 Jul 2016)
Exponential expressivity in deep neural networks through transient chaos
Ben Poole, Subhaneil Lahiri, M. Raghu, Jascha Narain Sohl-Dickstein, Surya Ganguli (16 Jun 2016)
Exact solutions to the nonlinear dynamics of learning in deep linear neural networks
Andrew M. Saxe, James L. McClelland, Surya Ganguli (20 Dec 2013)