The Shaped Transformer: Attention Models in the Infinite Depth-and-Width Limit

30 June 2023
Lorenzo Noci, Chuning Li, Mufan Li, Bobby He, Thomas Hofmann, Chris J. Maddison, Daniel M. Roy

Papers citing "The Shaped Transformer: Attention Models in the Infinite Depth-and-Width Limit" (33 papers shown)
Attention-based clustering
Rodrigo Maulen-Soto, Claire Boyer, Pierre Marion
19 May 2025

AltLoRA: Towards Better Gradient Approximation in Low-Rank Adaptation with Alternating Projections
Xin Yu, Yujia Wang, Jinghui Chen, Lingzhou Xue
18 May 2025

Always Skip Attention
Yiping Ji, Hemanth Saratchandran, Peyman Moghaddam, Simon Lucey
04 May 2025

Don't be lazy: CompleteP enables compute-efficient deep transformers
Nolan Dey, Bin Claire Zhang, Lorenzo Noci, Mufan Li, Blake Bordelon, Shane Bergsma, Cengiz Pehlevan, Boris Hanin, Joel Hestness
02 May 2025

Deep Neural Nets as Hamiltonians
Mike Winer, Boris Hanin
31 Mar 2025

Generalized Probabilistic Attention Mechanism in Transformers
DongNyeong Heo, Heeyoul Choi
21 Oct 2024

AERO: Softmax-Only LLMs for Efficient Private Inference
N. Jha, Brandon Reagen
16 Oct 2024

ReLU's Revival: On the Entropic Overload in Normalization-Free Large Language Models
N. Jha, Brandon Reagen
Communities: OffRL, AI4CE
12 Oct 2024

Pretraining Graph Transformers with Atom-in-a-Molecule Quantum Properties for Improved ADMET Modeling
Alessio Fallani, Ramil I. Nugmanov, Jose A. Arjona-Medina, Jörg Kurt Wegner, Alexandre Tkatchenko, Kostiantyn Chernichenko
Communities: MedIm, AI4CE
10 Oct 2024

A Generalization Bound for Nearly-Linear Networks
Eugene Golikov
09 Jul 2024

Federated Learning with Flexible Architectures
Jong-Ik Park, Carlee Joe-Wong
Communities: FedML
14 Jun 2024

The Impact of Initialization on LoRA Finetuning Dynamics
Soufiane Hayou, Nikhil Ghosh, Bin Yu
Communities: AI4CE
12 Jun 2024

Understanding and Minimising Outlier Features in Neural Network Training
Bobby He, Lorenzo Noci, Daniele Paliotta, Imanol Schlag, Thomas Hofmann
29 May 2024

Dissecting the Interplay of Attention Paths in a Statistical Mechanics Theory of Transformers
Lorenzo Tiberi, Francesca Mignacco, Kazuki Irie, H. Sompolinsky
24 May 2024

Infinite Limits of Multi-head Transformer Dynamics
Blake Bordelon, Hamza Tahir Chaudhry, Cengiz Pehlevan
Communities: AI4CE
24 May 2024

Geometric Dynamics of Signal Propagation Predict Trainability of Transformers
Aditya Cowsik, Tamra M. Nebabu, Xiao-Liang Qi, Surya Ganguli
05 Mar 2024

LoRA+: Efficient Low Rank Adaptation of Large Models
Soufiane Hayou, Nikhil Ghosh, Bin Yu
Communities: AI4CE
19 Feb 2024

Attention with Markov: A Framework for Principled Analysis of Transformers via Markov Chains
Ashok Vardhan Makkuva, Marco Bondaschi, Adway Girish, Alliot Nagle, Martin Jaggi, Hyeji Kim, Michael C. Gastpar
Communities: OffRL
06 Feb 2024

A convergence result of a continuous model of deep learning via Łojasiewicz–Simon inequality
Noboru Isobe
26 Nov 2023

Simplifying Transformer Blocks
Bobby He, Thomas Hofmann
03 Nov 2023

Differential Equation Scaling Limits of Shaped and Unshaped Neural Networks
Mufan Li, Mihai Nica
18 Oct 2023

Tensor Programs VI: Feature Learning in Infinite-Depth Neural Networks
Greg Yang, Dingli Yu, Chen Zhu, Soufiane Hayou
Communities: MLT
03 Oct 2023

Towards Training Without Depth Limits: Batch Normalization Without Gradient Explosion
Alexandru Meterez, Amir Joudaki, Francesco Orabona, Alexander Immer, Gunnar Rätsch, Hadi Daneshmand
03 Oct 2023

Commutative Width and Depth Scaling in Deep Neural Networks
Soufiane Hayou
02 Oct 2023

Depthwise Hyperparameter Transfer in Residual Networks: Dynamics and Scaling Limit
Blake Bordelon, Lorenzo Noci, Mufan Li, Boris Hanin, Cengiz Pehlevan
28 Sep 2023

Transformers as Support Vector Machines
Davoud Ataee Tarzanagh, Yingcong Li, Christos Thrampoulidis, Samet Oymak
31 Aug 2023

Stabilizing Transformer Training by Preventing Attention Entropy Collapse
Shuangfei Zhai, Tatiana Likhomanenko, Etai Littwin, Dan Busbridge, Jason Ramapuram, Yizhe Zhang, Jiatao Gu, J. Susskind
Communities: AAML
11 Mar 2023

SGD learning on neural networks: leap complexity and saddle-to-saddle dynamics
Emmanuel Abbe, Enric Boix-Adserà, Theodor Misiakiewicz
Communities: FedML, MLT
21 Feb 2023

Neural Networks Efficiently Learn Low-Dimensional Representations with SGD
Alireza Mousavi-Hosseini, Sejun Park, M. Girotti, Ioannis Mitliagkas, Murat A. Erdogdu
Communities: MLT
29 Sep 2022

Rapid training of deep neural networks without skip connections or normalization layers using Deep Kernel Shaping
James Martens, Andy Ballard, Guillaume Desjardins, G. Swirszcz, Valentin Dalibard, Jascha Narain Sohl-Dickstein, S. Schoenholz
05 Oct 2021

Stable ResNet
Soufiane Hayou, Eugenio Clerico, Bo He, George Deligiannidis, Arnaud Doucet, Judith Rousseau
Communities: ODL, SSeg
24 Oct 2020

Scaling Laws for Neural Language Models
Jared Kaplan, Sam McCandlish, T. Henighan, Tom B. Brown, B. Chess, R. Child, Scott Gray, Alec Radford, Jeff Wu, Dario Amodei
23 Jan 2020

GLUE: A Multi-Task Benchmark and Analysis Platform for Natural Language Understanding
Alex Jinpeng Wang, Amanpreet Singh, Julian Michael, Felix Hill, Omer Levy, Samuel R. Bowman
Communities: ELM
20 Apr 2018