On the Optimization and Generalization of Multi-head Attention
19 October 2023
Puneesh Deora, Rouzbeh Ghaderi, Hossein Taheri, Christos Thrampoulidis
MLT
arXiv · PDF · HTML
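For readers arriving via the citation list, a quick reminder of the object under study: multi-head attention projects the input into several lower-dimensional heads, runs scaled dot-product attention in each head in parallel, and concatenates the head outputs through a final projection. Below is a minimal NumPy sketch of a generic multi-head self-attention forward pass; the function and weight names (`multi_head_attention`, `Wq`, `Wk`, `Wv`, `Wo`) are illustrative, and this is not the specific single-layer model analyzed in the paper.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax along the given axis.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def multi_head_attention(X, Wq, Wk, Wv, Wo, num_heads):
    """Generic multi-head self-attention forward pass.

    X: (seq_len, d_model); Wq, Wk, Wv, Wo: (d_model, d_model).
    """
    seq_len, d_model = X.shape
    d_head = d_model // num_heads

    # Project the input, then split the feature dimension into heads:
    # result shape (num_heads, seq_len, d_head).
    def project_and_split(W):
        return (X @ W).reshape(seq_len, num_heads, d_head).transpose(1, 0, 2)

    Q, K, V = project_and_split(Wq), project_and_split(Wk), project_and_split(Wv)

    # Scaled dot-product attention within each head: (num_heads, seq_len, seq_len).
    scores = Q @ K.transpose(0, 2, 1) / np.sqrt(d_head)
    attn = softmax(scores, axis=-1)
    heads = attn @ V  # (num_heads, seq_len, d_head)

    # Concatenate heads and apply the output projection.
    concat = heads.transpose(1, 0, 2).reshape(seq_len, d_model)
    return concat @ Wo

# Tiny usage example with random weights.
rng = np.random.default_rng(0)
d_model, seq_len, h = 8, 5, 2
X = rng.standard_normal((seq_len, d_model))
Ws = [rng.standard_normal((d_model, d_model)) / np.sqrt(d_model) for _ in range(4)]
out = multi_head_attention(X, *Ws, num_heads=h)
print(out.shape)  # (5, 8)
```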

Papers citing "On the Optimization and Generalization of Multi-head Attention"

8 / 8 papers shown
Title
How Transformers Learn Regular Language Recognition: A Theoretical Study on Training Dynamics and Implicit Bias
How Transformers Learn Regular Language Recognition: A Theoretical Study on Training Dynamics and Implicit Bias
Ruiquan Huang
Yingbin Liang
Jing Yang
46
0
0
02 May 2025
On the Learn-to-Optimize Capabilities of Transformers in In-Context Sparse Recovery
On the Learn-to-Optimize Capabilities of Transformers in In-Context Sparse Recovery
Renpu Liu
Ruida Zhou
Cong Shen
Jing Yang
28
0
0
17 Oct 2024
Implicit Bias and Fast Convergence Rates for Self-attention
Implicit Bias and Fast Convergence Rates for Self-attention
Bhavya Vasudeva
Puneesh Deora
Christos Thrampoulidis
29
13
0
08 Feb 2024
The Expressibility of Polynomial based Attention Scheme
The Expressibility of Polynomial based Attention Scheme
Zhao-quan Song
Guangyi Xu
Junze Yin
32
5
0
30 Oct 2023
Restricted Strong Convexity of Deep Learning Models with Smooth
  Activations
Restricted Strong Convexity of Deep Learning Models with Smooth Activations
A. Banerjee
Pedro Cisneros-Velarde
Libin Zhu
M. Belkin
26
7
0
29 Sep 2022
Stability and Generalization Analysis of Gradient Methods for Shallow
  Neural Networks
Stability and Generalization Analysis of Gradient Methods for Shallow Neural Networks
Yunwen Lei
Rong Jin
Yiming Ying
MLT
37
18
0
19 Sep 2022
A Local Convergence Theory for Mildly Over-Parameterized Two-Layer
  Neural Network
A Local Convergence Theory for Mildly Over-Parameterized Two-Layer Neural Network
Mo Zhou
Rong Ge
Chi Jin
71
44
0
04 Feb 2021
A Decomposable Attention Model for Natural Language Inference
A Decomposable Attention Model for Natural Language Inference
Ankur P. Parikh
Oscar Täckström
Dipanjan Das
Jakob Uszkoreit
201
1,367
0
06 Jun 2016
1