On the Optimization and Generalization of Multi-head Attention
19 October 2023
Puneesh Deora, Rouzbeh Ghaderi, Hossein Taheri, Christos Thrampoulidis
MLT
arXiv · PDF · HTML
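For readers arriving via the citation list, a quick reminder of the object under study: multi-head attention projects the input into several lower-dimensional heads, runs scaled dot-product attention in each head in parallel, and concatenates the head outputs through a final projection. Below is a minimal NumPy sketch of a generic multi-head self-attention forward pass; the function and weight names (`multi_head_attention`, `Wq`, `Wk`, `Wv`, `Wo`) are illustrative, and this is not the specific single-layer model analyzed in the paper.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax along the given axis.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def multi_head_attention(X, Wq, Wk, Wv, Wo, num_heads):
    """Generic multi-head self-attention forward pass.

    X: (seq_len, d_model); Wq, Wk, Wv, Wo: (d_model, d_model).
    """
    seq_len, d_model = X.shape
    d_head = d_model // num_heads

    # Project the input, then split the feature dimension into heads:
    # result shape (num_heads, seq_len, d_head).
    def project_and_split(W):
        return (X @ W).reshape(seq_len, num_heads, d_head).transpose(1, 0, 2)

    Q, K, V = project_and_split(Wq), project_and_split(Wk), project_and_split(Wv)

    # Scaled dot-product attention within each head: (num_heads, seq_len, seq_len).
    scores = Q @ K.transpose(0, 2, 1) / np.sqrt(d_head)
    attn = softmax(scores, axis=-1)
    heads = attn @ V  # (num_heads, seq_len, d_head)

    # Concatenate heads and apply the output projection.
    concat = heads.transpose(1, 0, 2).reshape(seq_len, d_model)
    return concat @ Wo

# Tiny usage example with random weights.
rng = np.random.default_rng(0)
d_model, seq_len, h = 8, 5, 2
X = rng.standard_normal((seq_len, d_model))
Ws = [rng.standard_normal((d_model, d_model)) / np.sqrt(d_model) for _ in range(4)]
out = multi_head_attention(X, *Ws, num_heads=h)
print(out.shape)  # (5, 8)
```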

Papers citing "On the Optimization and Generalization of Multi-head Attention"

8 / 8 papers shown
Title
How Transformers Learn Regular Language Recognition: A Theoretical Study on Training Dynamics and Implicit Bias
How Transformers Learn Regular Language Recognition: A Theoretical Study on Training Dynamics and Implicit Bias
Ruiquan Huang
Yingbin Liang
Jing Yang
46
0
0
02 May 2025
On the Learn-to-Optimize Capabilities of Transformers in In-Context Sparse Recovery
On the Learn-to-Optimize Capabilities of Transformers in In-Context Sparse Recovery
Renpu Liu
Ruida Zhou
Cong Shen
Jing Yang
28
0
0
17 Oct 2024
Implicit Bias and Fast Convergence Rates for Self-attention
Implicit Bias and Fast Convergence Rates for Self-attention
Bhavya Vasudeva
Puneesh Deora
Christos Thrampoulidis
29
13
0
08 Feb 2024
The Expressibility of Polynomial based Attention Scheme
The Expressibility of Polynomial based Attention Scheme
Zhao-quan Song
Guangyi Xu
Junze Yin
32
5
0
30 Oct 2023
Restricted Strong Convexity of Deep Learning Models with Smooth
  Activations
Restricted Strong Convexity of Deep Learning Models with Smooth Activations
A. Banerjee
Pedro Cisneros-Velarde
Libin Zhu
M. Belkin
26
7
0
29 Sep 2022
Stability and Generalization Analysis of Gradient Methods for Shallow
  Neural Networks
Stability and Generalization Analysis of Gradient Methods for Shallow Neural Networks
Yunwen Lei
Rong Jin
Yiming Ying
MLT
37
18
0
19 Sep 2022
A Local Convergence Theory for Mildly Over-Parameterized Two-Layer
  Neural Network
A Local Convergence Theory for Mildly Over-Parameterized Two-Layer Neural Network
Mo Zhou
Rong Ge
Chi Jin
71
44
0
04 Feb 2021
A Decomposable Attention Model for Natural Language Inference
A Decomposable Attention Model for Natural Language Inference
Ankur P. Parikh
Oscar Täckström
Dipanjan Das
Jakob Uszkoreit
201
1,367
0
06 Jun 2016
1