ResearchTrend.AI
© 2025 ResearchTrend.AI, All rights reserved.

arXiv:2410.02247 (Cited By)
Theoretical Insights into Fine-Tuning Attention Mechanism: Generalization and Optimization

3 October 2024
Xinhao Yao
Hongjin Qian
Xiaolin Hu
Gengze Xu
Wei Liu
Jian Luan
Bin Wang
Yang Liu

Papers citing "Theoretical Insights into Fine-Tuning Attention Mechanism: Generalization and Optimization"

46 / 46 papers shown
The Impact of Initialization on LoRA Finetuning Dynamics
Soufiane Hayou
Nikhil Ghosh
Bin Yu
AI4CE
52
19
0
12 Jun 2024
Enhancing In-Context Learning Performance with just SVD-Based Weight Pruning: A Theoretical Perspective
Xinhao Yao
Xiaolin Hu
Shenzhi Yang
Yong Liu
59
2
0
06 Jun 2024
Sparse Matrix in Large Language Model Fine-tuning
Haoze He
Juncheng Billy Li
Xuan Jiang
Heather Miller
MoE
43
3
0
24 May 2024
Asymmetry in Low-Rank Adapters of Foundation Models
Jiacheng Zhu
Kristjan Greenewald
Kimia Nadjahi
Haitz Sáez de Ocáriz Borde
Rickard Brüel-Gabrielsson
Leshem Choshen
Marzyeh Ghassemi
Mikhail Yurochkin
Justin Solomon
61
30
0
26 Feb 2024
LoRA+: Efficient Low Rank Adaptation of Large Models
Soufiane Hayou
Nikhil Ghosh
Bin Yu
AI4CE
64
161
0
19 Feb 2024
DoRA: Weight-Decomposed Low-Rank Adaptation
Shih-yang Liu
Chien-Yi Wang
Hongxu Yin
Pavlo Molchanov
Yu-Chiang Frank Wang
Kwang-Ting Cheng
Min-Hung Chen
57
374
0
14 Feb 2024
Accurate LoRA-Finetuning Quantization of LLMs via Information Retention
Haotong Qin
Xudong Ma
Xingyu Zheng
Xiaoyang Li
Yang Zhang
Shouda Liu
Jie Luo
Xianglong Liu
Michele Magno
MQ
31
37
0
08 Feb 2024
Self-attention Networks Localize When QK-eigenspectrum Concentrates
Han Bao
Ryuichiro Hataya
Ryo Karakida
25
5
0
03 Feb 2024
Tensor Programs VI: Feature Learning in Infinite-Depth Neural Networks
Greg Yang
Dingli Yu
Chen Zhu
Soufiane Hayou
MLT
49
31
0
03 Oct 2023
MetaMath: Bootstrap Your Own Mathematical Questions for Large Language Models
L. Yu
Weisen Jiang
Han Shi
Jincheng Yu
Zhengying Liu
Yu Zhang
James T. Kwok
Zheng Li
Adrian Weller
Weiyang Liu
OSLM
LRM
57
363
0
21 Sep 2023
Max-Margin Token Selection in Attention Mechanism
Davoud Ataee Tarzanagh
Yingcong Li
Xuechen Zhang
Samet Oymak
56
42
0
23 Jun 2023
Transformers learn to implement preconditioned gradient descent for in-context learning
Kwangjun Ahn
Xiang Cheng
Hadi Daneshmand
S. Sra
ODL
59
159
0
01 Jun 2023
Grokking of Hierarchical Structure in Vanilla Transformers
Shikhar Murty
Pratyusha Sharma
Jacob Andreas
Christopher D. Manning
51
46
0
30 May 2023
Scan and Snap: Understanding Training Dynamics and Token Composition in 1-layer Transformer
Yuandong Tian
Yiping Wang
Beidi Chen
S. Du
MLT
38
74
0
25 May 2023
Scaling Down to Scale Up: A Guide to Parameter-Efficient Fine-Tuning
Vladislav Lialin
Vijeta Deshpande
Anna Rumshisky
52
170
0
28 Mar 2023
How Do Transformers Learn Topic Structure: Towards a Mechanistic Understanding
Yuchen Li
Yuanzhi Li
Andrej Risteski
137
63
0
07 Mar 2023
Deep Transformers without Shortcuts: Modifying Self-attention for Faithful Signal Propagation
Bobby He
James Martens
Guodong Zhang
Aleksandar Botev
Andy Brock
Samuel L. Smith
Yee Whye Teh
33
30
0
20 Feb 2023
Two Facets of SDE Under an Information-Theoretic Lens: Generalization of SGD via Training Trajectories and via Terminal States
Ziqiao Wang
Yongyi Mao
43
10
0
19 Nov 2022
On the infinite-depth limit of finite-width neural networks
Soufiane Hayou
43
22
0
03 Oct 2022
Signal Propagation in Transformers: Theoretical Perspectives and the Role of Rank Collapse
Lorenzo Noci
Sotiris Anagnostidis
Luca Biggio
Antonio Orvieto
Sidak Pal Singh
Aurelien Lucchi
72
70
0
07 Jun 2022
Training Compute-Optimal Large Language Models
Jordan Hoffmann
Sebastian Borgeaud
A. Mensch
Elena Buchatskaya
Trevor Cai
...
Karen Simonyan
Erich Elsen
Jack W. Rae
Oriol Vinyals
Laurent Sifre
AI4TS
100
1,894
0
29 Mar 2022
Tensor Programs V: Tuning Large Neural Networks via Zero-Shot Hyperparameter Transfer
Greg Yang
J. E. Hu
Igor Babuschkin
Szymon Sidor
Xiaodong Liu
David Farhi
Nick Ryder
J. Pachocki
Weizhu Chen
Jianfeng Gao
55
155
0
07 Mar 2022
Locating and Editing Factual Associations in GPT
Kevin Meng
David Bau
A. Andonian
Yonatan Belinkov
KELM
121
1,275
0
10 Feb 2022
Training Verifiers to Solve Math Word Problems
K. Cobbe
V. Kosaraju
Mohammad Bavarian
Mark Chen
Heewoo Jun
...
Jerry Tworek
Jacob Hilton
Reiichiro Nakano
Christopher Hesse
John Schulman
ReLM
OffRL
LRM
174
4,175
0
27 Oct 2021
Towards a Unified View of Parameter-Efficient Transfer Learning
Junxian He
Chunting Zhou
Xuezhe Ma
Taylor Berg-Kirkpatrick
Graham Neubig
AAML
74
911
0
08 Oct 2021
Noisy Channel Language Model Prompting for Few-Shot Text Classification
Sewon Min
Michael Lewis
Hannaneh Hajishirzi
Luke Zettlemoyer
VLM
57
219
0
09 Aug 2021
LoRA: Low-Rank Adaptation of Large Language Models
J. E. Hu
Yelong Shen
Phillip Wallis
Zeyuan Allen-Zhu
Yuanzhi Li
Shean Wang
Lu Wang
Weizhu Chen
OffRL
AI4TS
AI4CE
ALM
AIMat
173
9,946
0
17 Jun 2021
The Power of Scale for Parameter-Efficient Prompt Tuning
Brian Lester
Rami Al-Rfou
Noah Constant
VPVLM
426
3,952
0
18 Apr 2021
Prefix-Tuning: Optimizing Continuous Prompts for Generation
Xiang Lisa Li
Percy Liang
148
4,167
0
01 Jan 2021
Training data-efficient image transformers & distillation through attention
Hugo Touvron
Matthieu Cord
Matthijs Douze
Francisco Massa
Alexandre Sablayrolles
Hervé Jégou
ViT
238
6,657
0
23 Dec 2020
An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale
Alexey Dosovitskiy
Lucas Beyer
Alexander Kolesnikov
Dirk Weissenborn
Xiaohua Zhai
...
Matthias Minderer
G. Heigold
Sylvain Gelly
Jakob Uszkoreit
N. Houlsby
ViT
180
40,217
0
22 Oct 2020
SlimIPL: Language-Model-Free Iterative Pseudo-Labeling
Tatiana Likhomanenko
Qiantong Xu
Jacob Kahn
Gabriel Synnaeve
R. Collobert
VLM
70
61
0
22 Oct 2020
What they do when in doubt: a study of inductive biases in seq2seq learners
Eugene Kharitonov
Rahma Chaabouni
24
27
0
26 Jun 2020
Language Models are Few-Shot Learners
Tom B. Brown
Benjamin Mann
Nick Ryder
Melanie Subbiah
Jared Kaplan
...
Christopher Berner
Sam McCandlish
Alec Radford
Ilya Sutskever
Dario Amodei
BDL
385
41,106
0
28 May 2020
Reasoning About Generalization via Conditional Mutual Information
Thomas Steinke
Lydia Zakynthinou
69
162
0
24 Jan 2020
RoBERTa: A Robustly Optimized BERT Pretraining Approach
Yinhan Liu
Myle Ott
Naman Goyal
Jingfei Du
Mandar Joshi
Danqi Chen
Omer Levy
M. Lewis
Luke Zettlemoyer
Veselin Stoyanov
AIMat
348
24,160
0
26 Jul 2019
On the Impact of the Activation Function on Deep Neural Networks Training
Soufiane Hayou
Arnaud Doucet
Judith Rousseau
ODL
40
197
0
19 Feb 2019
Scaling Limits of Wide Neural Networks with Weight Sharing: Gaussian Process Behavior, Gradient Independence, and Neural Tangent Kernel Derivation
Greg Yang
67
286
0
13 Feb 2019
Parameter-Efficient Transfer Learning for NLP
N. Houlsby
A. Giurgiu
Stanislaw Jastrzebski
Bruna Morrone
Quentin de Laroussilhe
Andrea Gesmundo
Mona Attariyan
Sylvain Gelly
154
4,368
0
02 Feb 2019
GLUE: A Multi-Task Benchmark and Analysis Platform for Natural Language Understanding
Alex Wang
Amanpreet Singh
Julian Michael
Felix Hill
Omer Levy
Samuel R. Bowman
ELM
553
7,080
0
20 Apr 2018
Attention Is All You Need
Ashish Vaswani
Noam M. Shazeer
Niki Parmar
Jakob Uszkoreit
Llion Jones
Aidan Gomez
Lukasz Kaiser
Illia Polosukhin
3DV
319
129,831
0
12 Jun 2017
Information-theoretic analysis of generalization capability of learning algorithms
Aolin Xu
Maxim Raginsky
79
442
0
22 May 2017
Deep Information Propagation
S. Schoenholz
Justin Gilmer
Surya Ganguli
Jascha Narain Sohl-Dickstein
39
363
0
04 Nov 2016
Deep Residual Learning for Image Recognition
Kaiming He
Xiangyu Zhang
Shaoqing Ren
Jian Sun
MedIm
1.1K
192,638
0
10 Dec 2015
How much does your data exploration overfit? Controlling bias via information usage
D. Russo
James Zou
26
189
0
16 Nov 2015
Generating Sequences With Recurrent Neural Networks
Alex Graves
GAN
83
4,025
0
04 Aug 2013