Self-attention Networks Localize When QK-eigenspectrum Concentrates

arXiv:2402.02098 · 3 February 2024
Han Bao, Ryuichiro Hataya, Ryo Karakida

Papers citing "Self-attention Networks Localize When QK-eigenspectrum Concentrates" (12 of 12 shown)

1. Spike No More: Stabilizing the Pre-training of Large Language Models
   Sho Takase, Shun Kiyono, Sosuke Kobayashi, Jun Suzuki · 28 Dec 2023
2. Max-Margin Token Selection in Attention Mechanism
   Davoud Ataee Tarzanagh, Yingcong Li, Xuechen Zhang, Samet Oymak · 23 Jun 2023
3. Birth of a Transformer: A Memory Viewpoint
   A. Bietti, Vivien A. Cabannes, Diane Bouchacourt, Hervé Jégou, Léon Bottou · 01 Jun 2023
4. Scan and Snap: Understanding Training Dynamics and Token Composition in 1-layer Transformer [MLT]
   Yuandong Tian, Yiping Wang, Beidi Chen, S. Du · 25 May 2023
5. Locating and Editing Factual Associations in GPT [KELM]
   Kevin Meng, David Bau, A. Andonian, Yonatan Belinkov · 10 Feb 2022
6. An Explanation of In-context Learning as Implicit Bayesian Inference [ReLM, BDL, VPVLM, LRM]
   Sang Michael Xie, Aditi Raghunathan, Percy Liang, Tengyu Ma · 03 Nov 2021
7. Attention is Not All You Need: Pure Attention Loses Rank Doubly Exponentially with Depth
   Yihe Dong, Jean-Baptiste Cordonnier, Andreas Loukas · 05 Mar 2021
8. Training data-efficient image transformers & distillation through attention [ViT]
   Hugo Touvron, Matthieu Cord, Matthijs Douze, Francisco Massa, Alexandre Sablayrolles, Hervé Jégou · 23 Dec 2020
9. On Layer Normalization in the Transformer Architecture [AI4CE]
   Ruibin Xiong, Yunchang Yang, Di He, Kai Zheng, Shuxin Zheng, Chen Xing, Huishuai Zhang, Yanyan Lan, Liwei Wang, Tie-Yan Liu · 12 Feb 2020
10. Pointer Sentinel Mixture Models [RALM]
    Stephen Merity, Caiming Xiong, James Bradbury, R. Socher · 26 Sep 2016
11. Layer Normalization
    Jimmy Lei Ba, J. Kiros, Geoffrey E. Hinton · 21 Jul 2016
12. Generating Sequences With Recurrent Neural Networks [GAN]
    Alex Graves · 04 Aug 2013