Papers
Communities
Events
Blog
Pricing
Search
Open menu
Home
Papers
2402.02098
Cited By
Self-attention Networks Localize When QK-eigenspectrum Concentrates
3 February 2024
Han Bao
Ryuichiro Hataya
Ryo Karakida
Re-assign community
ArXiv
PDF
HTML
Papers citing
"Self-attention Networks Localize When QK-eigenspectrum Concentrates"
15 / 15 papers shown
Title
Spike No More: Stabilizing the Pre-training of Large Language Models
Sho Takase
Shun Kiyono
Sosuke Kobayashi
Jun Suzuki
45
15
0
28 Dec 2023
Max-Margin Token Selection in Attention Mechanism
Davoud Ataee Tarzanagh
Yingcong Li
Xuechen Zhang
Samet Oymak
60
42
0
23 Jun 2023
Birth of a Transformer: A Memory Viewpoint
A. Bietti
Vivien A. Cabannes
Diane Bouchacourt
Hervé Jégou
Léon Bottou
72
93
0
01 Jun 2023
Scan and Snap: Understanding Training Dynamics and Token Composition in 1-layer Transformer
Yuandong Tian
Yiping Wang
Beidi Chen
S. Du
MLT
50
75
0
25 May 2023
Locating and Editing Factual Associations in GPT
Kevin Meng
David Bau
A. Andonian
Yonatan Belinkov
KELM
201
1,330
0
10 Feb 2022
An Explanation of In-context Learning as Implicit Bayesian Inference
Sang Michael Xie
Aditi Raghunathan
Percy Liang
Tengyu Ma
ReLM
BDL
VPVLM
LRM
175
746
0
03 Nov 2021
Attention is Not All You Need: Pure Attention Loses Rank Doubly Exponentially with Depth
Yihe Dong
Jean-Baptiste Cordonnier
Andreas Loukas
91
383
0
05 Mar 2021
Training data-efficient image transformers & distillation through attention
Hugo Touvron
Matthieu Cord
Matthijs Douze
Francisco Massa
Alexandre Sablayrolles
Hervé Jégou
ViT
341
6,731
0
23 Dec 2020
What they do when in doubt: a study of inductive biases in seq2seq learners
Eugene Kharitonov
Rahma Chaabouni
41
27
0
26 Jun 2020
ReZero is All You Need: Fast Convergence at Large Depth
Thomas C. Bachlechner
Bodhisattwa Prasad Majumder
H. H. Mao
G. Cottrell
Julian McAuley
AI4CE
66
279
0
10 Mar 2020
On Layer Normalization in the Transformer Architecture
Ruibin Xiong
Yunchang Yang
Di He
Kai Zheng
Shuxin Zheng
Chen Xing
Huishuai Zhang
Yanyan Lan
Liwei Wang
Tie-Yan Liu
AI4CE
112
988
0
12 Feb 2020
fairseq: A Fast, Extensible Toolkit for Sequence Modeling
Myle Ott
Sergey Edunov
Alexei Baevski
Angela Fan
Sam Gross
Nathan Ng
David Grangier
Michael Auli
VLM
FaML
95
3,147
0
01 Apr 2019
Pointer Sentinel Mixture Models
Stephen Merity
Caiming Xiong
James Bradbury
R. Socher
RALM
260
2,842
0
26 Sep 2016
Layer Normalization
Jimmy Lei Ba
J. Kiros
Geoffrey E. Hinton
334
10,467
0
21 Jul 2016
Generating Sequences With Recurrent Neural Networks
Alex Graves
GAN
135
4,031
0
04 Aug 2013
1