Attention layers provably solve single-location regression
Pierre Marion, Raphael Berthier, Gérard Biau, Claire Boyer
arXiv:2410.01537, 2 October 2024
Papers citing "Attention layers provably solve single-location regression" (50 of 51 papers shown). Each entry gives the title, authors, category tags in parentheses where the listing shows them, the three site counters in brackets as displayed on the page, and the date.
Asymptotics of SGD in Sequence-Single Index Models and Single-Layer Attention Networks. Luca Arnaboldi, Bruno Loureiro, Ludovic Stephan, Florent Krzakala, Lenka Zdeborová. [65 / 0 / 0]. 03 Jun 2025.
The emergence of sparse attention: impact of data distribution and benefits of repetition. Nicolas Zucchet, Francesco d'Angelo, Andrew Kyle Lampinen, Stephanie C. Y. Chan. [214 / 1 / 0]. 23 May 2025.
Attention-based clustering. Rodrigo Maulen-Soto, Claire Boyer, Pierre Marion. [77 / 0 / 0]. 19 May 2025.
Ordinary Least Squares as an Attention Mechanism. Philippe Goulet Coulombe. [67 / 0 / 0]. 13 Apr 2025.
Transformers Provably Learn Sparse Token Selection While Fully-Connected Nets Cannot. Zixuan Wang, Stanley Wei, Daniel Hsu, Jason D. Lee. [59 / 16 / 0]. 11 Jun 2024.
Implicit Diffusion: Efficient Optimization through Stochastic Sampling. Pierre Marion, Anna Korba, Peter Bartlett, Mathieu Blondel, Valentin De Bortoli, Arnaud Doucet, Felipe Llinares-López, Courtney Paquette, Quentin Berthet. [154 / 15 / 0]. 08 Feb 2024.
Hyperpolyglot LLMs: Cross-Lingual Interpretability in Token Embeddings. Andrea W Wen-Yi, David Mimno. [88 / 16 / 0]. 29 Nov 2023.
Simplifying Transformer Blocks. Bobby He, Thomas Hofmann. [109 / 36 / 0]. 03 Nov 2023.
Efficient Streaming Language Models with Attention Sinks. Guangxuan Xiao, Yuandong Tian, Beidi Chen, Song Han, Mike Lewis. (AI4TS, RALM). [165 / 791 / 0]. 29 Sep 2023.
Vision Transformers Need Registers. Timothée Darcet, Maxime Oquab, Julien Mairal, Piotr Bojanowski. (ViT). [201 / 357 / 0]. 28 Sep 2023.
Replacing softmax with ReLU in Vision Transformers. Mitchell Wortsman, Jaehoon Lee, Justin Gilmer, Simon Kornblith. (ViT). [91 / 33 / 0]. 15 Sep 2023.
Emergent Linear Representations in World Models of Self-Supervised Sequence Models. Neel Nanda, Andrew Lee, Martin Wattenberg. (FAtt, MILM). [122 / 186 / 0]. 02 Sep 2023.
Are Transformers with One Layer Self-Attention Using Low-Rank Weight Matrices Universal Approximators? T. Kajitsuka, Issei Sato. [127 / 18 / 0]. 26 Jul 2023.
Trained Transformers Learn Linear Models In-Context. Ruiqi Zhang, Spencer Frei, Peter L. Bartlett. [99 / 207 / 0]. 16 Jun 2023.
Inference-Time Intervention: Eliciting Truthful Answers from a Language Model. Kenneth Li, Oam Patel, Fernanda Viégas, Hanspeter Pfister, Martin Wattenberg. (KELM, HILM). [160 / 584 / 0]. 06 Jun 2023.
Birth of a Transformer: A Memory Viewpoint. A. Bietti, Vivien A. Cabannes, Diane Bouchacourt, Hervé Jégou, Léon Bottou. [112 / 96 / 0]. 01 Jun 2023.
Transformers learn to implement preconditioned gradient descent for in-context learning. Kwangjun Ahn, Xiang Cheng, Hadi Daneshmand, S. Sra. (ODL). [95 / 176 / 0]. 01 Jun 2023.
Scan and Snap: Understanding Training Dynamics and Token Composition in 1-layer Transformer. Yuandong Tian, Yiping Wang, Beidi Chen, S. Du. (MLT). [109 / 79 / 0]. 25 May 2023.
Leveraging the two timescale regime to demonstrate convergence of neural networks. Pierre Marion, Raphael Berthier. [94 / 6 / 0]. 19 Apr 2023.
How Do Transformers Learn Topic Structure: Towards a Mechanistic Understanding. Yuchen Li, Yuanzhi Li, Andrej Risteski. [173 / 65 / 0]. 07 Mar 2023.
Learning time-scales in two-layers neural networks. Raphael Berthier, Andrea Montanari, Kangjie Zhou. [196 / 38 / 0]. 28 Feb 2023.
Deep Transformers without Shortcuts: Modifying Self-attention for Faithful Signal Propagation. Bobby He, James Martens, Guodong Zhang, Aleksandar Botev, Andy Brock, Samuel L. Smith, Yee Whye Teh. [85 / 30 / 0]. 20 Feb 2023.
A Study on ReLU and Softmax in Transformer. Kai Shen, Junliang Guo, Xu Tan, Siliang Tang, Rui Wang, Jiang Bian. [104 / 59 / 0]. 13 Feb 2023.
Transformers learn in-context by gradient descent. J. von Oswald, Eyvind Niklasson, E. Randazzo, João Sacramento, A. Mordvintsev, A. Zhmoginov, Max Vladymyrov. (MLT). [148 / 497 / 0]. 15 Dec 2022.
Discovering Latent Knowledge in Language Models Without Supervision. Collin Burns, Haotian Ye, Dan Klein, Jacob Steinhardt. [163 / 386 / 0]. 07 Dec 2022.
Vision Transformers provably learn spatial structure. Samy Jelassi, Michael E. Sander, Yi Zhang. (ViT, MLT). [100 / 83 / 0]. 13 Oct 2022.
Formal Algorithms for Transformers. Mary Phuong, Marcus Hutter. [62 / 75 / 0]. 19 Jul 2022.
cosFormer: Rethinking Softmax in Attention. Zhen Qin, Weixuan Sun, Huicai Deng, Dongxu Li, Yunshen Wei, Baohong Lv, Junjie Yan, Lingpeng Kong, Yiran Zhong. [97 / 222 / 0]. 17 Feb 2022.
Sparse is Enough in Scaling Transformers. Sebastian Jaszczur, Aakanksha Chowdhery, Afroz Mohiuddin, Lukasz Kaiser, Wojciech Gajewski, Henryk Michalewski, Jonni Kanerva. (MoE). [71 / 102 / 0]. 24 Nov 2021.
Learned Token Pruning for Transformers. Sehoon Kim, Sheng Shen, D. Thorsley, A. Gholami, Woosuk Kwon, Joseph Hassoun, Kurt Keutzer. [86 / 157 / 0]. 02 Jul 2021.
A Survey of Transformers. Tianyang Lin, Yuxin Wang, Xiangyang Liu, Xipeng Qiu. (ViT). [202 / 1,148 / 0]. 08 Jun 2021.
An Interpretability Illusion for BERT. Tolga Bolukbasi, Adam Pearce, Ann Yuan, Andy Coenen, Emily Reif, Fernanda Viégas, Martin Wattenberg. (MILM, FAtt). [103 / 82 / 0]. 14 Apr 2021.
Intrinsic Dimensionality Explains the Effectiveness of Language Model Fine-Tuning. Armen Aghajanyan, Luke Zettlemoyer, Sonal Gupta. [110 / 577 / 1]. 22 Dec 2020.
A Two-Timescale Framework for Bilevel Optimization: Complexity Analysis and Application to Actor-Critic. Mingyi Hong, Hoi-To Wai, Zhaoran Wang, Zhuoran Yang. [86 / 141 / 0]. 10 Jul 2020.
Infinite attention: NNGP and NTK for deep attention networks. Jiri Hron, Yasaman Bahri, Jascha Narain Sohl-Dickstein, Roman Novak. [60 / 116 / 0]. 18 Jun 2020.
Adaptively Sparse Transformers. Gonçalo M. Correia, Vlad Niculae, André F. T. Martins. [133 / 257 / 0]. 30 Aug 2019.
What Does BERT Look At? An Analysis of BERT's Attention. Kevin Clark, Urvashi Khandelwal, Omer Levy, Christopher D. Manning. (MILM). [289 / 1,609 / 0]. 11 Jun 2019.
Generating Long Sequences with Sparse Transformers. R. Child, Scott Gray, Alec Radford, Ilya Sutskever. [142 / 1,924 / 0]. 23 Apr 2019.
BERT Post-Training for Review Reading Comprehension and Aspect-based Sentiment Analysis. Hu Xu, Bing-Quan Liu, Lei Shu, Philip S. Yu. [94 / 700 / 0]. 03 Apr 2019.
Utilizing BERT for Aspect-Based Sentiment Analysis via Constructing Auxiliary Sentence. Chi Sun, Luyao Huang, Xipeng Qiu. [78 / 616 / 0]. 22 Mar 2019.
Attentional Encoder Network for Targeted Sentiment Classification. Youwei Song, Jiahai Wang, Tao Jiang, Zhiyue Liu, Yanghui Rao. [72 / 278 / 0]. 25 Feb 2019.
On Lazy Training in Differentiable Programming. Lénaïc Chizat, Edouard Oyallon, Francis R. Bach. [111 / 840 / 0]. 19 Dec 2018.
BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding. Jacob Devlin, Ming-Wei Chang, Kenton Lee, Kristina Toutanova. (VLM, SSL, SSeg). [1.9K / 95,554 / 0]. 11 Oct 2018.
Attention Is All You Need. Ashish Vaswani, Noam M. Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan Gomez, Lukasz Kaiser, Illia Polosukhin. (3DV). [976 / 133,279 / 0]. 12 Jun 2017.
A Regularized Framework for Sparse and Structured Neural Attention. Vlad Niculae, Mathieu Blondel. [87 / 100 / 0]. 22 May 2017.
Man is to Computer Programmer as Woman is to Homemaker? Debiasing Word Embeddings. Tolga Bolukbasi, Kai-Wei Chang, James Zou, Venkatesh Saligrama, Adam Kalai. (CVBM, FaML). [118 / 3,161 / 0]. 21 Jul 2016.
From Softmax to Sparsemax: A Sparse Model of Attention and Multi-Label Classification. André F. T. Martins, Ramón Fernández Astudillo. [234 / 726 / 0]. 05 Feb 2016.
End-to-End Attention-based Large Vocabulary Speech Recognition. Dzmitry Bahdanau, J. Chorowski, Dmitriy Serdyuk, Philemon Brakel, Yoshua Bengio. [132 / 1,152 / 0]. 18 Aug 2015.
Effective Approaches to Attention-based Neural Machine Translation. Thang Luong, Hieu H. Pham, Christopher D. Manning. [486 / 7,976 / 0]. 17 Aug 2015.
Neural Machine Translation by Jointly Learning to Align and Translate. Dzmitry Bahdanau, Kyunghyun Cho, Yoshua Bengio. (AIMat). [702 / 27,369 / 0]. 01 Sep 2014.