Attention layers provably solve single-location regression
Pierre Marion, Raphael Berthier, Gérard Biau, Claire Boyer
2 October 2024 (arXiv:2410.01537, v2, latest version)

Papers citing "Attention layers provably solve single-location regression"

50 of 51 citing papers shown.
• Asymptotics of SGD in Sequence-Single Index Models and Single-Layer Attention Networks. Luca Arnaboldi, Bruno Loureiro, Ludovic Stephan, Florent Krzakala, Lenka Zdeborová. 03 Jun 2025 (0 citations).
• The emergence of sparse attention: impact of data distribution and benefits of repetition. Nicolas Zucchet, Francesco d'Angelo, Andrew Kyle Lampinen, Stephanie C. Y. Chan. 23 May 2025 (1 citation).
• Attention-based clustering. Rodrigo Maulen-Soto, Claire Boyer, Pierre Marion. 19 May 2025 (0 citations).
• Ordinary Least Squares as an Attention Mechanism. Philippe Goulet Coulombe. 13 Apr 2025 (0 citations).
• Transformers Provably Learn Sparse Token Selection While Fully-Connected Nets Cannot. Zixuan Wang, Stanley Wei, Daniel Hsu, Jason D. Lee. 11 Jun 2024 (16 citations).
• Implicit Diffusion: Efficient Optimization through Stochastic Sampling. Pierre Marion, Anna Korba, Peter Bartlett, Mathieu Blondel, Valentin De Bortoli, Arnaud Doucet, Felipe Llinares-López, Courtney Paquette, Quentin Berthet. 08 Feb 2024 (15 citations).
• Hyperpolyglot LLMs: Cross-Lingual Interpretability in Token Embeddings. Andrea W Wen-Yi, David Mimno. 29 Nov 2023 (16 citations).
• Simplifying Transformer Blocks. Bobby He, Thomas Hofmann. 03 Nov 2023 (36 citations).
• Efficient Streaming Language Models with Attention Sinks. Michel Lang, Yuandong Tian, Beidi Chen, Song Han, Mike Lewis. 29 Sep 2023 (791 citations).
• Vision Transformers Need Registers. Zilong Chen, Maxime Oquab, Julien Mairal, Huaping Liu. 28 Sep 2023 (357 citations).
• Replacing softmax with ReLU in Vision Transformers. Mitchell Wortsman, Jaehoon Lee, Justin Gilmer, Simon Kornblith. 15 Sep 2023 (33 citations).
• Emergent Linear Representations in World Models of Self-Supervised Sequence Models. Neel Nanda, Andrew Lee, Martin Wattenberg. 02 Sep 2023 (186 citations).
• Are Transformers with One Layer Self-Attention Using Low-Rank Weight Matrices Universal Approximators? T. Kajitsuka, Issei Sato. 26 Jul 2023 (18 citations).
• Trained Transformers Learn Linear Models In-Context. Ruiqi Zhang, Spencer Frei, Peter L. Bartlett. 16 Jun 2023 (207 citations).
• Inference-Time Intervention: Eliciting Truthful Answers from a Language Model. Kenneth Li, Oam Patel, Fernanda Viégas, Hanspeter Pfister, Martin Wattenberg. 06 Jun 2023 (584 citations).
• Birth of a Transformer: A Memory Viewpoint. A. Bietti, Vivien A. Cabannes, Diane Bouchacourt, Hervé Jégou, Léon Bottou. 01 Jun 2023 (96 citations).
• Transformers learn to implement preconditioned gradient descent for in-context learning. Kwangjun Ahn, Xiang Cheng, Hadi Daneshmand, S. Sra. 01 Jun 2023 (176 citations).
• Scan and Snap: Understanding Training Dynamics and Token Composition in 1-layer Transformer. Yuandong Tian, Yiping Wang, Beidi Chen, S. Du. 25 May 2023 (79 citations).
• Leveraging the two timescale regime to demonstrate convergence of neural networks. Pierre Marion, Raphael Berthier. 19 Apr 2023 (6 citations).
• How Do Transformers Learn Topic Structure: Towards a Mechanistic Understanding. Yuchen Li, Yuan-Fang Li, Andrej Risteski. 07 Mar 2023 (65 citations).
• Learning time-scales in two-layers neural networks. Raphael Berthier, Andrea Montanari, Kangjie Zhou. 28 Feb 2023 (38 citations).
• Deep Transformers without Shortcuts: Modifying Self-attention for Faithful Signal Propagation. Bobby He, James Martens, Guodong Zhang, Aleksandar Botev, Andy Brock, Samuel L. Smith, Yee Whye Teh. 20 Feb 2023 (30 citations).
• A Study on ReLU and Softmax in Transformer. Kai Shen, Junliang Guo, Xuejiao Tan, Siliang Tang, Rui Wang, Jiang Bian. 13 Feb 2023 (59 citations).
• Transformers learn in-context by gradient descent. J. Oswald, Eyvind Niklasson, E. Randazzo, João Sacramento, A. Mordvintsev, A. Zhmoginov, Max Vladymyrov. 15 Dec 2022 (497 citations).
• Discovering Latent Knowledge in Language Models Without Supervision. Collin Burns, Haotian Ye, Dan Klein, Jacob Steinhardt. 07 Dec 2022 (386 citations).
• Vision Transformers provably learn spatial structure. Samy Jelassi, Michael E. Sander, Yuan-Fang Li. 13 Oct 2022 (83 citations).
• Formal Algorithms for Transformers. Mary Phuong, Marcus Hutter. 19 Jul 2022 (75 citations).
• cosFormer: Rethinking Softmax in Attention. Zhen Qin, Weixuan Sun, Huicai Deng, Dongxu Li, Yunshen Wei, Baohong Lv, Junjie Yan, Lingpeng Kong, Yiran Zhong. 17 Feb 2022 (222 citations).
• Sparse is Enough in Scaling Transformers. Sebastian Jaszczur, Aakanksha Chowdhery, Afroz Mohiuddin, Lukasz Kaiser, Wojciech Gajewski, Henryk Michalewski, Jonni Kanerva. 24 Nov 2021 (102 citations).
• Learned Token Pruning for Transformers. Sehoon Kim, Sheng Shen, D. Thorsley, A. Gholami, Woosuk Kwon, Joseph Hassoun, Kurt Keutzer. 02 Jul 2021 (157 citations).
• A Survey of Transformers. Tianyang Lin, Yuxin Wang, Xiangyang Liu, Xipeng Qiu. 08 Jun 2021 (1,148 citations).
• An Interpretability Illusion for BERT. Tolga Bolukbasi, Adam Pearce, Ann Yuan, Andy Coenen, Emily Reif, Fernanda Viégas, Martin Wattenberg. 14 Apr 2021 (82 citations).
• Intrinsic Dimensionality Explains the Effectiveness of Language Model Fine-Tuning. Armen Aghajanyan, Luke Zettlemoyer, Sonal Gupta. 22 Dec 2020 (577 citations).
• A Two-Timescale Framework for Bilevel Optimization: Complexity Analysis and Application to Actor-Critic. Mingyi Hong, Hoi-To Wai, Zhaoran Wang, Zhuoran Yang. 10 Jul 2020 (141 citations).
• Infinite attention: NNGP and NTK for deep attention networks. Jiri Hron, Yasaman Bahri, Jascha Narain Sohl-Dickstein, Roman Novak. 18 Jun 2020 (116 citations).
• Adaptively Sparse Transformers. Gonçalo M. Correia, Vlad Niculae, André F. T. Martins. 30 Aug 2019 (257 citations).
• What Does BERT Look At? An Analysis of BERT's Attention. Kevin Clark, Urvashi Khandelwal, Omer Levy, Christopher D. Manning. 11 Jun 2019 (1,609 citations).
• Generating Long Sequences with Sparse Transformers. R. Child, Scott Gray, Alec Radford, Ilya Sutskever. 23 Apr 2019 (1,924 citations).
• BERT Post-Training for Review Reading Comprehension and Aspect-based Sentiment Analysis. Hu Xu, Bing-Quan Liu, Lei Shu, Philip S. Yu. 03 Apr 2019 (700 citations).
• Utilizing BERT for Aspect-Based Sentiment Analysis via Constructing Auxiliary Sentence. Chi Sun, Luyao Huang, Xipeng Qiu. 22 Mar 2019 (616 citations).
• Attentional Encoder Network for Targeted Sentiment Classification. Youwei Song, Jiahai Wang, Tao Jiang, Zhiyue Liu, Yanghui Rao. 25 Feb 2019 (278 citations).
• On Lazy Training in Differentiable Programming. Lénaïc Chizat, Edouard Oyallon, Francis R. Bach. 19 Dec 2018 (840 citations).
• BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding. Jacob Devlin, Ming-Wei Chang, Kenton Lee, Kristina Toutanova. 11 Oct 2018 (95,554 citations).
• Attention Is All You Need. Ashish Vaswani, Noam M. Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan Gomez, Lukasz Kaiser, Illia Polosukhin. 12 Jun 2017 (133,279 citations).
• A Regularized Framework for Sparse and Structured Neural Attention. Vlad Niculae, Mathieu Blondel. 22 May 2017 (100 citations).
• Man is to Computer Programmer as Woman is to Homemaker? Debiasing Word Embeddings. Tolga Bolukbasi, Kai-Wei Chang, James Zou, Venkatesh Saligrama, Adam Kalai. 21 Jul 2016 (3,161 citations).
• From Softmax to Sparsemax: A Sparse Model of Attention and Multi-Label Classification. André F. T. Martins, Ramón Fernández Astudillo. 05 Feb 2016 (726 citations).
• End-to-End Attention-based Large Vocabulary Speech Recognition. Dzmitry Bahdanau, J. Chorowski, Dmitriy Serdyuk, Philemon Brakel, Yoshua Bengio. 18 Aug 2015 (1,152 citations).
• Effective Approaches to Attention-based Neural Machine Translation. Thang Luong, Hieu H. Pham, Christopher D. Manning. 17 Aug 2015 (7,976 citations).
• Neural Machine Translation by Jointly Learning to Align and Translate. Dzmitry Bahdanau, Kyunghyun Cho, Yoshua Bengio. 01 Sep 2014 (27,369 citations).