Birth of a Transformer: A Memory Viewpoint
Alberto Bietti, Vivien A. Cabannes, Diane Bouchacourt, Hervé Jégou, Léon Bottou
arXiv:2306.00802 · 1 June 2023
Papers citing "Birth of a Transformer: A Memory Viewpoint" (showing 50 of 69)
Understanding Input Selectivity in Mamba: Impact on Approximation Power, Memorization, and Associative Recall Capacity
Ningyuan Huang
Miguel Sarabia
Abhinav Moudgil
P. Rodríguez
Luca Zappella
Federico Danieli
Mamba
77
0
0
13 Jun 2025
Generalization or Hallucination? Understanding Out-of-Context Reasoning in Transformers
Yixiao Huang
Hanlin Zhu
Tianyu Guo
Jiantao Jiao
Somayeh Sojoudi
Michael I. Jordan
Stuart Russell
Song Mei
LRM
165
0
0
12 Jun 2025
Pre-trained Large Language Models Learn Hidden Markov Models In-context
Yijia Dai
Zhaolin Gao
Yahya Sattar
Sarah Dean
Jennifer J. Sun
60
0
0
08 Jun 2025
Attention Retrieves, MLP Memorizes: Disentangling Trainable Components in the Transformer
Yihe Dong
Lorenzo Noci
Mikhail Khodak
Mufan Li
76
1
0
01 Jun 2025
LoLA: Low-Rank Linear Attention With Sparse Caching
Luke McDermott
Robert W. Heath Jr.
Rahul Parhi
RALM
67
0
0
29 May 2025
ATLAS: Learning to Optimally Memorize the Context at Test Time
Ali Behrouz
Zeman Li
Praneeth Kacham
Majid Daliri
Yuan Deng
Peilin Zhong
Meisam Razaviyayn
Vahab Mirrokni
112
2
0
29 May 2025
The emergence of sparse attention: impact of data distribution and benefits of repetition
Nicolas Zucchet
Francesco d'Angelo
Andrew Kyle Lampinen
Stephanie C. Y. Chan
224
1
0
23 May 2025
One-Layer Transformers are Provably Optimal for In-context Reasoning and Distributional Association Learning in Next-Token Prediction Tasks
Quan Nguyen
Thanh Nguyen-Tang
MLT
108
0
0
21 May 2025
Understanding In-context Learning of Addition via Activation Subspaces
Xinyan Hu
Kayo Yin
Michael I. Jordan
Jacob Steinhardt
Lijie Chen
166
3
0
08 May 2025
Understanding the Skill Gap in Recurrent Language Models: The Role of the Gather-and-Aggregate Mechanism
Aviv Bick
Eric P. Xing
Albert Gu
RALM
166
1
0
22 Apr 2025
It's All Connected: A Journey Through Test-Time Memorization, Attentional Bias, Retention, and Online Optimization
Ali Behrouz
Meisam Razaviyayn
Peilin Zhong
Vahab Mirrokni
128
5
0
17 Apr 2025
Approximation Bounds for Transformer Networks with Application to Regression
Yuling Jiao
Yanming Lai
Defeng Sun
Yang Wang
Bokai Yan
147
0
0
16 Apr 2025
Taming Knowledge Conflicts in Language Models
Gaotang Li
Yuzhong Chen
Hanghang Tong
KELM
98
2
0
14 Mar 2025
Real-Time Personalization with Simple Transformers
Lin An
Andrew A. Li
Vaisnavi Nemala
Gabriel Visotsky
92
0
0
01 Mar 2025
Hyper-SET: Designing Transformers via Hyperspherical Energy Minimization
Yunzhe Hu
Difan Zou
Dong Xu
165
1
0
17 Feb 2025
Are Transformers Able to Reason by Connecting Separated Knowledge in Training Data?
Yutong Yin
Zhaoran Wang
LRM
ReLM
502
1
0
27 Jan 2025
Concept-ROT: Poisoning Concepts in Large Language Models with Model Editing
Keltin Grimes
Marco Christiani
David Shriver
Marissa Connor
KELM
145
4
0
17 Dec 2024
Rethinking Associative Memory Mechanism in Induction Head
Shuo Wang
Issei Sato
206
0
0
16 Dec 2024
An In-depth Investigation of Sparse Rate Reduction in Transformer-like Models
Yunzhe Hu
Difan Zou
Dong Xu
149
1
0
26 Nov 2024
Leveraging Large Language Models for Enhancing Public Transit Services
Jiahao Wang
Amer Shalaby
56
2
0
18 Oct 2024
Active-Dormant Attention Heads: Mechanistically Demystifying Extreme-Token Phenomena in LLMs
Tianyu Guo
Druv Pai
Yu Bai
Jiantao Jiao
Michael I. Jordan
Song Mei
84
14
0
17 Oct 2024
Bypassing the Exponential Dependency: Looped Transformers Efficiently Learn In-context by Multi-step Gradient Descent
Bo Chen
Xiaoyu Li
Yingyu Liang
Zhenmei Shi
Zhao Song
158
22
0
15 Oct 2024
Learning Linear Attention in Polynomial Time
Morris Yau
Ekin Akyürek
Jiayuan Mao
Joshua B. Tenenbaum
Stefanie Jegelka
Jacob Andreas
57
2
0
14 Oct 2024
Zero-Shot Generalization of Vision-Based RL Without Data Augmentation
Sumeet Batra
Gaurav Sukhatme
OffRL
DRL
85
2
0
09 Oct 2024
Transformers learn variable-order Markov chains in-context
Ruida Zhou
C. Tian
Suhas Diggavi
80
1
0
07 Oct 2024
DAPE V2: Process Attention Score as Feature Map for Length Extrapolation
Chuanyang Zheng
Yihang Gao
Han Shi
Jing Xiong
Jiankai Sun
...
Xiaozhe Ren
Michael Ng
Xin Jiang
Zhenguo Li
Yu Li
87
3
0
07 Oct 2024
Density estimation with LLMs: a geometric investigation of in-context learning trajectories
Toni J. B. Liu
Nicolas Boullé
Raphaël Sarfati
Christopher Earls
100
1
0
07 Oct 2024
Large Language Models as Markov Chains
Oussama Zekri
Ambroise Odonnat
Abdelhakim Benechehab
Linus Bleistein
Nicolas Boullé
I. Redko
132
15
0
03 Oct 2024
Attention layers provably solve single-location regression
Pierre Marion
Raphael Berthier
Gérard Biau
Claire Boyer
490
7
0
02 Oct 2024
Attention Heads of Large Language Models: A Survey
Zifan Zheng
Yezhaohui Wang
Yuxin Huang
Shichao Song
Mingchuan Yang
Bo Tang
Feiyu Xiong
Zhiyu Li
LRM
129
29
0
05 Sep 2024
One-layer transformers fail to solve the induction heads task
Clayton Sanford
Daniel J. Hsu
Matus Telgarsky
108
12
0
26 Aug 2024
Spin glass model of in-context learning
Yuhao Li
Ruoran Bai
Haiping Huang
LRM
160
0
0
05 Aug 2024
MCGMark: An Encodable and Robust Online Watermark for Tracing LLM-Generated Malicious Code
Peng Ding
Jingyu Wu
Qingyuan Zhong
Dan Ma
Xunliang Cai
...
Shi Chen
Weizhe Zhang
Zibin Zheng
117
0
0
02 Aug 2024
Transformers on Markov Data: Constant Depth Suffices
Nived Rajaraman
Marco Bondaschi
Kannan Ramchandran
Michael C. Gastpar
Ashok Vardhan Makkuva
89
7
0
25 Jul 2024
Empirical Capacity Model for Self-Attention Neural Networks
Aki Härmä
M. Pietrasik
Anna Wilbik
83
2
0
22 Jul 2024
A Practical Review of Mechanistic Interpretability for Transformer-Based Language Models
Daking Rai
Yilun Zhou
Shi Feng
Abulhair Saparov
Ziyu Yao
203
36
0
02 Jul 2024
Do LLMs dream of elephants (when told not to)? Latent concept association and associative memory in transformers
Yibo Jiang
Goutham Rajendran
Pradeep Ravikumar
Bryon Aragam
CLL
KELM
104
8
0
26 Jun 2024
Exact Conversion of In-Context Learning to Model Weights in Linearized-Attention Transformers
Brian K Chen
Tianyang Hu
Hui Jin
Hwee Kuan Lee
Kenji Kawaguchi
97
2
0
05 Jun 2024
Contextual Counting: A Mechanistic Study of Transformers on a Quantitative Task
Siavash Golkar
Alberto Bietti
Mariel Pettee
Michael Eickenberg
M. Cranmer
...
Ruben Ohana
Liam Parker
Bruno Régaldo-Saint Blancard
Kyunghyun Cho
Shirley Ho
94
2
0
30 May 2024
TAIA: Large Language Models are Out-of-Distribution Data Learners
Shuyang Jiang
Yusheng Liao
Ya Zhang
Yu Wang
Yanfeng Wang
93
5
0
30 May 2024
Why Larger Language Models Do In-context Learning Differently?
Zhenmei Shi
Junyi Wei
Zhuoyan Xu
Yingyu Liang
83
27
0
30 May 2024
CHANI: Correlation-based Hawkes Aggregation of Neurons with bio-Inspiration
Sophie Jaffard
Samuel Vaiter
Patricia Reynaud-Bouret
164
0
0
29 May 2024
IM-Context: In-Context Learning for Imbalanced Regression Tasks
Ismail Nejjar
Faez Ahmed
Olga Fink
91
1
0
28 May 2024
Asymptotic theory of in-context learning by linear attention
Yue M. Lu
Mary I. Letey
Jacob A. Zavatone-Veth
Anindita Maiti
Cengiz Pehlevan
123
20
0
20 May 2024
Memory Mosaics
Jianyu Zhang
Niklas Nolte
Ranajoy Sadhukhan
Beidi Chen
Léon Bottou
VLM
156
5
0
10 May 2024
Towards a Theoretical Understanding of the 'Reversal Curse' via Training Dynamics
Hanlin Zhu
Baihe Huang
Shaolun Zhang
Michael I. Jordan
Jiantao Jiao
Yuandong Tian
Stuart Russell
LRM
AI4CE
122
18
0
07 May 2024
BiSHop: Bi-Directional Cellular Learning for Tabular Data with Generalized Sparse Modern Hopfield Model
Chenwei Xu
Yu-Chao Huang
Jerry Yao-Chieh Hu
Weijian Li
Ammar Gilani
H. Goan
Han Liu
87
21
0
04 Apr 2024
Outlier-Efficient Hopfield Layers for Large Transformer-Based Models
Jerry Yao-Chieh Hu
Pei-Hsuan Chang
Haozheng Luo
Hong-Yu Chen
Weijian Li
Wei-Po Wang
Han Liu
98
29
0
04 Apr 2024
The Garden of Forking Paths: Observing Dynamic Parameters Distribution in Large Language Models
Carlo Nicolini
Jacopo Staiano
Bruno Lepri
Raffaele Marino
MoE
72
1
0
13 Mar 2024
Heavy-Tailed Class Imbalance and Why Adam Outperforms Gradient Descent on Language Models
Frederik Kunstner
Robin Yadav
Alan Milligan
Mark Schmidt
Alberto Bietti
111
34
0
29 Feb 2024