2408.15417
Cited By
Implicit Geometry of Next-token Prediction: From Language Sparsity Patterns to Model Representations
20 February 2025
Yize Zhao
Tina Behnia
V. Vakilian
Christos Thrampoulidis
Papers citing "Implicit Geometry of Next-token Prediction: From Language Sparsity Patterns to Model Representations"
48 / 48 papers shown
Title
On the Effect of Negative Gradient in Group Relative Deep Reinforcement Optimization
Wenlong Deng
Yi Ren
Muchen Li
Danica J. Sutherland
Xiaoxiao Li
Christos Thrampoulidis
61
0
0
24 May 2025
Neural Collapse is Globally Optimal in Deep Regularized ResNets and Transformers
Peter Súkeník
Christoph H. Lampert
Marco Mondelli
56
0
0
21 May 2025
On the Geometry of Semantics in Next-token Prediction
Yize Zhao
Christos Thrampoulidis
50
0
0
13 May 2025
Gating is Weighting: Understanding Gated Linear Attention through In-context Learning
Yingcong Li
Davoud Ataee Tarzanagh
A. S. Rawat
Maryam Fazel
Samet Oymak
39
1
0
06 Apr 2025
Ensemble Debiasing Across Class and Sample Levels for Fairer Prompting Accuracy
Ruixi Lin
Ziqiao Wang
Yang You
FaML
124
1
0
07 Mar 2025
Reasoning Bias of Next Token Prediction Training
Pengxiao Lin
Zhongwang Zhang
Zhi-Qin John Xu
LRM
181
2
0
21 Feb 2025
The Geometry of Tokens in Internal Representations of Large Language Models
Karthik Viswanathan
Yuri Gardinazzi
Giada Panerai
Alberto Cazzaniga
Matteo Biagetti
AIFin
143
7
0
17 Jan 2025
OATS: Outlier-Aware Pruning Through Sparse and Low Rank Decomposition
Stephen Zhang
Vardan Papyan
VLM
135
3
0
20 Sep 2024
Linguistic Collapse: Neural Collapse in (Large) Language Models
Robert Wu
Vardan Papyan
87
16
0
28 May 2024
The Evolution of Statistical Induction Heads: In-Context Learning Markov Chains
Benjamin L. Edelman
Ezra Edelman
Surbhi Goel
Eran Malach
Nikolaos Tsilivis
BDL
75
55
0
16 Feb 2024
Pushing Boundaries: Mixup's Influence on Neural Collapse
Quinn Fisher
Haoming Meng
Vardan Papyan
AAML
UQCV
68
5
0
09 Feb 2024
Anisotropy Is Inherent to Self-Attention in Transformers
Nathan Godey
Eric Villemonte de la Clergerie
Benoît Sagot
43
19
0
22 Jan 2024
Neural Collapse in Multi-label Learning with Pick-all-label Loss
Pengyu Li
Xiao Li
Yutong Wang
Qing Qu
54
9
0
24 Oct 2023
On the Implicit Bias of Adam
M. D. Cattaneo
Jason M. Klusowski
Boris Shigida
72
18
0
31 Aug 2023
Grokking of Hierarchical Structure in Vanilla Transformers
Shikhar Murty
Pratyusha Sharma
Jacob Andreas
Christopher D. Manning
87
48
0
30 May 2023
TinyStories: How Small Can Language Models Be and Still Speak Coherent English?
Ronen Eldan
Yuanzhi Li
SyDa
LRM
75
266
0
12 May 2023
On the Implicit Geometry of Cross-Entropy Parameterizations for Label-Imbalanced Data
Tina Behnia
Ganesh Ramachandra Kini
V. Vakilian
Christos Thrampoulidis
85
17
0
14 Mar 2023
Same Pre-training Loss, Better Downstream: Implicit Bias Matters for Language Models
Hong Liu
Sang Michael Xie
Zhiyuan Li
Tengyu Ma
AI4CE
116
55
0
25 Oct 2022
Imbalance Trouble: Revisiting Neural-Collapse Geometry
Christos Thrampoulidis
Ganesh Ramachandra Kini
V. Vakilian
Tina Behnia
58
73
0
10 Aug 2022
Mirror Descent Maximizes Generalized Margin and Can Be Implemented Efficiently
Haoyuan Sun
Kwangjun Ahn
Christos Thrampoulidis
Navid Azizan
OOD
47
21
0
25 May 2022
Grokking: Generalization Beyond Overfitting on Small Algorithmic Datasets
Alethea Power
Yuri Burda
Harrison Edwards
Igor Babuschkin
Vedant Misra
83
364
0
06 Jan 2022
All Bark and No Bite: Rogue Dimensions in Transformer Language Models Obscure Representational Quality
William Timkey
Marten van Schijndel
274
116
0
09 Sep 2021
Implicit Bias of SGD for Diagonal Linear Networks: a Provable Benefit of Stochasticity
Scott Pesme
Loucas Pillaud-Vivien
Nicolas Flammarion
54
106
0
17 Jun 2021
A Geometric Analysis of Neural Collapse with Unconstrained Features
Zhihui Zhu
Tianyu Ding
Jinxin Zhou
Xiao Li
Chong You
Jeremias Sulam
Qing Qu
69
205
0
06 May 2021
Exploring Deep Neural Networks via Layer-Peeled Model: Minority Collapse in Imbalanced Training
Cong Fang
Hangfeng He
Qi Long
Weijie J. Su
FAtt
167
172
0
29 Jan 2021
Implicit Regularization in ReLU Networks with the Square Loss
Gal Vardi
Ohad Shamir
55
51
0
09 Dec 2020
Neural collapse with unconstrained features
D. Mixon
Hans Parshall
Jianzong Pi
72
121
0
23 Nov 2020
A Mathematical Exploration of Why Language Models Help Solve Downstream Tasks
Nikunj Saunshi
Sadhika Malladi
Sanjeev Arora
82
88
0
07 Oct 2020
Prevalence of Neural Collapse during the terminal phase of deep learning training
Vardan Papyan
Xuemei Han
D. Donoho
205
578
0
18 Aug 2020
Gradient descent follows the regularization path for general losses
Ziwei Ji
Miroslav Dudík
Robert Schapire
Matus Telgarsky
AI4CE
FaML
138
62
0
19 Jun 2020
Directional convergence and alignment in deep learning
Ziwei Ji
Matus Telgarsky
59
171
0
11 Jun 2020
Language Models are Few-Shot Learners
Tom B. Brown
Benjamin Mann
Nick Ryder
Melanie Subbiah
Jared Kaplan
...
Christopher Berner
Sam McCandlish
Alec Radford
Ilya Sutskever
Dario Amodei
BDL
838
42,332
0
28 May 2020
How Contextual are Contextualized Word Representations? Comparing the Geometry of BERT, ELMo, and GPT-2 Embeddings
Kawin Ethayarajh
86
875
0
02 Sep 2019
Representation Degeneration Problem in Training Natural Language Generation Models
Jun Gao
Di He
Xu Tan
Tao Qin
Liwei Wang
Tie-Yan Liu
62
270
0
28 Jul 2019
Gradient Descent Maximizes the Margin of Homogeneous Neural Networks
Kaifeng Lyu
Jian Li
89
336
0
13 Jun 2019
Stochastic Mirror Descent on Overparameterized Nonlinear Models: Convergence, Implicit Regularization, and Generalization
Navid Azizan
Sahin Lale
B. Hassibi
146
73
0
10 Jun 2019
Analogies Explained: Towards Understanding Word Embeddings
Carl Allen
Timothy M. Hospedales
69
144
0
28 Jan 2019
Towards Understanding Linear Word Analogies
Kawin Ethayarajh
David Duvenaud
Graeme Hirst
67
116
0
11 Oct 2018
Implicit Bias of Gradient Descent on Linear Convolutional Networks
Suriya Gunasekar
Jason D. Lee
Daniel Soudry
Nathan Srebro
MDE
124
413
0
01 Jun 2018
Risk and parameter convergence of logistic regression
Ziwei Ji
Matus Telgarsky
73
130
0
20 Mar 2018
Convergence of Gradient Descent on Separable Data
Mor Shpigel Nacson
Jason D. Lee
Suriya Gunasekar
Pedro H. P. Savarese
Nathan Srebro
Daniel Soudry
76
169
0
05 Mar 2018
Characterizing Implicit Bias in Terms of Optimization Geometry
Suriya Gunasekar
Jason D. Lee
Daniel Soudry
Nathan Srebro
AI4CE
73
410
0
22 Feb 2018
Breaking the Softmax Bottleneck: A High-Rank RNN Language Model
Zhilin Yang
Zihang Dai
Ruslan Salakhutdinov
William W. Cohen
BDL
71
372
0
10 Nov 2017
The Implicit Bias of Gradient Descent on Separable Data
Daniel Soudry
Elad Hoffer
Mor Shpigel Nacson
Suriya Gunasekar
Nathan Srebro
161
921
0
27 Oct 2017
All-but-the-Top: Simple and Effective Postprocessing for Word Representations
Jiaqi Mu
S. Bhat
Pramod Viswanath
79
311
0
05 Feb 2017
Distributed Representations of Words and Phrases and their Compositionality
Tomas Mikolov
Ilya Sutskever
Kai Chen
G. Corrado
J. Dean
NAI
OCL
399
33,550
0
16 Oct 2013
Efficient Estimation of Word Representations in Vector Space
Tomas Mikolov
Kai Chen
G. Corrado
J. Dean
3DV
680
31,538
0
16 Jan 2013
Guaranteed Minimum-Rank Solutions of Linear Matrix Equations via Nuclear Norm Minimization
Benjamin Recht
Maryam Fazel
P. Parrilo
419
3,771
0
28 Jun 2007