ResearchTrend.AI — Cited By listing for arXiv:2408.15417
© 2025 ResearchTrend.AI, All rights reserved.
Implicit Geometry of Next-token Prediction: From Language Sparsity Patterns to Model Representations
Yize Zhao, Tina Behnia, V. Vakilian, Christos Thrampoulidis (20 February 2025)

Papers citing "Implicit Geometry of Next-token Prediction: From Language Sparsity Patterns to Model Representations"

48 papers shown
On the Effect of Negative Gradient in Group Relative Deep Reinforcement Optimization
Wenlong Deng, Yi Ren, Muchen Li, Danica J. Sutherland, Xiaoxiao Li, Christos Thrampoulidis (24 May 2025)

Neural Collapse is Globally Optimal in Deep Regularized ResNets and Transformers
Peter Súkeník, Christoph H. Lampert, Marco Mondelli (21 May 2025)

On the Geometry of Semantics in Next-token Prediction
Yize Zhao, Christos Thrampoulidis (13 May 2025)

Gating is Weighting: Understanding Gated Linear Attention through In-context Learning
Yingcong Li, Davoud Ataee Tarzanagh, A. S. Rawat, Maryam Fazel, Samet Oymak (06 Apr 2025)

Ensemble Debiasing Across Class and Sample Levels for Fairer Prompting Accuracy
Ruixi Lin, Ziqiao Wang, Yang You (07 Mar 2025)

Reasoning Bias of Next Token Prediction Training
Pengxiao Lin, Zhongwang Zhang, Zhi-Qin John Xu (21 Feb 2025)

The Geometry of Tokens in Internal Representations of Large Language Models
Karthik Viswanathan, Yuri Gardinazzi, Giada Panerai, Alberto Cazzaniga, Matteo Biagetti (17 Jan 2025)
OATS: Outlier-Aware Pruning Through Sparse and Low Rank Decomposition
Stephen Zhang, Vardan Papyan (20 Sep 2024)

Linguistic Collapse: Neural Collapse in (Large) Language Models
Robert Wu, Vardan Papyan (28 May 2024)

The Evolution of Statistical Induction Heads: In-Context Learning Markov Chains
Benjamin L. Edelman, Ezra Edelman, Surbhi Goel, Eran Malach, Nikolaos Tsilivis (16 Feb 2024)

Pushing Boundaries: Mixup's Influence on Neural Collapse
Quinn Fisher, Haoming Meng, Vardan Papyan (09 Feb 2024)

Anisotropy Is Inherent to Self-Attention in Transformers
Nathan Godey, Eric Villemonte de la Clergerie, Benoît Sagot (22 Jan 2024)

Neural Collapse in Multi-label Learning with Pick-all-label Loss
Pengyu Li, Xiao Li, Yutong Wang, Qing Qu (24 Oct 2023)

On the Implicit Bias of Adam
M. D. Cattaneo, Jason M. Klusowski, Boris Shigida (31 Aug 2023)

Grokking of Hierarchical Structure in Vanilla Transformers
Shikhar Murty, Pratyusha Sharma, Jacob Andreas, Christopher D. Manning (30 May 2023)
TinyStories: How Small Can Language Models Be and Still Speak Coherent English?
Ronen Eldan, Yuan-Fang Li (12 May 2023)

On the Implicit Geometry of Cross-Entropy Parameterizations for Label-Imbalanced Data
Tina Behnia, Ganesh Ramachandra Kini, V. Vakilian, Christos Thrampoulidis (14 Mar 2023)

Same Pre-training Loss, Better Downstream: Implicit Bias Matters for Language Models
Hong Liu, Sang Michael Xie, Zhiyuan Li, Tengyu Ma (25 Oct 2022)

Imbalance Trouble: Revisiting Neural-Collapse Geometry
Christos Thrampoulidis, Ganesh Ramachandra Kini, V. Vakilian, Tina Behnia (10 Aug 2022)

Mirror Descent Maximizes Generalized Margin and Can Be Implemented Efficiently
Haoyuan Sun, Kwangjun Ahn, Christos Thrampoulidis, Navid Azizan (25 May 2022)

Grokking: Generalization Beyond Overfitting on Small Algorithmic Datasets
Alethea Power, Yuri Burda, Harrison Edwards, Igor Babuschkin, Vedant Misra (06 Jan 2022)

All Bark and No Bite: Rogue Dimensions in Transformer Language Models Obscure Representational Quality
William Timkey, Marten van Schijndel (09 Sep 2021)
Implicit Bias of SGD for Diagonal Linear Networks: a Provable Benefit of Stochasticity
Scott Pesme, Loucas Pillaud-Vivien, Nicolas Flammarion (17 Jun 2021)

A Geometric Analysis of Neural Collapse with Unconstrained Features
Zhihui Zhu, Tianyu Ding, Jinxin Zhou, Xiao Li, Chong You, Jeremias Sulam, Qing Qu (06 May 2021)

Exploring Deep Neural Networks via Layer-Peeled Model: Minority Collapse in Imbalanced Training
Cong Fang, Hangfeng He, Qi Long, Weijie J. Su (29 Jan 2021)

Implicit Regularization in ReLU Networks with the Square Loss
Gal Vardi, Ohad Shamir (09 Dec 2020)

Neural collapse with unconstrained features
D. Mixon, Hans Parshall, Jianzong Pi (23 Nov 2020)

A Mathematical Exploration of Why Language Models Help Solve Downstream Tasks
Nikunj Saunshi, Sadhika Malladi, Sanjeev Arora (07 Oct 2020)

Prevalence of Neural Collapse during the terminal phase of deep learning training
Vardan Papyan, Xuemei Han, D. Donoho (18 Aug 2020)
Gradient descent follows the regularization path for general losses
Ziwei Ji, Miroslav Dudík, Robert Schapire, Matus Telgarsky (19 Jun 2020)

Directional convergence and alignment in deep learning
Ziwei Ji, Matus Telgarsky (11 Jun 2020)

Language Models are Few-Shot Learners
Tom B. Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared Kaplan, ..., Christopher Berner, Sam McCandlish, Alec Radford, Ilya Sutskever, Dario Amodei (28 May 2020)

How Contextual are Contextualized Word Representations? Comparing the Geometry of BERT, ELMo, and GPT-2 Embeddings
Kawin Ethayarajh (02 Sep 2019)

Representation Degeneration Problem in Training Natural Language Generation Models
Jun Gao, Di He, Xu Tan, Tao Qin, Liwei Wang, Tie-Yan Liu (28 Jul 2019)

Gradient Descent Maximizes the Margin of Homogeneous Neural Networks
Kaifeng Lyu, Jian Li (13 Jun 2019)

Stochastic Mirror Descent on Overparameterized Nonlinear Models: Convergence, Implicit Regularization, and Generalization
Navid Azizan, Sahin Lale, B. Hassibi (10 Jun 2019)
Analogies Explained: Towards Understanding Word Embeddings
Carl Allen, Timothy M. Hospedales (28 Jan 2019)

Towards Understanding Linear Word Analogies
Kawin Ethayarajh, David Duvenaud, Graeme Hirst (11 Oct 2018)

Implicit Bias of Gradient Descent on Linear Convolutional Networks
Suriya Gunasekar, Jason D. Lee, Daniel Soudry, Nathan Srebro (01 Jun 2018)

Risk and parameter convergence of logistic regression
Ziwei Ji, Matus Telgarsky (20 Mar 2018)

Convergence of Gradient Descent on Separable Data
Mor Shpigel Nacson, Jason D. Lee, Suriya Gunasekar, Pedro H. P. Savarese, Nathan Srebro, Daniel Soudry (05 Mar 2018)

Characterizing Implicit Bias in Terms of Optimization Geometry
Suriya Gunasekar, Jason D. Lee, Daniel Soudry, Nathan Srebro (22 Feb 2018)

Breaking the Softmax Bottleneck: A High-Rank RNN Language Model
Zhilin Yang, Zihang Dai, Ruslan Salakhutdinov, William W. Cohen (10 Nov 2017)

The Implicit Bias of Gradient Descent on Separable Data
Daniel Soudry, Elad Hoffer, Mor Shpigel Nacson, Suriya Gunasekar, Nathan Srebro (27 Oct 2017)
All-but-the-Top: Simple and Effective Postprocessing for Word Representations
Jiaqi Mu, S. Bhat, Pramod Viswanath (05 Feb 2017)

Distributed Representations of Words and Phrases and their Compositionality
Tomas Mikolov, Ilya Sutskever, Kai Chen, G. Corrado, J. Dean (16 Oct 2013)

Efficient Estimation of Word Representations in Vector Space
Tomas Mikolov, Kai Chen, G. Corrado, J. Dean (16 Jan 2013)

Guaranteed Minimum-Rank Solutions of Linear Matrix Equations via Nuclear Norm Minimization
Benjamin Recht, Maryam Fazel, P. Parrilo (28 Jun 2007)