ResearchTrend.AI — Cited By listing for arXiv:2408.15417
© 2025 ResearchTrend.AI, All rights reserved.
Implicit Geometry of Next-token Prediction: From Language Sparsity Patterns to Model Representations
Yize Zhao, Tina Behnia, V. Vakilian, Christos Thrampoulidis (20 February 2025)

Papers citing "Implicit Geometry of Next-token Prediction: From Language Sparsity Patterns to Model Representations"

48 papers shown
On the Effect of Negative Gradient in Group Relative Deep Reinforcement Optimization
Wenlong Deng, Yi Ren, Muchen Li, Danica J. Sutherland, Xiaoxiao Li, Christos Thrampoulidis (24 May 2025)

Neural Collapse is Globally Optimal in Deep Regularized ResNets and Transformers
Peter Súkeník, Christoph H. Lampert, Marco Mondelli (21 May 2025)

On the Geometry of Semantics in Next-token Prediction
Yize Zhao, Christos Thrampoulidis (13 May 2025)

Gating is Weighting: Understanding Gated Linear Attention through In-context Learning
Yingcong Li, Davoud Ataee Tarzanagh, A. S. Rawat, Maryam Fazel, Samet Oymak (06 Apr 2025)

Ensemble Debiasing Across Class and Sample Levels for Fairer Prompting Accuracy
Ruixi Lin, Ziqiao Wang, Yang You (07 Mar 2025)

Reasoning Bias of Next Token Prediction Training
Pengxiao Lin, Zhongwang Zhang, Zhi-Qin John Xu (21 Feb 2025)

The Geometry of Tokens in Internal Representations of Large Language Models
Karthik Viswanathan, Yuri Gardinazzi, Giada Panerai, Alberto Cazzaniga, Matteo Biagetti (17 Jan 2025)
OATS: Outlier-Aware Pruning Through Sparse and Low Rank Decomposition
Stephen Zhang, Vardan Papyan (20 Sep 2024)

Linguistic Collapse: Neural Collapse in (Large) Language Models
Robert Wu, Vardan Papyan (28 May 2024)

The Evolution of Statistical Induction Heads: In-Context Learning Markov Chains
Benjamin L. Edelman, Ezra Edelman, Surbhi Goel, Eran Malach, Nikolaos Tsilivis (16 Feb 2024)

Pushing Boundaries: Mixup's Influence on Neural Collapse
Quinn Fisher, Haoming Meng, Vardan Papyan (09 Feb 2024)

Anisotropy Is Inherent to Self-Attention in Transformers
Nathan Godey, Eric Villemonte de la Clergerie, Benoît Sagot (22 Jan 2024)

Neural Collapse in Multi-label Learning with Pick-all-label Loss
Pengyu Li, Xiao Li, Yutong Wang, Qing Qu (24 Oct 2023)

On the Implicit Bias of Adam
M. D. Cattaneo, Jason M. Klusowski, Boris Shigida (31 Aug 2023)

Grokking of Hierarchical Structure in Vanilla Transformers
Shikhar Murty, Pratyusha Sharma, Jacob Andreas, Christopher D. Manning (30 May 2023)
TinyStories: How Small Can Language Models Be and Still Speak Coherent English?
Ronen Eldan, Yuan-Fang Li (12 May 2023)

On the Implicit Geometry of Cross-Entropy Parameterizations for Label-Imbalanced Data
Tina Behnia, Ganesh Ramachandra Kini, V. Vakilian, Christos Thrampoulidis (14 Mar 2023)

Same Pre-training Loss, Better Downstream: Implicit Bias Matters for Language Models
Hong Liu, Sang Michael Xie, Zhiyuan Li, Tengyu Ma (25 Oct 2022)

Imbalance Trouble: Revisiting Neural-Collapse Geometry
Christos Thrampoulidis, Ganesh Ramachandra Kini, V. Vakilian, Tina Behnia (10 Aug 2022)

Mirror Descent Maximizes Generalized Margin and Can Be Implemented Efficiently
Haoyuan Sun, Kwangjun Ahn, Christos Thrampoulidis, Navid Azizan (25 May 2022)

Grokking: Generalization Beyond Overfitting on Small Algorithmic Datasets
Alethea Power, Yuri Burda, Harrison Edwards, Igor Babuschkin, Vedant Misra (06 Jan 2022)

All Bark and No Bite: Rogue Dimensions in Transformer Language Models Obscure Representational Quality
William Timkey, Marten van Schijndel (09 Sep 2021)
Implicit Bias of SGD for Diagonal Linear Networks: a Provable Benefit of Stochasticity
Scott Pesme, Loucas Pillaud-Vivien, Nicolas Flammarion (17 Jun 2021)

A Geometric Analysis of Neural Collapse with Unconstrained Features
Zhihui Zhu, Tianyu Ding, Jinxin Zhou, Xiao Li, Chong You, Jeremias Sulam, Qing Qu (06 May 2021)

Exploring Deep Neural Networks via Layer-Peeled Model: Minority Collapse in Imbalanced Training
Cong Fang, Hangfeng He, Qi Long, Weijie J. Su (29 Jan 2021)

Implicit Regularization in ReLU Networks with the Square Loss
Gal Vardi, Ohad Shamir (09 Dec 2020)

Neural collapse with unconstrained features
D. Mixon, Hans Parshall, Jianzong Pi (23 Nov 2020)

A Mathematical Exploration of Why Language Models Help Solve Downstream Tasks
Nikunj Saunshi, Sadhika Malladi, Sanjeev Arora (07 Oct 2020)

Prevalence of Neural Collapse during the terminal phase of deep learning training
Vardan Papyan, Xuemei Han, D. Donoho (18 Aug 2020)
Gradient descent follows the regularization path for general losses
Ziwei Ji, Miroslav Dudík, Robert Schapire, Matus Telgarsky (19 Jun 2020)

Directional convergence and alignment in deep learning
Ziwei Ji, Matus Telgarsky (11 Jun 2020)

Language Models are Few-Shot Learners
Tom B. Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared Kaplan, ..., Christopher Berner, Sam McCandlish, Alec Radford, Ilya Sutskever, Dario Amodei (28 May 2020)

How Contextual are Contextualized Word Representations? Comparing the Geometry of BERT, ELMo, and GPT-2 Embeddings
Kawin Ethayarajh (02 Sep 2019)

Representation Degeneration Problem in Training Natural Language Generation Models
Jun Gao, Di He, Xu Tan, Tao Qin, Liwei Wang, Tie-Yan Liu (28 Jul 2019)

Gradient Descent Maximizes the Margin of Homogeneous Neural Networks
Kaifeng Lyu, Jian Li (13 Jun 2019)

Stochastic Mirror Descent on Overparameterized Nonlinear Models: Convergence, Implicit Regularization, and Generalization
Navid Azizan, Sahin Lale, B. Hassibi (10 Jun 2019)
Analogies Explained: Towards Understanding Word Embeddings
Carl Allen, Timothy M. Hospedales (28 Jan 2019)

Towards Understanding Linear Word Analogies
Kawin Ethayarajh, David Duvenaud, Graeme Hirst (11 Oct 2018)

Implicit Bias of Gradient Descent on Linear Convolutional Networks
Suriya Gunasekar, Jason D. Lee, Daniel Soudry, Nathan Srebro (01 Jun 2018)

Risk and parameter convergence of logistic regression
Ziwei Ji, Matus Telgarsky (20 Mar 2018)

Convergence of Gradient Descent on Separable Data
Mor Shpigel Nacson, Jason D. Lee, Suriya Gunasekar, Pedro H. P. Savarese, Nathan Srebro, Daniel Soudry (05 Mar 2018)

Characterizing Implicit Bias in Terms of Optimization Geometry
Suriya Gunasekar, Jason D. Lee, Daniel Soudry, Nathan Srebro (22 Feb 2018)

Breaking the Softmax Bottleneck: A High-Rank RNN Language Model
Zhilin Yang, Zihang Dai, Ruslan Salakhutdinov, William W. Cohen (10 Nov 2017)

The Implicit Bias of Gradient Descent on Separable Data
Daniel Soudry, Elad Hoffer, Mor Shpigel Nacson, Suriya Gunasekar, Nathan Srebro (27 Oct 2017)
All-but-the-Top: Simple and Effective Postprocessing for Word Representations
Jiaqi Mu, S. Bhat, Pramod Viswanath (05 Feb 2017)

Distributed Representations of Words and Phrases and their Compositionality
Tomas Mikolov, Ilya Sutskever, Kai Chen, G. Corrado, J. Dean (16 Oct 2013)

Efficient Estimation of Word Representations in Vector Space
Tomas Mikolov, Kai Chen, G. Corrado, J. Dean (16 Jan 2013)

Guaranteed Minimum-Rank Solutions of Linear Matrix Equations via Nuclear Norm Minimization
Benjamin Recht, Maryam Fazel, P. Parrilo (28 Jun 2007)