Scaling Laws and Representation Learning in Simple Hierarchical Languages: Transformers vs. Convolutional Architectures
Francesco Cagnetta, Alessandro Favero, Antonio Sclocchi, Matthieu Wyart
arXiv: 2505.07070 · 11 May 2025

Papers citing "Scaling Laws and Representation Learning in Simple Hierarchical Languages: Transformers vs. Convolutional Architectures"

23 of 23 citing papers shown (title · authors · topic tags, where present · site metrics · date):

How Compositional Generalization and Creativity Improve as Diffusion Models are Trained
Alessandro Favero, Antonio Sclocchi, Francesco Cagnetta, Pascal Frossard, Matthieu Wyart
DiffM, CoGe · 83 · 6 · 0 · 17 Feb 2025

A distributional simplicity bias in the learning dynamics of transformers
Riccardo Rende, Federica Gerace, Alessandro Laio, Sebastian Goldt
120 · 9 · 0 · 17 Feb 2025

Probing the Latent Hierarchical Structure of Data via Diffusion Models
Antonio Sclocchi, Alessandro Favero, Noam Itzhak Levi, Matthieu Wyart
DiffM · 95 · 5 · 0 · 17 Oct 2024

How Feature Learning Can Improve Neural Scaling Laws
Blake Bordelon, Alexander B. Atanasov, Cengiz Pehlevan
109 · 17 · 0 · 26 Sep 2024

How transformers learn structured data: insights from hierarchical filtering
Jerome Garnier-Brun, Marc Mézard, Emanuele Moscato, Luca Saglietti
134 · 6 · 0 · 27 Aug 2024

What Languages are Easy to Language-Model? A Perspective from Learning Probabilistic Regular Languages
Nadav Borenstein, Anej Svete, R. Chan, Josef Valvoda, Franz Nowak, Isabelle Augenstein, Eleanor Chodroff, Ryan Cotterell
77 · 13 · 0 · 06 Jun 2024

Transformers represent belief state geometry in their residual stream
A. Shai, Sarah E. Marzen, Lucas Teixeira, Alexander Gietelink Oldenziel, P. Riechers
AI4CE · 67 · 17 · 0 · 24 May 2024

A Dynamical Model of Neural Scaling Laws
Blake Bordelon, Alexander B. Atanasov, Cengiz Pehlevan
96 · 44 · 0 · 02 Feb 2024

How Two-Layer Neural Networks Learn, One (Giant) Step at a Time
Yatin Dandi, Florent Krzakala, Bruno Loureiro, Luca Pesce, Ludovic Stephan
MLT · 95 · 29 · 0 · 29 May 2023

Physics of Language Models: Part 1, Learning Hierarchical Language Structures
Zeyuan Allen-Zhu, Yuanzhi Li
112 · 21 · 0 · 23 May 2023

Learning Single-Index Models with Shallow Neural Networks
A. Bietti, Joan Bruna, Clayton Sanford, M. Song
203 · 71 · 0 · 27 Oct 2022

High-dimensional Asymptotics of Feature Learning: How One Gradient Step Improves the Representation
Jimmy Ba, Murat A. Erdogdu, Taiji Suzuki, Zhichao Wang, Denny Wu, Greg Yang
MLT · 91 · 129 · 0 · 03 May 2022

Training Compute-Optimal Large Language Models
Jordan Hoffmann, Sebastian Borgeaud, A. Mensch, Elena Buchatskaya, Trevor Cai, ..., Karen Simonyan, Erich Elsen, Jack W. Rae, Oriol Vinyals, Laurent Sifre
AI4TS · 208 · 1,987 · 0 · 29 Mar 2022

Locality defeats the curse of dimensionality in convolutional teacher-student scenarios
Alessandro Favero, Francesco Cagnetta, Matthieu Wyart
79 · 31 · 0 · 16 Jun 2021

Explaining Neural Scaling Laws
Yasaman Bahri, Ethan Dyer, Jared Kaplan, Jaehoon Lee, Utkarsh Sharma
78 · 269 · 0 · 12 Feb 2021

Learning Curve Theory
Marcus Hutter
218 · 64 · 0 · 08 Feb 2021

Geometric compression of invariant manifolds in neural nets
J. Paccolat, Leonardo Petrini, Mario Geiger, Kevin Tyloo, Matthieu Wyart
MLT · 103 · 36 · 0 · 22 Jul 2020

Spectrum Dependent Learning Curves in Kernel Regression and Wide Neural Networks
Blake Bordelon, Abdulkadir Canatar, Cengiz Pehlevan
256 · 208 · 0 · 07 Feb 2020

Scaling Laws for Neural Language Models
Jared Kaplan, Sam McCandlish, T. Henighan, Tom B. Brown, B. Chess, R. Child, Scott Gray, Alec Radford, Jeff Wu, Dario Amodei
611 · 4,921 · 0 · 23 Jan 2020

On Lazy Training in Differentiable Programming
Lénaïc Chizat, Edouard Oyallon, Francis R. Bach
111 · 840 · 0 · 19 Dec 2018

BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding
Jacob Devlin, Ming-Wei Chang, Kenton Lee, Kristina Toutanova
VLM, SSL, SSeg · 1.8K · 95,229 · 0 · 11 Oct 2018

Convolutional Neural Networks for Sentence Classification
Yoon Kim
AILaw, VLM · 641 · 13,432 · 0 · 25 Aug 2014

A Convolutional Neural Network for Modelling Sentences
Nal Kalchbrenner, Edward Grefenstette, Phil Blunsom
109 · 3,559 · 0 · 08 Apr 2014