
Why do modern language models, trained to do well on next-word prediction, appear to generate coherent documents and capture long-range structure? Here we show that next-token prediction is provably powerful for learning longer-range structure, even with common neural network architectures. Specifically, we prove that optimizing next-token prediction over a Recurrent Neural Network (RNN) yields a model that closely approximates the training distribution: for held-out documents sampled from the training distribution, no algorithm of bounded description length limited to examining the next $k$ tokens, for any $k$, can distinguish between $k$ consecutive tokens of such documents and $k$ tokens generated by the learned language model following the same prefix. We provide polynomial bounds (in $k$, independent of the document length) on the model size needed to achieve such $k$-token indistinguishability, offering a complexity-theoretic explanation for the long-range coherence observed in practice.
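For intuition, the $k$-token indistinguishability guarantee can be read roughly as follows. This is an illustrative sketch only: the symbols $P$ (training distribution over documents), $Q$ (learned language model), $D$ (distinguisher), $s$ (description-length bound), $t$ (prefix length), and $\epsilon$ are placeholder notation introduced here for exposition, not taken from the paper.

% Illustrative formalization in placeholder notation, not the paper's exact statement:
% every distinguisher D of description length at most s, given a prefix x_{<= t} drawn
% from the training distribution P, accepts a real k-token continuation and a k-token
% continuation sampled from the learned model Q with nearly the same probability.
\[
  \Bigl|\,
    \Pr_{x \sim P}\bigl[\, D\bigl(x_{\le t},\, x_{t+1:t+k}\bigr) = 1 \,\bigr]
    \;-\;
    \Pr_{x \sim P,\; y \sim Q(\cdot \mid x_{\le t})}\bigl[\, D\bigl(x_{\le t},\, y_{1:k}\bigr) = 1 \,\bigr]
  \,\Bigr| \;\le\; \epsilon .
\]

Under this reading, the abstract's claim is that a model whose size is polynomial in $k$ (and independent of the document length) suffices to achieve such a guarantee on held-out documents.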