
Towards Understanding the Universality of Transformers for Next-Token Prediction

Abstract

Causal Transformers are trained to predict the next token for a given context. While it is widely accepted that self-attention is crucial for encoding the causal structure of sequences, the precise underlying mechanism behind this in-context autoregressive learning ability remains unclear. In this paper, we take a step towards understanding this phenomenon by studying the approximation ability of Transformers for next-token prediction. Specifically, we explore the capacity of causal Transformers to predict the next token $x_{t+1}$ given an autoregressive sequence $(x_1, \dots, x_t)$ as a prompt, where $x_{t+1} = f(x_t)$ and $f$ is a context-dependent function that varies with each sequence. On the theoretical side, we focus on specific instances, namely when $f$ is linear or when $(x_t)_{t \geq 1}$ is periodic. We explicitly construct a Transformer (with linear, exponential, or softmax attention) that learns the mapping $f$ in-context through a causal kernel descent method. The causal kernel descent method we propose provably estimates $x_{t+1}$ based solely on past and current observations $(x_1, \dots, x_t)$, with connections to the Kaczmarz algorithm in Hilbert spaces. We present experimental results that validate our theoretical findings and suggest their applicability to more general mappings $f$.
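
To make the linear setting concrete, the following is a minimal numerical sketch (not the paper's Transformer construction): it estimates a context-dependent linear map W from an autoregressive prompt x_{i+1} = W x_i via Kaczmarz-style updates that use only past and current observations, then predicts the next token. The choice of a random orthogonal W and the specific dimensions are illustrative assumptions.

import numpy as np

# Sketch of a causal, Kaczmarz-style estimation of a linear map from a prompt.
# Assumptions (not from the paper): W is a random orthogonal matrix, d = 4, t = 50.

rng = np.random.default_rng(0)
d, t = 4, 50

# Context-dependent linear map (random orthogonal, to keep the prompt well conditioned).
W, _ = np.linalg.qr(rng.standard_normal((d, d)))

# Autoregressive prompt (x_1, ..., x_t) with x_{i+1} = W x_i.
xs = [rng.standard_normal(d)]
for _ in range(t - 1):
    xs.append(W @ xs[-1])

# Causal Kaczmarz-style descent: after observing the pair (x_i, x_{i+1}),
# project the running estimate onto the affine set {M : M x_i = x_{i+1}}.
# Only past and current observations are used, mirroring causal attention.
W_hat = np.zeros((d, d))
for i in range(t - 1):
    x, y = xs[i], xs[i + 1]
    W_hat += np.outer(y - W_hat @ x, x) / (x @ x)

# Next-token prediction from the in-context estimate.
x_pred = W_hat @ xs[-1]
x_true = W @ xs[-1]
print("relative prediction error:", np.linalg.norm(x_pred - x_true) / np.linalg.norm(x_true))

The relative prediction error decreases as the prompt length t grows, which is the qualitative behaviour the paper's theoretical results concern; the actual construction realizes such a descent inside the attention layers.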

@article{sander2025_2410.03011,
  title={Towards Understanding the Universality of Transformers for Next-Token Prediction},
  author={Michael E. Sander and Gabriel Peyré},
  journal={arXiv preprint arXiv:2410.03011},
  year={2025}
}