Computational-Statistical Tradeoffs at the Next-Token Prediction Barrier: Autoregressive and Imitation Learning under Misspecification

18 February 2025
Dhruv Rohatgi
Adam Block
Audrey Huang
Akshay Krishnamurthy
Dylan J. Foster
arXiv (abs) · PDF · HTML
Abstract

Next-token prediction with the logarithmic loss is a cornerstone of autoregressive sequence modeling, but, in practice, it suffers from error amplification, where errors in the model compound and generation quality degrades as the sequence length $H$ increases. From a theoretical perspective, this phenomenon should not appear in well-specified settings, and, indeed, a growing body of empirical work hypothesizes that misspecification, where the learner is not sufficiently expressive to represent the target distribution, may be the root cause. Under misspecification -- where the goal is to learn as well as the best-in-class model up to a multiplicative approximation factor $C \geq 1$ -- we confirm that $C$ indeed grows with $H$ for next-token prediction, lending theoretical support to this empirical hypothesis. We then ask whether this mode of error amplification is avoidable algorithmically, computationally, or information-theoretically, and uncover inherent computational-statistical tradeoffs. We show: (1) Information-theoretically, one can avoid error amplification and achieve $C = O(1)$. (2) Next-token prediction can be made robust so as to achieve $C = \tilde{O}(H)$, representing moderate error amplification, but this is an inherent barrier: any next-token prediction-style objective must suffer $C = \Omega(H)$. (3) For the natural testbed of autoregressive linear models, no computationally efficient algorithm can achieve a sub-polynomial approximation factor $C = e^{(\log H)^{1-\Omega(1)}}$; however, at least for binary token spaces, one can smoothly trade compute for statistical power and improve on $C = \Omega(H)$ in sub-exponential time. Our results have consequences in the more general setting of imitation learning, where the widely used behavior cloning algorithm generalizes next-token prediction.
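To make the accumulation of next-token-prediction error concrete, below is a minimal toy sketch in Python with NumPy. It is not the paper's construction or its formal definitions: the binary target process, the per-token error level eps, and the helper names (target_next_token_prob, learned_next_token_prob, rollout, log_loss) are all illustrative assumptions. It only shows how a small, fixed per-token discrepancy in a misspecified model translates into an excess logarithmic loss that grows with the horizon H.

import numpy as np

rng = np.random.default_rng(0)
H = 64      # sequence length / horizon
eps = 0.05  # per-token error of the (misspecified) learned model

def target_next_token_prob(prefix):
    # Ground-truth process over binary tokens: the next token repeats the
    # previous one with probability 0.9 (returns P(next token = 1 | prefix)).
    prev = prefix[-1] if prefix else 1
    return 0.9 if prev == 1 else 0.1

def learned_next_token_prob(prefix):
    # Misspecified learner: its conditional is off by eps at every step.
    p = target_next_token_prob(prefix)
    return float(np.clip(p + eps, 0.0, 1.0))

def rollout(next_prob, horizon):
    # Autoregressively sample a length-`horizon` sequence from a model.
    seq = []
    for _ in range(horizon):
        p1 = next_prob(seq)
        seq.append(int(rng.random() < p1))
    return seq

def log_loss(next_prob, seq):
    # Cumulative logarithmic (cross-entropy) loss of a model on one sequence.
    total, prefix = 0.0, []
    for tok in seq:
        p1 = next_prob(prefix)
        total -= np.log(p1 if tok == 1 else 1.0 - p1)
        prefix.append(tok)
    return total

# Excess log loss of the learned model on data from the target process:
# the per-step gap is roughly constant, so the total grows roughly linearly in H.
seqs = [rollout(target_next_token_prob, H) for _ in range(200)]
gap = np.mean([log_loss(learned_next_token_prob, s) - log_loss(target_next_token_prob, s)
               for s in seqs])
print(f"average excess log loss at horizon H={H}: {gap:.2f}")

This toy accumulation of log loss is only meant as intuition for why the horizon matters; the paper's notion of error amplification concerns the multiplicative approximation factor $C$ under misspecification, which is a stronger and more delicate statement.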

@article{rohatgi2025_2502.12465,
  title={Computational-Statistical Tradeoffs at the Next-Token Prediction Barrier: Autoregressive and Imitation Learning under Misspecification},
  author={Dhruv Rohatgi and Adam Block and Audrey Huang and Akshay Krishnamurthy and Dylan J. Foster},
  journal={arXiv preprint arXiv:2502.12465},
  year={2025}
}