  3. 1612.02526

Prediction with a Short Memory

8 December 2016
Vatsal Sharan
Sham Kakade
Percy Liang
Gregory Valiant
Abstract

We consider the problem of predicting the next observation given a sequence of past observations, and consider the extent to which accurate prediction requires complex algorithms that explicitly leverage long-range dependencies. Perhaps surprisingly, our positive results show that for a broad class of sequences, there is an algorithm that predicts well on average, and bases its predictions only on the most recent few observations, together with a set of simple summary statistics of the past observations. Specifically, we show that for any distribution over observations, if the mutual information between past observations and future observations is upper bounded by $I$, then a simple Markov model over the most recent $I/\epsilon$ observations obtains expected KL error $\epsilon$ (and hence $\ell_1$ error $\sqrt{\epsilon}$) with respect to the optimal predictor that has access to the entire past and knows the data generating distribution. For a Hidden Markov Model with $n$ hidden states, $I$ is bounded by $\log n$, a quantity that does not depend on the mixing time, and we show that the trivial prediction algorithm based on the empirical frequencies of length $O(\log n/\epsilon)$ windows of observations achieves this error, provided the length of the sequence is $d^{\Omega(\log n/\epsilon)}$, where $d$ is the size of the observation alphabet. We also establish that this result cannot be improved upon, even for the class of HMMs, in the following two senses: First, for HMMs with $n$ hidden states, a window length of $\log n/\epsilon$ is information-theoretically necessary to achieve expected $\ell_1$ error $\sqrt{\epsilon}$. Second, the $d^{\Theta(\log n/\epsilon)}$ samples required to estimate the Markov model for an observation alphabet of size $d$ are necessary for any computationally tractable learning algorithm, assuming the hardness of strongly refuting a certain class of CSPs.
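
The "trivial prediction algorithm" referred to above is, concretely, an order-$\ell$ Markov model whose conditional probabilities are the empirical frequencies of length-$\ell$ windows, with $\ell$ on the order of $\log n/\epsilon$ for an HMM with $n$ hidden states. The following Python sketch illustrates that idea under stated assumptions: the function name window_markov_predictor, the online counting scheme, and the Laplace smoothing are illustrative choices, not code from the paper.

    from collections import Counter, defaultdict

    def window_markov_predictor(sequence, window_len, alphabet):
        """Sketch of a length-`window_len` window, empirical-frequency Markov predictor.

        For an HMM with n hidden states and target KL error eps, the abstract's
        guarantee corresponds to window_len on the order of log(n)/eps.
        """
        counts = defaultdict(Counter)  # context (last window_len symbols) -> next-symbol counts
        predictions = []
        for t, symbol in enumerate(sequence):
            context = tuple(sequence[max(0, t - window_len):t])
            seen = counts[context]
            total = sum(seen.values())
            # Laplace-smoothed empirical conditional distribution over the next symbol
            dist = {a: (seen[a] + 1) / (total + len(alphabet)) for a in alphabet}
            predictions.append(dist)
            counts[context][symbol] += 1  # reveal the true symbol and update the counts
        return predictions

    # Example: predict the next bit of an alternating binary sequence with windows of length 2.
    preds = window_markov_predictor([0, 1, 0, 1, 0, 1, 0, 1], window_len=2, alphabet=[0, 1])
    print(preds[-1])  # assigns higher probability to 1, since the context (1, 0) has always been followed by 1

The point of the sketch is that nothing beyond windowed counts is needed: the predictor never models the hidden state, yet the paper's guarantee says such empirical frequencies suffice once the sequence is long enough.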
