Approximate Temporal Difference Learning is a Gradient Descent for Reversible Policies

2 May 2018
Yann Ollivier
arXiv:1805.00869
Abstract

In reinforcement learning, temporal difference (TD) is the most direct algorithm to learn the value function of a policy. For large or infinite state spaces, exact representations of the value function are usually not available, and it must be approximated by a function in some parametric family. However, with \emph{nonlinear} parametric approximations (such as neural networks), TD is not guaranteed to converge to a good approximation of the true value function within the family, and is known to diverge even in relatively simple cases. TD lacks an interpretation as a stochastic gradient descent of an error between the true and approximate value functions, which would provide such guarantees. We prove that approximate TD is a gradient descent provided the current policy is \emph{reversible}. This holds even with nonlinear approximations. A policy with transition probabilities $P(s,s')$ between states is reversible if there exists a function $\mu$ over states such that $\frac{P(s,s')}{P(s',s)}=\frac{\mu(s')}{\mu(s)}$. In particular, every move can be undone with some probability. This condition is restrictive; it is satisfied, for instance, for a navigation problem in any unoriented graph. In this case, approximate TD is exactly a gradient descent of the \emph{Dirichlet norm}, the norm of the difference of \emph{gradients} between the true and approximate value functions. The Dirichlet norm also controls the bias of approximate policy gradient. These results hold even with no decay factor ($\gamma=1$) and do not rely on contractivity of the Bellman operator, thus proving stability of TD even with $\gamma=1$ for reversible policies.
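The abstract invokes the Dirichlet norm without spelling it out; as a point of reference (a standard definition, not quoted from the paper), for a function $f$ over states, a stationary distribution $\mu$, and transition probabilities $P$, the Dirichlet (semi-)norm is commonly written

\[ \|f\|_{\mathrm{Dir}}^2 \;=\; \tfrac{1}{2}\sum_{s,s'} \mu(s)\,P(s,s')\,\bigl(f(s')-f(s)\bigr)^2 , \]

i.e. a $\mu$-weighted average of the squared differences of $f$ along transitions. Applied to $f = \hat V_\theta - V^\pi$, it penalizes mismatches in value \emph{differences} (discrete gradients) rather than in the values themselves.

The short sketch below (not the paper's code) illustrates the two ingredients the abstract combines: a reversible policy, here a random walk on an unoriented path graph, where detailed balance $\mu(s)P(s,s') = \mu(s')P(s',s)$ holds with $\mu$ proportional to node degrees, and a plain TD(0) update with a small nonlinear parametric value function. The parametric family, rewards, step size, and discount $\gamma = 0.95$ are hypothetical choices made only for illustration; the paper's result also covers $\gamma = 1$.

import numpy as np

# Random walk on an unoriented path graph with 4 nodes: from each state,
# jump to a uniformly chosen neighbour.
A = np.array([[0, 1, 0, 0],
              [1, 0, 1, 0],
              [0, 1, 0, 1],
              [0, 0, 1, 0]], dtype=float)   # adjacency matrix
deg = A.sum(axis=1)
P = A / deg[:, None]                        # transition probabilities P(s, s')

# Reversibility (detailed balance): mu(s) P(s, s') = mu(s') P(s', s),
# equivalently P(s, s') / P(s', s) = mu(s') / mu(s), with mu ~ node degrees.
mu = deg / deg.sum()
assert np.allclose(mu[:, None] * P, (mu[:, None] * P).T)

# Nonlinear parametric value function (hypothetical family, for illustration):
# V_theta(s) = theta[2] + theta[1] * tanh(theta[0] * s).
def V(theta, s):
    return theta[2] + theta[1] * np.tanh(theta[0] * s)

def grad_V(theta, s):
    t = np.tanh(theta[0] * s)
    return np.array([theta[1] * (1.0 - t**2) * s, t, 1.0])

rng = np.random.default_rng(0)
theta = np.array([0.1, 0.1, 0.0])
gamma, alpha = 0.95, 0.05                   # hypothetical discount and step size
reward = np.array([0.0, 0.0, 0.0, 1.0])     # hypothetical per-state reward

s = 0
for _ in range(20_000):
    s_next = rng.choice(4, p=P[s])
    delta = reward[s] + gamma * V(theta, s_next) - V(theta, s)  # TD error
    theta = theta + alpha * delta * grad_V(theta, s)            # TD(0) update
    s = s_next

print("learned parameters:", theta)
print("approximate values:", [round(V(theta, s), 3) for s in range(4)])

Because the random walk satisfies detailed balance, this is exactly the reversible setting in which the paper identifies the TD(0) update with a stochastic gradient step on a Dirichlet-type objective.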
