Abstract
TD(λ) with function approximation has proved empirically successful on some complex reinforcement learning problems. With linear approximation, TD(λ) has been shown to minimise the squared error between the approximate and true value of each state. As far as policy is concerned, however, it is error in the relative ordering of states, rather than error in the state values themselves, that is critical. We illustrate this point both in simple two-state and three-state systems, in which TD(λ), starting from an optimal policy, converges to a sub-optimal policy, and in backgammon. We then present a modified form of TD(λ), called STD(λ), in which function approximators are trained with respect to relative state values on binary decision problems. A theoretical analysis, including a proof of monotonic policy improvement for STD(λ) in the context of the two-state system, is presented, along with a comparison with Bertsekas' differential training method [1]. This is followed by successful demonstrations of STD(λ) on the two-state system and on a variation of the well-known acrobot problem.
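The abstract's core idea, training an approximator on the *relative* values of the two candidate states in a binary decision rather than on their absolute values, can be sketched as follows. This is a minimal illustrative sketch, not the paper's exact STD(λ) algorithm: the linear feature map, the logistic model of the pairwise preference, and the learning rate are all assumptions made here for demonstration.

```python
import numpy as np

rng = np.random.default_rng(0)
n_features = 4
w = np.zeros(n_features)


def value(phi, w):
    """Linear value estimate for a state's feature vector."""
    return phi @ w


def pairwise_update(phi_a, phi_b, prefer_a, w, lr=0.1):
    """One gradient step pushing V(a) - V(b) toward the preferred ordering.

    The preference between the two successor states is modelled with a
    logistic on the value difference, so only the *ordering* of the pair
    matters -- absolute value errors that preserve the ordering are ignored.
    (Assumed formulation for illustration, not the authors' exact update.)
    """
    diff = value(phi_a, w) - value(phi_b, w)
    p_a = 1.0 / (1.0 + np.exp(-diff))       # model's P(a preferred over b)
    target = 1.0 if prefer_a else 0.0
    grad = (target - p_a) * (phi_a - phi_b)  # logistic-loss gradient
    return w + lr * grad


# Toy binary decisions: 'a'-style states should outrank 'b'-style states.
for _ in range(500):
    phi_a = rng.normal(size=n_features) + 0.5
    phi_b = rng.normal(size=n_features) - 0.5
    w = pairwise_update(phi_a, phi_b, True, w)

# After training, the learned values respect the intended ordering.
print(value(np.ones(n_features), w) > value(-np.ones(n_features), w))
```

Because the loss depends only on the value *difference*, a policy that always picks the higher-valued successor is trained directly on the quantity that determines its decisions, which is the distinction the abstract draws between value error and ordering error.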
Publication Info
- Year: 2025
- Type: article
- Citations: 7
- Access: Closed
Identifiers
- DOI: 10.48550/arxiv.2512.08855