Abstract
TD(λ) with function approximation has proved empirically successful on some complex reinforcement learning problems. With linear approximation, TD(λ) has been shown to minimise the squared error between the approximate and true value of each state. As far as policy is concerned, however, it is error in the relative ordering of states, rather than error in the state values themselves, that is critical. We illustrate this point both in simple two-state and three-state systems, in which TD(λ), starting from an optimal policy, converges to a sub-optimal policy, and in backgammon. We then present a modified form of TD(λ), called STD(λ), in which function approximators are trained with respect to relative state values on binary decision problems. A theoretical analysis, including a proof of monotonic policy improvement for STD(λ) in the context of the two-state system, is presented, along with a comparison with Bertsekas' differential training method [1]. This is followed by successful demonstrations of STD(λ) on the two-state system and on a variation of the well-known acrobot problem.
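The abstract's core idea, training an approximator on the *relative* values of the two candidate states in a binary decision rather than on their absolute values, can be sketched as follows. This is a minimal illustrative sketch, not the paper's exact STD(λ) algorithm: the linear feature map, the logistic model of the pairwise preference, and the learning rate are all assumptions made here for demonstration.

```python
import numpy as np

rng = np.random.default_rng(0)
n_features = 4
w = np.zeros(n_features)


def value(phi, w):
    """Linear value estimate for a state's feature vector."""
    return phi @ w


def pairwise_update(phi_a, phi_b, prefer_a, w, lr=0.1):
    """One gradient step pushing V(a) - V(b) toward the preferred ordering.

    The preference between the two successor states is modelled with a
    logistic on the value difference, so only the *ordering* of the pair
    matters -- absolute value errors that preserve the ordering are ignored.
    (Assumed formulation for illustration, not the authors' exact update.)
    """
    diff = value(phi_a, w) - value(phi_b, w)
    p_a = 1.0 / (1.0 + np.exp(-diff))       # model's P(a preferred over b)
    target = 1.0 if prefer_a else 0.0
    grad = (target - p_a) * (phi_a - phi_b)  # logistic-loss gradient
    return w + lr * grad


# Toy binary decisions: 'a'-style states should outrank 'b'-style states.
for _ in range(500):
    phi_a = rng.normal(size=n_features) + 0.5
    phi_b = rng.normal(size=n_features) - 0.5
    w = pairwise_update(phi_a, phi_b, True, w)

# After training, the learned values respect the intended ordering.
print(value(np.ones(n_features), w) > value(-np.ones(n_features), w))
```

Because the loss depends only on the value *difference*, a policy that always picks the higher-valued successor is trained directly on the quantity that determines its decisions, which is the distinction the abstract draws between value error and ordering error.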
Publication Info
- Year: 2025
- Type: article
- Citations: 7
- Access: Closed
Identifiers
- DOI: 10.48550/arxiv.2512.08855