In our case $p = \pi_1$, the new policy we want to evaluate, and $q = \pi_2$, the old policy under which the episodes were generated.
where $G(h_j)$ is the total discounted sum of rewards for history $h_j$.
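A minimal sketch of this ordinary importance-sampling estimator, assuming episodes are stored as lists of (state, action, reward) triples and that `pi_new(s, a)` and `pi_old(s, a)` return the action probabilities under π1 and π2 respectively (these names and the interface are illustrative, not from the notes):

```python
def is_weighted_return(history, pi_new, pi_old, gamma=0.99):
    """Importance-weighted discounted return for one history h_j generated under pi_old."""
    rho, G = 1.0, 0.0
    for t, (s, a, r) in enumerate(history):
        rho *= pi_new(s, a) / pi_old(s, a)  # likelihood ratio p/q, accumulated over the episode
        G += gamma**t * r                   # G(h_j): total discounted sum of rewards
    return rho * G

def off_policy_mc_value(histories, pi_new, pi_old, gamma=0.99):
    """Ordinary importance-sampling estimate of the new policy's value: average of weighted returns."""
    return sum(is_weighted_return(h, pi_new, pi_old, gamma) for h in histories) / len(histories)
```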
Temporal Difference (TD) learning
Combines bootstrapping (DP) and sampling (Monte Carlo).
$$\begin{aligned}
V^\pi(s_t) &\leftarrow V^\pi(s_t) + \alpha\big(G_t - V^\pi(s_t)\big) \\
&= V^\pi(s_t) + \alpha\big(r_t + \gamma V^\pi(s_{t+1}) - V^\pi(s_t)\big)
\end{aligned}$$
We bootstrap the total return with the next state's estimate: $r_t + \gamma V^\pi(s_{t+1})$ (called the TD target).
$\delta_t = r_t + \gamma V^\pi(s_{t+1}) - V^\pi(s_t)$ is called the TD error.
Since we don't use the total sum of discounted rewards $G_t$, we don't have to wait for the episode to finish.
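A minimal tabular TD(0) sketch of this update, assuming an old-style Gym interface where `env.reset()` returns the initial state and `env.step(a)` returns `(next_state, reward, done, info)`, and a `policy` that maps a state to an action (both are assumptions, not part of the notes):

```python
from collections import defaultdict

def td0_policy_evaluation(env, policy, num_episodes=1000, alpha=0.1, gamma=0.99):
    """Tabular TD(0): update V(s_t) after every step, no need to wait for the episode to end."""
    V = defaultdict(float)
    for _ in range(num_episodes):
        s, done = env.reset(), False
        while not done:
            a = policy(s)
            s_next, r, done, _ = env.step(a)
            td_target = r + gamma * V[s_next] * (not done)  # bootstrap with the next state's estimate
            V[s] += alpha * (td_target - V[s])               # step size alpha times the TD error
            s = s_next
    return V
```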
The reason we are able to back up over a single timestep in DP is the Markov assumption. In non-Markovian settings we can instead use incremental Monte-Carlo with a step size $\alpha > 1/N$, as sketched below.
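A sketch of that incremental Monte-Carlo update (a toy helper with illustrative names, not from the notes): a step size of 1/N(s) reproduces the exact running mean of returns, while a constant $\alpha > 1/N(s)$ down-weights old episodes, which is what helps when the Markov assumption does not hold.

```python
def incremental_mc_update(V, N, s, G, alpha=None):
    """One incremental Monte-Carlo update of V[s] towards the observed return G.

    alpha=None            -> step size 1/N(s), i.e. the exact running mean of returns;
    constant alpha > 1/N  -> recent returns weigh more, old ones are gradually forgotten.
    """
    N[s] += 1
    step = alpha if alpha is not None else 1.0 / N[s]
    V[s] += step * (G - V[s])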
We proved that DP converges to the true value function by showing that the Bellman backup operator is a contraction.
In TD learning, we bootstrap with the next state's value to estimate the current state's value, so the estimate is biased towards the current estimate of the next state's value.
TD learning is also less data-efficient than Monte-Carlo. Since the information only comes from the next state (vs. the total discounted sum of rewards of the entire episode), we need to experience an episode L times (L being the length of the episode) to propagate the full information back to the starting state (especially impactful if a high reward is experienced at the end).
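To make the L-passes claim concrete, here is a toy sketch (my own example, with illustrative names): TD(0) replaying the exact same episode through a deterministic chain with a single terminal reward only propagates that reward back one state per pass.

```python
def td_replay_chain(L=5, passes=1, alpha=1.0, gamma=1.0):
    """Replay the episode s_0 -> s_1 -> ... -> s_L (reward 1 on the final step)
    `passes` times with TD(0) and return the value estimates V[s_0..s_L]."""
    V = [0.0] * (L + 1)  # V[L] is the terminal state and stays 0
    for _ in range(passes):
        for t in range(L):  # sweep forward, in episode order
            r = 1.0 if t == L - 1 else 0.0
            V[t] += alpha * (r + gamma * V[t + 1] - V[t])
    return V

print(td_replay_chain(passes=1))  # [0.0, 0.0, 0.0, 0.0, 1.0, 0.0] -- reward only reaches s_4
print(td_replay_chain(passes=5))  # [1.0, 1.0, 1.0, 1.0, 1.0, 0.0] -- L passes to reach s_0
```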
Batch Monte-Carlo and TD
Monte-Carlo batch setting: the value estimate converges to the value that minimizes the mean squared error on the observed returns (since the minimizer of $\sum_i (y_i - \hat{y})^2$ is the sample mean).
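Spelling out that one-line minimization:

$$\frac{d}{d\hat{y}} \sum_{i=1}^{n} (y_i - \hat{y})^2 = -2 \sum_{i=1}^{n} (y_i - \hat{y}) = 0 \;\Longrightarrow\; \hat{y} = \frac{1}{n} \sum_{i=1}^{n} y_i,$$

i.e. batch Monte-Carlo converges to the empirical mean of the observed returns from each state.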
In the batch setting, TD does not converge to the same value: