In dynamic programming, we bootstrap: we estimate a state's value using our current estimate of the next state's value, $V_{k-1}(s')$, and iterate until convergence.
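The DP iteration above can be sketched in pure Python. This is a minimal toy example, not a general implementation: `P` and `R` are a hypothetical two-state Markov reward process already marginalized over the policy's actions.

```python
def dp_policy_evaluation(P, R, gamma=0.9, tol=1e-8):
    """Iterative policy evaluation on a toy MRP (hypothetical example).

    P[s][s2] : transition probability s -> s2 under the fixed policy
    R[s]     : expected immediate reward in state s
    """
    V = [0.0] * len(R)
    while True:
        # Bootstrap: reuse the current estimate V_{k-1}(s') for next-state values.
        V_new = [R[s] + gamma * sum(P[s][s2] * V[s2] for s2 in range(len(V)))
                 for s in range(len(V))]
        if max(abs(a - b) for a, b in zip(V_new, V)) < tol:
            return V_new
        V = V_new

# Toy 2-state chain: state 0 steps to state 1; state 1 is absorbing with reward 0.
P = [[0.0, 1.0], [0.0, 1.0]]
R = [1.0, 0.0]
V = dp_policy_evaluation(P, R, gamma=0.5)  # converges to [1.0, 0.0]
```

Each sweep replaces $V$ with the one-step lookahead backup; convergence follows from the contraction property of the Bellman operator.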
Monte Carlo policy evaluation works only for episodic environments (episodes must terminate).
Sample many episodes $h_j$, each run until termination.
Go over the episodes and keep track of:
$N(s)$: number of times we visited state $s$
$S(s)$: sum of the returns obtained from state $s$ onward (e.g. for episode $j$, timestep $t$, $S(s_{j,t}) += G_t = \sum_{i=t}^{\lvert h_j\rvert} \gamma^{i-t}r_i$)
For each state $s$, $V(s)=S(s)/N(s)$
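The steps above can be sketched as follows (this is every-visit Monte Carlo; `sample_episode` is an assumed interface returning a list of `(state, reward)` pairs):

```python
from collections import defaultdict

def mc_policy_evaluation(sample_episode, num_episodes=1000, gamma=1.0):
    """Every-visit Monte Carlo policy evaluation (sketch)."""
    N = defaultdict(int)    # N(s): visit counts
    S = defaultdict(float)  # S(s): sum of returns from s onward
    for _ in range(num_episodes):
        episode = sample_episode()
        G = 0.0
        # Walk backwards so G is the discounted return from each step.
        for s, r in reversed(episode):
            G = r + gamma * G
            N[s] += 1
            S[s] += G
    return {s: S[s] / N[s] for s in N}

# Deterministic toy episode: state "a" gives reward 1, then "b" gives 0.
V = mc_policy_evaluation(lambda: [("a", 1.0), ("b", 0.0)], num_episodes=10)
```

Walking the episode backwards makes each return $G_t$ an $O(1)$ update instead of a fresh sum per timestep.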
We can also keep an incremental estimate of $V(s)$:
$V(s) = V(s) + \underbrace{\frac{1}{N(s)}}_{\alpha}(G_t - V(s))$
With $\alpha = 1/N(s)$ this exactly recovers the running average; a constant $\alpha > 1/N(s)$ gives higher weight to newer data, which helps in non-stationary settings.
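A minimal sketch of the incremental update, covering both the running-average step size $1/N(s)$ and a constant $\alpha$ (the function name and dict-based interface are illustrative choices):

```python
def incremental_update(V, N, s, G, alpha=None):
    """One incremental Monte Carlo update of V(s) toward the return G.

    alpha=None uses the running-average step size 1/N(s); a constant
    alpha weights recent returns more (useful when the environment drifts).
    """
    N[s] = N.get(s, 0) + 1
    step = alpha if alpha is not None else 1.0 / N[s]
    V[s] = V.get(s, 0.0) + step * (G - V.get(s, 0.0))

V, N = {}, {}
incremental_update(V, N, "s", 10.0)  # first visit: V(s) = 10.0
incremental_update(V, N, "s", 0.0)   # second visit: step 1/2, V(s) = 5.0
```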
Two variations:
Off-policy: use data collected under one policy to evaluate another.
In costly or high-stakes situations we cannot obtain fresh rollouts of $G_t$ under the new policy, so we reweight old episodes via importance sampling:
$\mathbb{E}_{x\sim p}[f(x)] = \int_x p(x)f(x)\,dx = \int_x q(x)\frac{p(x)}{q(x)}f(x)\,dx = \mathbb{E}_{x\sim q}\!\left[\frac{p(x)}{q(x)}f(x)\right]$
In our case $p = \pi_1$, the new policy we want to evaluate, and $q = \pi_2$, the old policy under which the episodes were generated.
The resulting estimator is $V^{\pi_1}(s) \approx \frac{1}{n}\sum_{j=1}^{n} \frac{\mathbb{P}(h_j \mid \pi_1)}{\mathbb{P}(h_j \mid \pi_2)}\, G(h_j)$, where $G(h_j)$ is the total discounted sum of rewards for history $h_j$. Note that the environment's transition probabilities cancel in the ratio, leaving only a product of action probabilities.
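A sketch of the per-episode importance weight, assuming hypothetical interfaces `pi_new(a, s)` and `pi_old(a, s)` that return action probabilities:

```python
def importance_sampled_return(episode, pi_new, pi_old, gamma=1.0):
    """Weight one episode's return by the likelihood ratio of the two policies.

    episode : [(state, action, reward), ...] collected under pi_old.
    The environment's transition terms cancel, so only the product of
    action-probability ratios remains.
    """
    weight, G = 1.0, 0.0
    for t, (s, a, r) in enumerate(episode):
        weight *= pi_new(a, s) / pi_old(a, s)
        G += (gamma ** t) * r
    return weight * G

# Toy check: the new policy takes the logged action twice as often.
est = importance_sampled_return(
    [("s0", "x", 1.0)],
    pi_new=lambda a, s: 0.5,
    pi_old=lambda a, s: 0.25,
)  # weight 2.0 times return 1.0
```

Averaging this quantity over many logged episodes gives the off-policy estimate of $V^{\pi_1}$; the product of ratios can have very high variance for long episodes.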
Temporal-difference (TD) learning combines bootstrapping (as in DP) with sampling (as in Monte Carlo).
$V^\pi(s_t)\leftarrow V^\pi(s_t) + \alpha (G_t - V^\pi(s_t))$
becomes, after substituting $G_t \approx r_t + \gamma V^\pi(s_{t+1})$,
$V^\pi(s_t)\leftarrow V^\pi(s_t) + \alpha (r_t + \gamma V^\pi(s_{t+1}) - V^\pi(s_t))$
We bootstrap the total returns with the next state's estimate: $r_t + \gamma V^\pi(s_{t+1})$ (called the TD target).
$\delta_t = r_t + \gamma V^\pi(s_{t+1}) - V^\pi(s_t)$ is called the TD error.
Since we don't use the full return $G_t$, we don't have to wait for the episode to finish before updating.
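The TD(0) update fits in a few lines; this sketch uses a dict-based value table (the function name and defaults are illustrative):

```python
def td0_update(V, s, r, s_next, alpha=0.1, gamma=0.99):
    """One TD(0) update: move V(s) toward the TD target r + gamma * V(s').

    Returns the TD error delta_t, which can be applied online after every
    single transition -- no need to wait for the episode to end.
    """
    td_target = r + gamma * V.get(s_next, 0.0)
    td_error = td_target - V.get(s, 0.0)          # delta_t
    V[s] = V.get(s, 0.0) + alpha * td_error
    return td_error

V = {}
delta = td0_update(V, "a", 1.0, "b", alpha=0.5, gamma=1.0)
# target = 1.0 + V("b") = 1.0, delta = 1.0, V("a") = 0.5
```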