stanford-cs234-reinforcement-learning-lecture-5

# Value function approximation

So far we represented value function by a **lookup table** (each state has a corresponding entry $V(s)$). Doesn't generalize well to problems with very large state / action spaces.

Can approximate value functions with linear models, neural nets, decision trees, nearest neighbors, fourier / wavelet bases (`look it up`

).

State can be represented by a feature vector and we minimize the mean squared error objective: $J(w) = \mathbb{E}_\pi [(v_\pi(s) - \hat v(s, w))^2]$ via stochastic gradient descent.

where $v_\pi(s)$ is the target value function. If I understand correctly, the expression for this target value function depends on the estimation method used.

For Q learning, $v_\pi(s)$ is replaced with $Q(s,a)$.

## Monte Carlo with VFA

replace our value function with the model. Although returns are unbiased estimates, they are often very noisy.

Converges to the weights with minimum squared error: $MSE(w)=\min_w \mathbb{E}[(v - \hat v)^2] = \sum_{s\in S} p(s)(v^\pi(s) - \hat v^\pi(s, w))^2$ where $p(s)$ is the stationary distribution of states under policy $\pi$.

## Temporal difference with VFA

replace update equation $V^\pi(s)=V^\pi(s) + \alpha(r+\gamma V^\pi(s') -V^\pi(s))$ with $\hat V^\pi (s,w) = \hat V^\pi(s,w) +\alpha (r+\gamma \hat V^\pi (s', w) - \hat V^\pi(s, w))\nabla_w \hat V^\pi(s,w)$

For linear regression, $\nabla_w \hat V^\pi(s,w)=x(s)$, the feature vector of state $s$. Despite biased and approximate estimate of the true value $V^\pi(s)$, linear TD converges to weights within a constant factor of the minimum min squared error: $\frac{1}{1-\gamma}MSE$.

`Try to prove convergence guarantees`

Nonlinear VFA doesn't have convergence guarantees for any of the methods. (`why?`

)

## Control using VFA

define function approximators for action-values and define objective function $J(w)=\mathbb{E}_\pi[(q_\pi(s,a) - \hat q_\pi(s,a, w))^2]$

$\Delta w = dJ(w)/dw = 2 \mathbb{E} [\frac{d\hat q_\pi(s,a)}{dw} (q_\pi(s, a) - \hat q_\pi(s,a,w))]$

**Update for Monte Carlo Methods** (substitute unknown target $q_\pi(s,a)$ return $G_t$): $\Delta w=\alpha(G_t -\hat q(s,a,w))\nabla_w \hat q(s,a,w)$
**Update for SARSA** (substitute target with a TD target): $\Delta w=\alpha (r+\gamma \hat q(s',a',w)-\hat q(s,a,w))\nabla_w \hat q(s,a,w)$
**Update for Q-learning** (substitute target with a max TD target): $\Delta w=\alpha (r+\gamma \max_{a'}\hat q(s',a',w)-\hat q(s,a,w))\nabla_w \hat q(s,a,w)$

`Look up Baird's counterexample as to why convergence is not guaranteed`