Stanford CS234 Reinforcement Learning (Lecture 5)

Value function approximation

So far we have represented the value function by a lookup table (each state has a corresponding entry $V(s)$). This doesn't generalize well to problems with very large state / action spaces.

Can approximate value functions with linear models, neural nets, decision trees, nearest neighbors, or Fourier / wavelet bases (look these up).

State can be represented by a feature vector $x(s)$, and we minimize the mean squared error objective $J(w) = \mathbb{E}_\pi[(v_\pi(s) - \hat v(s, w))^2]$ via stochastic gradient descent,

where $v_\pi(s)$ is the target value function. If I understand correctly, the expression for this target depends on the estimation method used.

For Q-learning, $v_\pi(s)$ is replaced with $Q(s,a)$.
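A minimal sketch of the SGD step above with a linear approximator (the feature vector and target value here are made-up illustrations, not from the lecture):

```python
import numpy as np

# Linear VFA sketch: v_hat(s, w) = w . x(s), trained by SGD on
# (target - v_hat)^2. Feature vector and target are hypothetical.

def v_hat(x_s, w):
    """Approximate value of a state from its feature vector."""
    return np.dot(w, x_s)

def sgd_step(w, x_s, target, alpha=0.01):
    """One stochastic gradient step: w += alpha * (target - v_hat) * grad,
    where grad of v_hat w.r.t. w is just x(s) in the linear case."""
    error = target - v_hat(x_s, w)
    return w + alpha * error * x_s

w = np.zeros(3)
x_s = np.array([1.0, 0.5, -1.0])   # hypothetical features of some state s
for _ in range(500):
    w = sgd_step(w, x_s, target=2.0)
print(v_hat(x_s, w))               # -> close to the target 2.0
```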

Monte Carlo with VFA

Replace the value function with the model, using the return $G_t$ as the target. Although returns are unbiased estimates, they are often very noisy.

Converges to the weights minimizing the mean squared error: $\min_w MSE(w) = \min_w \mathbb{E}_\pi[(v^\pi(s) - \hat v^\pi(s,w))^2] = \min_w \sum_{s\in S} p(s)\,(v^\pi(s) - \hat v^\pi(s, w))^2$, where $p(s)$ is the stationary distribution of states under policy $\pi$.
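A sketch of Monte Carlo with linear VFA on a toy two-state chain (the chain, rewards, and one-hot features are illustrative assumptions; with tabular features the approximation can be exact):

```python
import numpy as np

# Toy chain: state 0 --r=1--> state 1 --r=2--> terminal, gamma = 1,
# so v(0) = 3 and v(1) = 2. One-hot features reduce VFA to tabular.
gamma, alpha = 1.0, 0.1
features = {0: np.array([1.0, 0.0]), 1: np.array([0.0, 1.0])}
w = np.zeros(2)

for _ in range(500):
    episode = [(0, 1.0), (1, 2.0)]      # (state, reward) pairs
    G = 0.0
    for s, r in reversed(episode):      # accumulate returns backwards
        G = r + gamma * G
        # MC update: w += alpha * (G_t - v_hat(s, w)) * x(s)
        w += alpha * (G - w @ features[s]) * features[s]

print(w)  # -> approximately [3., 2.]
```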

Temporal difference with VFA

replace the tabular update $V^\pi(s) \leftarrow V^\pi(s) + \alpha\,(r + \gamma V^\pi(s') - V^\pi(s))$ with a weight update $w \leftarrow w + \alpha\,(r + \gamma \hat V^\pi(s', w) - \hat V^\pi(s, w))\,\nabla_w \hat V^\pi(s, w)$

For a linear approximator, $\nabla_w \hat V^\pi(s,w) = x(s)$, the feature vector of state $s$. Despite using a biased and approximate estimate of the true value $V^\pi(s)$, linear TD converges to weights whose error is within a constant factor $\frac{1}{1-\gamma}$ of the minimum squared error.
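The same toy-chain style sketch for TD(0) with linear VFA (transitions, rewards, and features are again illustrative assumptions):

```python
import numpy as np

# Toy chain: s=0 --r=1--> s=1 --r=2--> terminal, gamma = 0.9.
# True values: v(1) = 2, v(0) = 1 + 0.9 * 2 = 2.8.
gamma, alpha = 0.9, 0.1
x = {0: np.array([1.0, 0.0]), 1: np.array([0.0, 1.0])}
w = np.zeros(2)

def td_update(w, s, r, x_next):
    """w += alpha * (r + gamma * v_hat(s') - v_hat(s)) * x(s)."""
    target = r + gamma * (0.0 if x_next is None else w @ x_next)
    return w + alpha * (target - w @ x[s]) * x[s]

for _ in range(1000):
    w = td_update(w, 0, 1.0, x[1])   # transition out of s=0
    w = td_update(w, 1, 2.0, None)   # transition into the terminal state

print(w)  # -> approximately [2.8, 2.]
```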

Try to prove convergence guarantees

Nonlinear VFA doesn't have convergence guarantees for any of the methods. (why?)

Control using VFA

define function approximators for action-values and define the objective function $J(w) = \mathbb{E}_\pi[(q_\pi(s,a) - \hat q(s,a,w))^2]$

$\Delta w = -\frac{1}{2}\alpha\,\nabla_w J(w) = \alpha\,\mathbb{E}_\pi[(q_\pi(s,a) - \hat q(s,a,w))\,\nabla_w \hat q(s,a,w)]$, since $\nabla_w J(w) = -2\,\mathbb{E}_\pi[(q_\pi(s,a) - \hat q(s,a,w))\,\nabla_w \hat q(s,a,w)]$.

Update for Monte Carlo methods (substitute the unknown target $q_\pi(s,a)$ with the return $G_t$): $\Delta w = \alpha\,(G_t - \hat q(s,a,w))\,\nabla_w \hat q(s,a,w)$

Update for SARSA (substitute the target with a TD target): $\Delta w = \alpha\,(r + \gamma \hat q(s',a',w) - \hat q(s,a,w))\,\nabla_w \hat q(s,a,w)$

Update for Q-learning (substitute the target with a max TD target): $\Delta w = \alpha\,(r + \gamma \max_{a'} \hat q(s',a',w) - \hat q(s,a,w))\,\nabla_w \hat q(s,a,w)$
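A sketch of the Q-learning weight update with a linear $\hat q$ (the two-state, two-action problem and its transitions are hypothetical; one-hot state-action features make this reduce to tabular Q-learning):

```python
import numpy as np

# q_hat(s, a, w) = w . x(s, a) with one-hot state-action features.
n_states, n_actions = 2, 2
gamma, alpha = 0.9, 0.1

def x(s, a):
    """One-hot feature vector for a (state, action) pair."""
    v = np.zeros(n_states * n_actions)
    v[s * n_actions + a] = 1.0
    return v

def q_hat(w, s, a):
    return w @ x(s, a)

def q_learning_update(w, s, a, r, s_next, done):
    """Delta w = alpha * (r + gamma * max_a' q_hat(s',a',w) - q_hat(s,a,w)) * x(s,a)."""
    max_next = 0.0 if done else max(q_hat(w, s_next, b) for b in range(n_actions))
    td_error = r + gamma * max_next - q_hat(w, s, a)
    return w + alpha * td_error * x(s, a)

w = np.zeros(n_states * n_actions)
# Hypothetical transitions: from s=0, a=1 reaches s=1 with r=0;
# from s=1, a=0 terminates with r=1.
for _ in range(500):
    w = q_learning_update(w, 0, 1, 0.0, 1, done=False)
    w = q_learning_update(w, 1, 0, 1.0, None, done=True)

print(q_hat(w, 1, 0), q_hat(w, 0, 1))  # -> approx 1.0 and 0.9 (= gamma * 1.0)
```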

Look up Baird's counterexample for why convergence is not guaranteed (off-policy TD with function approximation can diverge)