Stanford CS234 Reinforcement Learning (Lecture 5)

Value function approximation

So far we have represented the value function by a lookup table (each state has a corresponding entry $V(s)$). This doesn't generalize well to problems with very large state / action spaces.

Can approximate value functions with linear models, neural nets, decision trees, nearest neighbors, or Fourier / wavelet bases (look these up).

State can be represented by a feature vector $x(s)$, and we minimize the mean squared error objective $J(w) = \mathbb{E}_\pi[(v_\pi(s) - \hat v(s, w))^2]$ via stochastic gradient descent,

where $v_\pi(s)$ is the target value function. If I understand correctly, the expression for this target depends on the estimation method used.

For Q-learning, $v_\pi(s)$ is replaced with $Q(s,a)$.
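A minimal sketch of the SGD step above with a linear approximator (the feature vector and target value here are made-up illustrations, not from the lecture):

```python
import numpy as np

# Linear VFA sketch: v_hat(s, w) = w . x(s), trained by SGD on
# (target - v_hat)^2. Feature vector and target are hypothetical.

def v_hat(x_s, w):
    """Approximate value of a state from its feature vector."""
    return np.dot(w, x_s)

def sgd_step(w, x_s, target, alpha=0.01):
    """One stochastic gradient step: w += alpha * (target - v_hat) * grad,
    where grad of v_hat w.r.t. w is just x(s) in the linear case."""
    error = target - v_hat(x_s, w)
    return w + alpha * error * x_s

w = np.zeros(3)
x_s = np.array([1.0, 0.5, -1.0])   # hypothetical features of some state s
for _ in range(500):
    w = sgd_step(w, x_s, target=2.0)
print(v_hat(x_s, w))               # -> close to the target 2.0
```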

Monte Carlo with VFA

Replace the value function with the model, using the return $G_t$ as the target. Although returns are unbiased estimates, they are often very noisy.

Converges to the weights minimizing the mean squared error: $\min_w MSE(w) = \min_w \mathbb{E}_\pi[(v^\pi(s) - \hat v^\pi(s,w))^2] = \min_w \sum_{s\in S} p(s)\,(v^\pi(s) - \hat v^\pi(s, w))^2$, where $p(s)$ is the stationary distribution of states under policy $\pi$.
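A sketch of Monte Carlo with linear VFA on a toy two-state chain (the chain, rewards, and one-hot features are illustrative assumptions; with tabular features the approximation can be exact):

```python
import numpy as np

# Toy chain: state 0 --r=1--> state 1 --r=2--> terminal, gamma = 1,
# so v(0) = 3 and v(1) = 2. One-hot features reduce VFA to tabular.
gamma, alpha = 1.0, 0.1
features = {0: np.array([1.0, 0.0]), 1: np.array([0.0, 1.0])}
w = np.zeros(2)

for _ in range(500):
    episode = [(0, 1.0), (1, 2.0)]      # (state, reward) pairs
    G = 0.0
    for s, r in reversed(episode):      # accumulate returns backwards
        G = r + gamma * G
        # MC update: w += alpha * (G_t - v_hat(s, w)) * x(s)
        w += alpha * (G - w @ features[s]) * features[s]

print(w)  # -> approximately [3., 2.]
```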

Temporal difference with VFA

replace the tabular update $V^\pi(s) \leftarrow V^\pi(s) + \alpha\,(r + \gamma V^\pi(s') - V^\pi(s))$ with a weight update $w \leftarrow w + \alpha\,(r + \gamma \hat V^\pi(s', w) - \hat V^\pi(s, w))\,\nabla_w \hat V^\pi(s, w)$

For a linear approximator, $\nabla_w \hat V^\pi(s,w) = x(s)$, the feature vector of state $s$. Despite using a biased and approximate estimate of the true value $V^\pi(s)$, linear TD converges to weights whose error is within a constant factor $\frac{1}{1-\gamma}$ of the minimum squared error.
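The same toy-chain style sketch for TD(0) with linear VFA (transitions, rewards, and features are again illustrative assumptions):

```python
import numpy as np

# Toy chain: s=0 --r=1--> s=1 --r=2--> terminal, gamma = 0.9.
# True values: v(1) = 2, v(0) = 1 + 0.9 * 2 = 2.8.
gamma, alpha = 0.9, 0.1
x = {0: np.array([1.0, 0.0]), 1: np.array([0.0, 1.0])}
w = np.zeros(2)

def td_update(w, s, r, x_next):
    """w += alpha * (r + gamma * v_hat(s') - v_hat(s)) * x(s)."""
    target = r + gamma * (0.0 if x_next is None else w @ x_next)
    return w + alpha * (target - w @ x[s]) * x[s]

for _ in range(1000):
    w = td_update(w, 0, 1.0, x[1])   # transition out of s=0
    w = td_update(w, 1, 2.0, None)   # transition into the terminal state

print(w)  # -> approximately [2.8, 2.]
```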

Try to prove convergence guarantees

Nonlinear VFA doesn't have convergence guarantees for any of the methods. (why?)

Control using VFA

define function approximators for action-values and define the objective function $J(w) = \mathbb{E}_\pi[(q_\pi(s,a) - \hat q(s,a,w))^2]$

$\Delta w = -\frac{1}{2}\alpha\,\nabla_w J(w) = \alpha\,\mathbb{E}_\pi[(q_\pi(s,a) - \hat q(s,a,w))\,\nabla_w \hat q(s,a,w)]$, since $\nabla_w J(w) = -2\,\mathbb{E}_\pi[(q_\pi(s,a) - \hat q(s,a,w))\,\nabla_w \hat q(s,a,w)]$.

Update for Monte Carlo methods (substitute the unknown target $q_\pi(s,a)$ with the return $G_t$): $\Delta w = \alpha\,(G_t - \hat q(s,a,w))\,\nabla_w \hat q(s,a,w)$

Update for SARSA (substitute the target with a TD target): $\Delta w = \alpha\,(r + \gamma \hat q(s',a',w) - \hat q(s,a,w))\,\nabla_w \hat q(s,a,w)$

Update for Q-learning (substitute the target with a max TD target): $\Delta w = \alpha\,(r + \gamma \max_{a'} \hat q(s',a',w) - \hat q(s,a,w))\,\nabla_w \hat q(s,a,w)$
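A sketch of the Q-learning weight update with a linear $\hat q$ (the two-state, two-action problem and its transitions are hypothetical; one-hot state-action features make this reduce to tabular Q-learning):

```python
import numpy as np

# q_hat(s, a, w) = w . x(s, a) with one-hot state-action features.
n_states, n_actions = 2, 2
gamma, alpha = 0.9, 0.1

def x(s, a):
    """One-hot feature vector for a (state, action) pair."""
    v = np.zeros(n_states * n_actions)
    v[s * n_actions + a] = 1.0
    return v

def q_hat(w, s, a):
    return w @ x(s, a)

def q_learning_update(w, s, a, r, s_next, done):
    """Delta w = alpha * (r + gamma * max_a' q_hat(s',a',w) - q_hat(s,a,w)) * x(s,a)."""
    max_next = 0.0 if done else max(q_hat(w, s_next, b) for b in range(n_actions))
    td_error = r + gamma * max_next - q_hat(w, s, a)
    return w + alpha * td_error * x(s, a)

w = np.zeros(n_states * n_actions)
# Hypothetical transitions: from s=0, a=1 reaches s=1 with r=0;
# from s=1, a=0 terminates with r=1.
for _ in range(500):
    w = q_learning_update(w, 0, 1, 0.0, 1, done=False)
    w = q_learning_update(w, 1, 0, 1.0, None, done=True)

print(q_hat(w, 1, 0), q_hat(w, 0, 1))  # -> approx 1.0 and 0.9 (= gamma * 1.0)
```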

Look up Baird's counterexample for why convergence is not guaranteed (off-policy TD with function approximation can diverge)