stanford-cs234-reinforcement-learning-lecture-5
Value function approximation
So far we have represented the value function as a lookup table (each state has a corresponding entry V(s)). This doesn't generalize well to problems with very large state / action spaces.
Can approximate value functions with linear models, neural nets, decision trees, nearest neighbors, or Fourier / wavelet bases (look it up).
The state can be represented by a feature vector $x(s)$, and we minimize the mean squared error objective $J(w) = \mathbb{E}_\pi[(v_\pi(s) - \hat{v}(s, w))^2]$ via stochastic gradient descent.
where $v_\pi(s)$ is the target value function. If I understand correctly, the expression for this target value function depends on the estimation method used.
For Q-learning, $v_\pi(s)$ is replaced with $Q(s, a)$.
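As a concrete illustration, here is a minimal sketch of one stochastic gradient step for a linear approximator $\hat{v}(s, w) = w^\top x(s)$. The function name, the feature vector `x_s`, the target `v_target`, and the step size are placeholder assumptions, not something from the lecture.

```python
import numpy as np

def sgd_step(w, x_s, v_target, alpha=0.01):
    """One stochastic gradient step on J(w) = E[(v_target - w^T x(s))^2].

    w        : weight vector, shape (d,)
    x_s      : feature vector x(s) of the sampled state, shape (d,)
    v_target : scalar target for this state (a return or TD target)
    alpha    : step size
    """
    v_hat = w @ x_s                   # current estimate v_hat(s, w)
    error = v_target - v_hat          # prediction error
    return w + alpha * error * x_s    # gradient of v_hat w.r.t. w is x(s)

# example usage with made-up numbers
w = sgd_step(np.zeros(4), np.array([1.0, 0.0, 0.5, 0.0]), v_target=2.0)
```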
Monte Carlo with VFA
Replace the tabular value function with the model and use the return $G_t$ as the target. Although returns are unbiased estimates of $v_\pi(s)$, they are often very noisy.
Converges to the weights with minimum mean squared error: $\min_w \mathrm{MSE}(w) = \min_w \sum_{s \in S} p(s)\,(v_\pi(s) - \hat{v}_\pi(s, w))^2$, where $p(s)$ is the stationary distribution of states under policy $\pi$.
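A rough sketch of the Monte Carlo version with the linear approximator, assuming each episode is given as a list of (feature vector, reward) pairs; the returns $G_t$ are accumulated backwards and used as targets. Names and the episode format are illustrative assumptions.

```python
def mc_update(w, episode, gamma=0.99, alpha=0.01):
    """Every-visit Monte Carlo prediction with a linear approximator.

    episode : list of (x_s, reward) pairs, where x_s = x(s_t) and
              reward is the reward received after leaving s_t
    """
    G = 0.0
    # iterate backwards so G accumulates the discounted return G_t
    for x_s, reward in reversed(episode):
        G = reward + gamma * G
        w = w + alpha * (G - w @ x_s) * x_s   # target is the unbiased but noisy return
    return w
```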
Temporal difference with VFA
Replace the tabular update $V^\pi(s) \leftarrow V^\pi(s) + \alpha\,(r + \gamma V^\pi(s') - V^\pi(s))$ with the weight update $w \leftarrow w + \alpha\,(r + \gamma \hat{V}^\pi(s', w) - \hat{V}^\pi(s, w))\,\nabla_w \hat{V}^\pi(s, w)$.
For a linear approximator, $\nabla_w \hat{V}^\pi(s, w) = x(s)$, the feature vector of state $s$. Despite using a biased, approximate estimate of the true value $V^\pi(s)$, linear TD(0) converges to weights whose error is within a constant factor of the minimum mean squared error: $\mathrm{MSE}(w_{TD}) \le \frac{1}{1-\gamma}\min_w \mathrm{MSE}(w)$.
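A minimal sketch of one semi-gradient TD(0) step with the linear form above; the variable names and the terminal-state handling (`done`) are assumptions for illustration.

```python
def td0_update(w, x_s, reward, x_s_next, done, gamma=0.99, alpha=0.01):
    """One semi-gradient TD(0) step with a linear approximator."""
    v_hat = w @ x_s
    v_hat_next = 0.0 if done else w @ x_s_next
    td_target = reward + gamma * v_hat_next         # biased, lower-variance target
    return w + alpha * (td_target - v_hat) * x_s    # grad_w V_hat(s, w) = x(s)
```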
Try to prove convergence guarantees
Nonlinear VFA doesn't have convergence guarantees for any of these methods. (why?)
Control using VFA
Define a function approximator for action-values and the objective function $J(w) = \mathbb{E}_\pi[(q_\pi(s,a) - \hat{q}_\pi(s,a,w))^2]$.
The gradient is $\nabla_w J(w) = -2\,\mathbb{E}_\pi[(q_\pi(s,a) - \hat{q}_\pi(s,a,w))\,\nabla_w \hat{q}_\pi(s,a,w)]$, so the update is $\Delta w = -\tfrac{1}{2}\alpha\,\nabla_w J(w) = \alpha\,\mathbb{E}_\pi[(q_\pi(s,a) - \hat{q}_\pi(s,a,w))\,\nabla_w \hat{q}_\pi(s,a,w)]$.
Update for Monte Carlo methods (substitute the unknown target $q_\pi(s,a)$ with the return $G_t$): $\Delta w = \alpha\,(G_t - \hat{q}(s,a,w))\,\nabla_w \hat{q}(s,a,w)$
Update for SARSA (substitute the target with a TD target): $\Delta w = \alpha\,(r + \gamma \hat{q}(s',a',w) - \hat{q}(s,a,w))\,\nabla_w \hat{q}(s,a,w)$
Update for Q-learning (substitute the target with a max TD target): $\Delta w = \alpha\,(r + \gamma \max_{a'} \hat{q}(s',a',w) - \hat{q}(s,a,w))\,\nabla_w \hat{q}(s,a,w)$
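For comparison, a minimal sketch of the SARSA and Q-learning updates with a linear action-value approximator $\hat{q}(s,a,w) = w^\top x(s,a)$; the per-(state, action) feature vectors and the terminal handling are assumed for illustration, not taken from the lecture.

```python
import numpy as np

def sarsa_update(w, x_sa, reward, x_sa_next, done, gamma=0.99, alpha=0.01):
    """Semi-gradient SARSA: bootstrap with the action actually taken in s'."""
    q_hat = w @ x_sa
    q_next = 0.0 if done else w @ x_sa_next
    return w + alpha * (reward + gamma * q_next - q_hat) * x_sa

def q_learning_update(w, x_sa, reward, x_sa_next_all, done, gamma=0.99, alpha=0.01):
    """Semi-gradient Q-learning: bootstrap with the greedy action in s'.

    x_sa_next_all : array of shape (num_actions, d), one feature row per
                    candidate action a' in state s'
    """
    q_hat = w @ x_sa
    q_next = 0.0 if done else np.max(x_sa_next_all @ w)   # max over next actions
    return w + alpha * (reward + gamma * q_next - q_hat) * x_sa
```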
Look up Baird's counterexample as to why convergence is not guaranteed