Stanford CS234 Reinforcement Learning (Lecture 3)

Model Free Evaluation

  • we don't know the transition probabilities and rewards
  • assume infinite horizon, stationary rewards, transition probabilities and policies

In dynamic programming, we bootstrap/estimate the value of the next state value using our current estimate Vk1(s)V_{k-1}(s') until convergence.

Monte-Carlo on-policy evaluation

Works for episodic environments.

  1. Sample lots of episodes hjh_j until termination

  2. go over the episodes and keep track of:

    • N(s)N(s): number of times we were in each step

    • S(s)S(s): sum of total returns starting from the current step (e.g. for episode jj, timestep tt, S(sj,t)+=Gt=i=thjγitriS(s_{j,t}) += G_t = \sum_{i=t}^{\lvert h_j\rvert} \gamma^{i-t}r_i)

  3. For each state ss, V(s)=S(s)/N(s)V(s)=S(s)/N(s)

We can also keep an incremental estimate of V(s)V(s):

V(s)=V(s)+1N(s)α(GtV(s))V(s) = V(s) + \underbrace{\frac{1}{N(s)}}_{\alpha}(G_t - V(s))

When α>1/N\alpha > 1/N, it gives higher weight to newer data, which helps in non-stationary settings.

Two variations:

  • first-visit monte carlo (only update sate counters the first time we visit the state within the episode)
  • every-visit monte carlo (more data efficient in a true Markovian setting)

Monte-Carlo off-policy evaluation

off-policy = use data taken from one policy to evaluate another

in costly or high stake situations we are not able to obtain rollouts of GtG_t under the new policy.

Importance sampling

Exq[f(x)]=xq(x)f(x)dx=xp(x)q(x)/p(x)f(x)dx=Exp[qpf]\mathbb{E}_{x\sim q}[f(x)] = \int_x q(x)f(x)dx = \int_x p(x)\frac{q(x)/p(x)}f(x)dx = \mathbb{E}_{x\sim p}[\frac{q}{p}f]

In our case p=π1p = \pi_1, the new policy we want to evaluate and q=π2q = \pi_2, the old policy under which the episodes were generated.


where G(hj)G(h_j) is the total discounted sum of rewards for history hjh_j.

Temporal Difference (TD) learning

Combines bootstrapping (DP) and sampling (Monte Carlo).

Vπ(st)Vπ(st)+α(GtVπ(st))V^\pi(s_t)\leftarrow V^\pi(s_t) + \alpha (G_t - V^\pi(s_t))

=Vπ(st)+α(rt+γVπ(st+1)Vπ(st))= V^\pi(s_t) + \alpha (r_t + \gamma V^\pi(s_{t+1}) - V^\pi(s_t))

We bootstrap the total returns with the next state's estimate: rt+γVπ(st+1)r_t + \gamma V^\pi(s_{t+1}) (called the TD target).

δt=rt+γVπ(st+1)Vπ(st)\delta_t = r_t + \gamma V^\pi(s_{t+1}) - V^\pi(s_t) is called the TD error.

Since we don't use the total sum of discounted rewards GtG_t we don't have to wait for the episode to finish.


  • the reason why we are able to backup over one timestep in DP is because of the Markov assumption. We can use incremental Monte-Carlo with α>1/N\alpha > 1/N in non markovian settings.
  • we proved DP converges to the true value function by proving that the Bellman backup operator is a contraction
  • in TD learning, we bootstrap the next state's value to get an estimate of the current state. It is thus biased towards the next state's value.
  • TD learning is also less data efficient than Monte-Carlo. Since the information only comes from next state (vs the total discounted sum of rewards of the entire episode), we neet to experience an episode LL times (LL being the length of the episode) to get the full information back to the starting state (especially impactful if high reward is experienced at the end).

Batch Monte-Carlo and TD

  • Monte-Carlo batch setting: value converges value that minimizes mean squared error (since the minimizer of (yiy^)2\sum (y_i - \hat y)^2 is y^=yi\hat y = \sum y_i)
  • TD batch setting does not converge to the same value: