Stanford CS234 Reinforcement Learning (Lecture 2)

Assumptions

  • Markov Property: $P(s_i\vert s_0,\dots, s_{i-1})=P(s_i\vert s_{i-1})$
  • Finite state space: $\lvert S\rvert < \infty$
  • Stationary transition probabilities (time-independent): $P(s_i=s'\vert s_{i-1}=s)=P(s_j=s'\vert s_{j-1}=s)$. We can write it as a transition probability matrix $P$ of size $\lvert S\rvert \times \lvert S \rvert$ where $P_{ij} = P(j\vert i)$.

Markov Reward Process

add:

  • $R:S\to\mathbb{R}$: reward function mapping states to rewards
  • $\gamma \in [0, 1]$: discount factor

Expected reward: $R(s)=\mathbb{E}[r_0\vert s_0=s]$

Stationary rewards assumption:

  • $s_i=s_j\Rightarrow r_i=r_j$ (same state implies same reward)
  • For stochastic rewards: $\mathrm{CDF}(r_i\vert s_i=s)=\mathrm{CDF}(r_j\vert s_j=s)$ and we have $R(s)=\mathbb{E}[r_0\vert s_0=s]=\mathbb{E}[r_1\vert s_1=s]=\dots$ (same state, regardless of step, implies same CDF and therefore same expected reward)

Q: Why does a condition on the CDF imply a condition on the distribution itself?

A condition on the Cumulative Distribution Function (CDF) is equivalent to a condition on the distribution itself. The CDF is a representation of the probability distribution of a random variable.

The reason the CDF might be used instead of the Probability Density Function (PDF) or Probability Mass Function (PMF) is that the CDF is defined for all kinds of random variables—discrete, continuous, and mixed—while the PDF or PMF is only defined for continuous or discrete random variables, respectively.

Equal CDFs at every time step therefore imply that the expected reward for a given state is constant over time.

  • Horizon: number of time steps in each episode (finite or infinite)
  • Return: sum of discounted rewards $G_t=\sum_{i=t}^{H-1}\gamma^{i-t}r_i$, $\forall\, 0\leq t \leq H-1$
  • State value function: $V_t(s)=\mathbb{E}[G_t\vert s_t=s]$ (expected return starting from $s$).

Infinite horizon (with $\gamma < 1$) + stationary rewards $\Rightarrow V(s)=V_0(s)$ (stationary state value function).

Computing the state value function

Monte Carlo Simulation

lecture2_state_value_fn_monte_carlo_simulation.png
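
A minimal Monte Carlo sketch of this estimate (my own illustration, not code from the lecture): `P` is assumed to be an $\lvert S\rvert \times \lvert S\rvert$ transition matrix, `R` a reward vector indexed by state, and we average the discounted returns of truncated rollouts.

```python
import numpy as np

def mc_state_value(P, R, gamma, s, n_episodes=1000, horizon=100, seed=0):
    """Estimate V(s) by averaging discounted returns of sampled rollouts.

    Hypothetical helper: P is |S| x |S| row-stochastic, R maps state index to reward.
    Rollouts are truncated at `horizon`, which is reasonable for gamma < 1."""
    rng = np.random.default_rng(seed)
    n_states = P.shape[0]
    returns = []
    for _ in range(n_episodes):
        state, g, discount = s, 0.0, 1.0
        for _ in range(horizon):
            g += discount * R[state]                  # reward collected in the current state
            state = rng.choice(n_states, p=P[state])  # sample s' ~ P(. | state)
            discount *= gamma
        returns.append(g)
    return np.mean(returns)
```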

Analytic solution

lecture2_state_value_fn_analytic_solution.png

As $\lvert S\rvert < \infty$ we can write $V=R+\gamma PV$, which has the analytic solution $V = (I-\gamma P)^{-1}R$ (complexity $O(\lvert S\rvert^3)$).

(this assumes that we have the transition probabilities and rewards)
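
A sketch of the analytic solution with NumPy (assuming the same `P` and `R` arrays as above); solving the linear system directly is preferable to forming the inverse explicitly:

```python
import numpy as np

def mrp_value_analytic(P, R, gamma):
    """Solve (I - gamma * P) V = R exactly for the MRP value function."""
    n = P.shape[0]
    return np.linalg.solve(np.eye(n) - gamma * P, R)
```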

Dynamic Programming solution

Finite horizon:

  • start with $V_H(s)=0$ (by definition, there are no future steps to consider, so the expected sum of future rewards is zero)
  • iterate back to $0$ with $V_t(s)=R(s)+\sum_{s'\in S}P(s'\vert s)V_{t+1}(s')$ (see the sketch below)
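
A minimal backward-induction sketch for this finite-horizon case (hypothetical helper, undiscounted as in the formula above):

```python
import numpy as np

def mrp_value_finite_horizon(P, R, H):
    """Backward induction: V_H = 0, then V_t = R + P V_{t+1} for t = H-1, ..., 0.

    Returns the list [V_0, ..., V_H] of value vectors (one entry per time step)."""
    V = [None] * (H + 1)
    V[H] = np.zeros(P.shape[0])
    for t in range(H - 1, -1, -1):
        V[t] = R + P @ V[t + 1]
    return V
```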

Infinite horizon:

  • we know VV must be time-independent
  • start with two value function estimates $V'(s)\leftarrow 0\ \forall s$ and $V(s)\leftarrow \infty\ \forall s$
  • update estimate $V' = R + \gamma PV$ until $\lVert V' - V\rVert_\infty \leq \epsilon$ (infinity norm is the component with largest absolute value) and return $V'$
  • see p.8 for proof of correctness

Complexity: $O(\lvert S\rvert^2)$ per iteration
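
A minimal sketch of this iterative evaluation (hypothetical helper; each sweep is one matrix-vector product, hence the $O(\lvert S\rvert^2)$ per-iteration cost):

```python
import numpy as np

def mrp_value_iterative(P, R, gamma, eps=1e-6):
    """Iterate V' = R + gamma * P V until the infinity norm of the change is <= eps."""
    V = np.zeros(P.shape[0])
    while True:
        V_new = R + gamma * P @ V
        if np.max(np.abs(V_new - V)) <= eps:   # || V' - V ||_inf <= eps
            return V_new
        V = V_new
```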

Markov Decision Process

add $A$: finite set of actions available from each state $s$.

Stationary transition probabilities:

$P(s_i=s'\vert s_{i-1}=s, a_{i-1}=a)=P(s_j=s'\vert s_{j-1}=s, a_{j-1}=a)$ (the action is taken in the previous state, before the transition)

Stationary rewards:

$R(s,a)=\mathbb{E}[r_i\vert s_i=s, a_i=a]\ \forall i$

Policy: probability distribution over actions given current state. Can vary with time: $\pi = (\pi_0, \pi_1, \dots)$ (where subscript denotes timestep)

  • State value function: $V_t^{\pi}(s)=\mathbb{E}_\pi[G_t\vert s_t=s]$. When the horizon is infinite, the state value function is stationary and $V^\pi(s)=V_0^{\pi}(s)$
  • State action value function: $Q_t^{\pi}(s,a)=\mathbb{E}_\pi[G_t\vert s_t=s, a_t=a]$, i.e. the expected return given that we are in state $s$, take action $a$, and follow $\pi$ afterwards. With infinite horizon: $Q^{\pi}(s,a)=Q_0^{\pi}(s,a)$

Computing state action value function for infinite horizon:

$Q^{\pi}(s,a) = R(s,a) + \gamma \sum_{s'\in S}P(s'\vert s,a)V^{\pi}(s')$
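
A one-line vectorized sketch of this computation (hypothetical helper; `P` is assumed to be an $\lvert S\rvert \times \lvert A\rvert \times \lvert S\rvert$ tensor and `R` an $\lvert S\rvert \times \lvert A\rvert$ matrix):

```python
import numpy as np

def q_from_v(P, R, V, gamma):
    """Q(s,a) = R(s,a) + gamma * sum_s' P(s'|s,a) V(s'), computed for all (s, a) at once."""
    return R + gamma * np.einsum('sat,t->sa', P, V)
```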

A stationary policy (i.e. $\pi=\pi_0=\pi_1=\dots$) has an equivalent Markov reward process (obtained by taking the expectation over the action space):

$R^{\pi}(s)=\sum_{a\in A} \pi(a\vert s)R(s,a)$

$P^{\pi}(s'\vert s)=\sum_{a\in A} \pi(a\vert s) P(s'\vert s,a)$

We can then use the previous techniques (analytic, iterative) to evaluate its value function; this is called policy evaluation.
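
A minimal sketch of this reduction followed by the analytic solve (hypothetical `policy_evaluation` helper; same assumed shapes as above, with `pi` an $\lvert S\rvert \times \lvert A\rvert$ matrix of action probabilities):

```python
import numpy as np

def policy_evaluation(P, R, pi, gamma):
    """Evaluate a stationary policy by reducing the MDP to its induced MRP.

    P: (|S|, |A|, |S|) transitions, R: (|S|, |A|) rewards, pi: (|S|, |A|) probabilities."""
    R_pi = np.sum(pi * R, axis=1)           # R^pi(s)    = sum_a pi(a|s) R(s,a)
    P_pi = np.einsum('sa,sat->st', pi, P)   # P^pi(s'|s) = sum_a pi(a|s) P(s'|s,a)
    n = P_pi.shape[0]
    return np.linalg.solve(np.eye(n) - gamma * P_pi, R_pi)  # V = (I - gamma P^pi)^-1 R^pi
```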

MDP control for infinite horizon

$\pi^*$ is optimal iff $V_t^{\pi^*}(s) \geq V_t^{\pi}(s)\ \forall t,s$ and for every policy $\pi$.

For an infinite horizon MDP, existence of an optimal policy implies existence of a stationary optimal policy (i.e. we only need to consider stationary policies). Intuitively, if $\pi^{*} = (\pi_0^{*}, \pi_1^{*}, \dots)$ is an optimal (non-stationary) policy, we can compute $V^{\pi_i^{*}}$, which is independent of time (because the horizon is infinite), for all $i$. By the definition of an optimal policy, we can just keep the $\pi_i^{*}$ that has the maximum state value function.

Moreover, there is an optimal deterministic policy $\hat\pi$ such that $V^{\hat\pi}(s) \geq V^{\pi}(s)$ for every policy $\pi$ and all states $s$. We can construct it from an optimal stationary policy $\pi$ by acting greedily with respect to $V^{\pi}$:

$\hat\pi(s) = \arg\max_{a\in A}\big[R(s,a) + \gamma \sum_{s'\in S}P(s'\vert s, a)V^{\pi}(s')\big], \quad \forall s \in S$

The search for an optimal policy has thus been reduced to the set of deterministic stationary policies (there are $\vert A\vert^{\vert S\vert}$ possibilities).
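
A minimal sketch of this greedy construction (hypothetical `greedy_policy` helper, same assumed array shapes as before):

```python
import numpy as np

def greedy_policy(P, R, V, gamma):
    """For each state, pick the action maximizing R(s,a) + gamma * sum_s' P(s'|s,a) V(s')."""
    Q = R + gamma * np.einsum('sat,t->sa', P, V)
    return np.argmax(Q, axis=1)   # one action index per state, i.e. a deterministic policy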

Policy search

lecture2_policy_search.png

Brute force algorithm. It terminates because it checks all $\vert A\vert^{\vert S\vert}$ policies.
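
A brute-force sketch under the same assumptions (reusing the hypothetical `policy_evaluation` helper above; the optimal policy dominates every other componentwise, so it survives the comparison):

```python
import itertools
import numpy as np

def brute_force_policy_search(P, R, gamma):
    """Enumerate all |A|^|S| deterministic policies and keep one whose value
    function componentwise dominates the best found so far."""
    n_states, n_actions = R.shape
    best_policy, best_V = None, None
    for actions in itertools.product(range(n_actions), repeat=n_states):
        pi = np.eye(n_actions)[list(actions)]        # one-hot encoding of a deterministic policy
        V = policy_evaluation(P, R, pi, gamma)
        if best_V is None or np.all(V >= best_V):
            best_policy, best_V = np.array(actions), V
    return best_policy, best_V
```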

Policy iteration

More efficient than brute-force policy search.

lecture2_policy_improvement.png

lecture2_policy_iteration.png
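
A minimal policy iteration sketch, alternating the hypothetical `policy_evaluation` and `greedy_policy` helpers from above until the policy stops changing:

```python
import numpy as np

def policy_iteration(P, R, gamma):
    """Alternate policy evaluation and greedy policy improvement until convergence."""
    n_states, n_actions = R.shape
    policy = np.zeros(n_states, dtype=int)           # arbitrary initial deterministic policy
    while True:
        pi = np.eye(n_actions)[policy]               # view it as a (degenerate) stochastic policy
        V = policy_evaluation(P, R, pi, gamma)       # evaluate V^pi
        new_policy = greedy_policy(P, R, V, gamma)   # improve greedily w.r.t. V^pi
        if np.array_equal(new_policy, policy):       # no improvement possible: Bellman optimality holds
            return policy, V
        policy = new_policy
```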

Proof of correctness: https://stats.stackexchange.com/questions/272777/policy-and-value-iteration-algorithm-convergence-conditions/299950#299950

The value functions at every iteration are non-decreasing. If we cannot improve our policy further, it means:

$\pi(s) = \arg\max_{a\in A}\big[R(s,a) + \gamma \sum_{s' \in S} P(s'\vert s,a)V^{\pi}(s')\big], \quad \forall s\in S$

thus:

$V^{\pi}(s) = \max_{a\in A}\big[R(s,a) + \gamma \sum_{s'\in S}P(s'\vert s, a)V^{\pi}(s')\big], \quad \forall s\in S$

which is the Bellman optimality equation.

Value iteration

We look for a fixed point in value-function space (of the Bellman optimality backup) instead of searching over policies.

lecture2_value_iteration.png
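
A minimal value iteration sketch (hypothetical helper; it repeatedly applies the Bellman optimality backup until the change is small in infinity norm, then reads off a greedy policy):

```python
import numpy as np

def value_iteration(P, R, gamma, eps=1e-6):
    """Apply V(s) <- max_a [R(s,a) + gamma * sum_s' P(s'|s,a) V(s')] until convergence."""
    V = np.zeros(R.shape[0])
    while True:
        Q = R + gamma * np.einsum('sat,t->sa', P, V)  # Q(s,a) under the current V
        V_new = np.max(Q, axis=1)                     # Bellman optimality backup
        if np.max(np.abs(V_new - V)) <= eps:
            return V_new, np.argmax(Q, axis=1)        # value estimate and a greedy policy
        V = V_new
```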

MDP control for finite horizon

In the finite horizon setting, there is an optimal deterministic policy, but it is no longer stationary (at each time $t$ the optimal policy can differ).

lecture2_value_iteration_finite_horizon.png
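
A minimal sketch of finite-horizon value iteration under the same assumptions (hypothetical helper; note that it returns one greedy policy per time step, reflecting the non-stationarity):

```python
import numpy as np

def finite_horizon_control(P, R, H, gamma=1.0):
    """Backward induction for a finite-horizon MDP:
    V_H = 0, V_t(s) = max_a [R(s,a) + gamma * sum_s' P(s'|s,a) V_{t+1}(s')]."""
    values = [None] * (H + 1)
    policies = [None] * H
    values[H] = np.zeros(R.shape[0])
    for t in range(H - 1, -1, -1):
        Q = R + gamma * np.einsum('sat,t->sa', P, values[t + 1])
        values[t] = np.max(Q, axis=1)
        policies[t] = np.argmax(Q, axis=1)   # the optimal action can change with t
    return values, policies
```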