stanford-cs234-reinforcement-learning-lecture-2
Assumptions
- Markov Property: $P(s_i \mid s_0, \dots, s_{i-1}) = P(s_i \mid s_{i-1})$
- Finite state space: $|S| < \infty$
- Stationary transition probabilities (time-independent): $P(s_i = s' \mid s_{i-1} = s) = P(s_j = s' \mid s_{j-1} = s)$. We can write them as a transition probability matrix $P$ of size $|S| \times |S|$ where $P_{ij} = P(j \mid i)$.
Markov Reward Process
add:
- $R: S \to \mathbb{R}$: reward function mapping states to rewards
- $\gamma \in [0, 1]$: discount factor
Expected reward: $R(s) = \mathbb{E}[r_0 \mid s_0 = s]$
Stationary rewards assumption:
- $s_i = s_j \Rightarrow r_i = r_j$ (same state implies same reward)
- For stochastic rewards: $\mathrm{CDF}(r_i \mid s_i = s) = \mathrm{CDF}(r_j \mid s_j = s)$, and we have $R(s) = \mathbb{E}[r_0 \mid s_0 = s] = \mathbb{E}[r_1 \mid s_1 = s] = \dots$ (same state, regardless of step, implies same CDF and therefore same expected reward)
Q: Does a condition on the CDF imply a condition on the distribution itself?
A condition on the Cumulative Distribution Function (CDF) is equivalent to a condition on the distribution itself. The CDF is a representation of the probability distribution of a random variable.
The reason the CDF might be used instead of the Probability Density Function (PDF) or Probability Mass Function (PMF) is that the CDF is defined for all kinds of random variables—discrete, continuous, and mixed—while the PDF or PMF is only defined for continuous or discrete random variables, respectively.
This condition on the CDF therefore implies that the expected reward in a given state is constant across time steps.
- Horizon: number of time steps in each episode (finite or infinite)
- Return: sum of discounted rewards $G_t = \sum_{i=t}^{H-1} \gamma^{i-t} r_i \quad \forall\, 0 \le t \le H-1$
- State value function: $V_t(s) = \mathbb{E}[G_t \mid s_t = s]$ (expected return starting from $s$).
Infinite horizon (with $\gamma < 1$) + stationary rewards $\Rightarrow V(s) = V_0(s)$ (stationary state value function).
Computing the state value function
Monte Carlo Simulation
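Idea: sample many episodes starting from $s$ and average the observed discounted returns. A minimal sketch, assuming the dynamics can be sampled; the array conventions (`P[s, s']` for transition probabilities, `R[s]` for state rewards) and the truncation horizon are illustrative choices, not from the lecture:

```python
import numpy as np

def mc_value_estimate(P, R, gamma, s0, n_rollouts=1000, horizon=100, rng=None):
    """Estimate V(s0) by averaging discounted returns over sampled rollouts."""
    rng = np.random.default_rng() if rng is None else rng
    n_states = P.shape[0]
    returns = []
    for _ in range(n_rollouts):
        s, g, discount = s0, 0.0, 1.0
        for _ in range(horizon):              # truncate the (possibly infinite) horizon
            g += discount * R[s]              # collect the reward of the current state
            discount *= gamma
            s = rng.choice(n_states, p=P[s])  # sample next state from row s of P
        returns.append(g)
    return np.mean(returns)
```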
Analytic solution
As $|S| < \infty$ we can write $V = R + \gamma P V$, which has the analytical solution $V = (I - \gamma P)^{-1} R$ (complexity $O(|S|^3)$).
(this assumes that we have the transition probabilities and rewards)
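A sketch of the analytic solution in NumPy; in practice we solve the linear system rather than forming the inverse explicitly. The two-state MRP at the end is just a made-up example:

```python
import numpy as np

def mrp_value_analytic(P, R, gamma):
    """Solve V = R + gamma * P V exactly, i.e. V = (I - gamma * P)^{-1} R."""
    n = P.shape[0]
    return np.linalg.solve(np.eye(n) - gamma * P, R)

# Example: a 2-state MRP (made-up numbers).
P = np.array([[0.9, 0.1],
              [0.5, 0.5]])
R = np.array([1.0, -1.0])
print(mrp_value_analytic(P, R, gamma=0.9))
```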
Dynamic Programming solution
Finite horizon:
- start with $V_H(s) = 0$ (by definition, there are no future steps to consider, so the expected sum of future rewards is zero)
- iterate back to $t = 0$ with $V_t(s) = R(s) + \gamma \sum_{s' \in S} P(s' \mid s) V_{t+1}(s')$ (see the sketch below)
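A sketch of this backward recursion, with the same array conventions as above (assumed, not prescribed by the lecture):

```python
import numpy as np

def mrp_value_finite_horizon(P, R, gamma, H):
    """Return V_0 by iterating V_t = R + gamma * P V_{t+1} backwards from V_H = 0."""
    V = np.zeros(P.shape[0])     # V_H(s) = 0 for all s
    for _ in range(H):           # t = H-1, ..., 0
        V = R + gamma * P @ V
    return V
```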
Infinite horizon:
- we know $V$ must be time-independent
- start with two value-function estimates: $V'(s) \leftarrow 0$ and $V(s) \leftarrow \infty$ for all $s$
- repeat: set $V \leftarrow V'$, then update $V' = R + \gamma P V$; stop once $\|V' - V\|_\infty \le \epsilon$ (the infinity norm is the component with the largest absolute value) and return $V'$
- see p.8 for proof of correctness
Complexity: $O(|S|^2)$ per iteration (each update is a matrix-vector product). A sketch follows below.
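A sketch of the iterative scheme; initializing the old estimate to $\infty$ only serves to make the first convergence check fail, so the code simply checks after each update:

```python
import numpy as np

def mrp_value_iterative(P, R, gamma, eps=1e-8):
    """Iterate V' = R + gamma * P V until ||V' - V||_inf <= eps, then return V'."""
    V = np.zeros(P.shape[0])
    while True:
        V_new = R + gamma * P @ V              # one O(|S|^2) sweep
        if np.max(np.abs(V_new - V)) <= eps:   # infinity-norm stopping criterion
            return V_new
        V = V_new
```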
Markov Decision Process
add $A$: finite set of actions available from each state $s$.
Stationary transition probabilities:
$P(s_i = s' \mid s_{i-1} = s, a_{i-1} = a) = P(s_j = s' \mid s_{j-1} = s, a_{j-1} = a)$ (where action precedes state)
Stationary rewards:
$R(s, a) = \mathbb{E}[r_i \mid s_i = s, a_i = a] \quad \forall i$
Policy: probability distribution over actions given the current state. It can vary with time: $\pi = (\pi_0, \pi_1, \dots)$ (where the subscript denotes the timestep)
- State value function: $V_t^\pi(s) = \mathbb{E}_\pi[G_t \mid s_t = s]$. When the horizon is infinite, the state value function is stationary and $V^\pi(s) = V_0^\pi(s)$
- State-action value function: $Q_t^\pi(s, a) = \mathbb{E}_\pi[G_t \mid s_t = s, a_t = a]$. Meaning we compute the expected return given that we are in state $s$, take action $a$, and then follow $\pi$. With infinite horizon: $Q^\pi(s, a) = Q_0^\pi(s, a)$
Computing the state-action value function for infinite horizon:
$Q^\pi(s, a) = R(s, a) + \gamma \sum_{s' \in S} P(s' \mid s, a) V^\pi(s')$
A stationary policy (i.e. $\pi = \pi_0 = \pi_1 = \dots$) has an equivalent Markov reward process (obtained by taking the expectation over the action space):
$R^\pi(s) = \sum_{a \in A} \pi(a \mid s) R(s, a)$
$P^\pi(s' \mid s) = \sum_{a \in A} \pi(a \mid s) P(s' \mid s, a)$
We can then use the previous techniques to evaluate its value function (this is called policy evaluation).
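A sketch of policy evaluation by reducing the MDP to its induced MRP and reusing the exact linear solve; the tensor layout (`P[s, a, s2]` for $P(s' \mid s, a)$, `R[s, a]`, `pi[s, a]` for $\pi(a \mid s)$) is an assumed convention:

```python
import numpy as np

def policy_evaluation(P, R, pi, gamma):
    """Evaluate a stationary policy pi exactly via its induced MRP."""
    R_pi = np.einsum('sa,sa->s', pi, R)     # R^pi(s)    = sum_a pi(a|s) R(s, a)
    P_pi = np.einsum('sa,sap->sp', pi, P)   # P^pi(s'|s) = sum_a pi(a|s) P(s'|s, a)
    n = P.shape[0]
    return np.linalg.solve(np.eye(n) - gamma * P_pi, R_pi)
```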
MDP control for infinite horizon
$\pi^*$ is optimal iff $V_t^{\pi^*}(s) \ge V_t^{\pi}(s)$ for all $t$, all $s$, and all policies $\pi$.
For an infinite horizon MDP, the existence of an optimal policy implies the existence of a stationary optimal policy (i.e. we only need to consider stationary policies). Intuitively, if $\pi^* = (\pi_0^*, \pi_1^*, \dots)$ is an optimal (non-stationary) policy, we can compute $V^{\pi_i^*}$, which is independent of time (because the horizon is infinite), for each $i$. By the definition of an optimal policy, we can then just keep the $\pi_i^*$ that has the maximum state value function.
Moreover, there is an optimal deterministic policy $\hat{\pi}$ such that $V^{\hat{\pi}}(s) \ge V^{\pi}(s)$ for all states $s$ and all policies $\pi$. We can construct it greedily from an optimal stationary policy $\pi$:
$\hat{\pi}(s) = \arg\max_{a \in A} \left[ R(s, a) + \gamma \sum_{s' \in S} P(s' \mid s, a) V^{\pi}(s') \right], \quad \forall s \in S$
The search for an optimal policy has been reduced to the set of deterministic stationary policies (there are $|A|^{|S|}$ possibilities).
Policy search
Brute-force algorithm: enumerate every deterministic stationary policy, evaluate it, and keep the best. It terminates because it checks all $|A|^{|S|}$ policies.
Policy iteration
More efficient than brute-force search: alternate policy evaluation with greedy policy improvement.
Proof of correctness: https://stats.stackexchange.com/questions/272777/policy-and-value-iteration-algorithm-convergence-conditions/299950#299950
The value function is non-decreasing from one iteration to the next. If we cannot improve our policy further, it means:
$\pi(s) = \arg\max_{a \in A} \left[ R(s, a) + \gamma \sum_{s' \in S} P(s' \mid s, a) V^{\pi}(s') \right], \quad \forall s \in S$
thus:
$V^{\pi}(s) = \max_{a \in A} \left[ R(s, a) + \gamma \sum_{s' \in S} P(s' \mid s, a) V^{\pi}(s') \right]$
which is the Bellman optimality equation.
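A sketch of the full policy-iteration loop under the same assumed array conventions (`P[s, a, s2]`, `R[s, a]`); deterministic policies are stored as an array of action indices:

```python
import numpy as np

def policy_iteration(P, R, gamma):
    """Alternate exact policy evaluation and greedy improvement until the policy is stable."""
    n_states, n_actions = R.shape
    policy = np.zeros(n_states, dtype=int)                # arbitrary initial deterministic policy
    while True:
        # Policy evaluation: solve V = R^pi + gamma * P^pi V for the current policy.
        R_pi = R[np.arange(n_states), policy]
        P_pi = P[np.arange(n_states), policy]
        V = np.linalg.solve(np.eye(n_states) - gamma * P_pi, R_pi)
        # Policy improvement: act greedily with respect to Q(s, a).
        Q = R + gamma * np.einsum('sap,p->sa', P, V)
        new_policy = np.argmax(Q, axis=1)
        if np.array_equal(new_policy, policy):            # no improvement possible:
            return policy, V                              # Bellman optimality equation holds
        policy = new_policy
```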
Value iteration
We look for a fixed point in value-function space instead of a fixed policy: repeatedly apply the Bellman optimality backup until the value function stops changing. A sketch follows below.
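A sketch of value iteration under the same assumed conventions: apply the Bellman optimality backup until the value function stops changing, then read off the greedy policy:

```python
import numpy as np

def value_iteration(P, R, gamma, eps=1e-8):
    """Iterate V(s) <- max_a [R(s,a) + gamma * sum_s' P(s'|s,a) V(s')] to a fixed point."""
    V = np.zeros(R.shape[0])
    while True:
        Q = R + gamma * np.einsum('sap,p->sa', P, V)   # Bellman optimality backup
        V_new = np.max(Q, axis=1)
        if np.max(np.abs(V_new - V)) <= eps:
            return np.argmax(Q, axis=1), V_new         # greedy policy and its value
        V = V_new
```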
MDP control for finite horizon
In the finite horizon setting, there is an optimal deterministic policy, but it is no longer stationary (the policy can be different at each time step $t$).