proximal-policy-optimization

The leading contenders for learning policies or state-value functions parameterized by a neural network:

  • deep Q-learning: fails on simple problems (e.g. continuous control benchmarks)
  • "vanilla" policy gradient methods: poor data efficiency and robustness
  • trust region policy gradient methods (TRPO): relatively complicated, and not compatible with architectures that include noise (such as dropout) or parameter sharing

To optimize policies, they alternate between sampling data from the policy and performing several epochs of optimization on the sampled data.

recap: policy gradient

To obtain a policy gradient, we differentiate the objective: $\hat{\mathbb{E}}_t\big[\log\pi_\theta(a_t \vert s_t)\,\hat{A}_t\big]$

where $\hat{A}_t$ estimates the advantage function at time step $t$. $\hat{\mathbb{E}}_t$ is the empirical average over a finite batch of samples, in an algorithm that alternates between sampling and optimization. Using the same trajectory to perform multiple steps of optimization leads to destructively large policy updates.
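
As a minimal sketch (not the paper's code), this surrogate loss can be written in PyTorch as follows; the tensor names `log_probs` and `advantages` are assumptions, standing for quantities collected from rollouts:

```python
# Sketch of the vanilla policy-gradient surrogate loss
# L^PG(theta) = E_t[ log pi_theta(a_t | s_t) * A_hat_t ]; names are illustrative.
import torch

def pg_loss(log_probs: torch.Tensor, advantages: torch.Tensor) -> torch.Tensor:
    """log_probs: log pi_theta(a_t|s_t) for the sampled actions (requires grad);
    advantages: A_hat_t estimates, treated as constants (no gradient)."""
    # Negate because optimizers minimize; the gradient of this is the policy gradient estimator.
    return -(log_probs * advantages.detach()).mean()
```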

recap: TRPO

in trust region policy optimization (TRPO), we enforce a constraint on the size of the policy update:

$$\max_\theta\ \hat{\mathbb{E}}_t\Big[ \frac{\pi_\theta(a_t\vert s_t)}{\pi_{\theta_\text{old}}(a_t\vert s_t)}\, \hat{A}_t \Big] \quad \text{subject to} \quad \hat{\mathbb{E}}_t\big[\text{KL}[\pi_{\theta_\text{old}}(\cdot\vert s_t),\ \pi_\theta(\cdot\vert s_t)]\big] \leq \delta$$
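
For intuition, here is a sketch of the two quantities TRPO balances: the ratio-weighted surrogate objective and a sample-based KL estimate. This is illustrative only; the actual TRPO update (conjugate gradient plus a line search on the constrained problem) is not shown, and all names are assumptions:

```python
# Sketch of the TRPO surrogate objective and a sample-based KL estimate (diagnostic only).
import torch

def trpo_surrogate_and_kl(new_log_probs, old_log_probs, advantages):
    ratio = torch.exp(new_log_probs - old_log_probs)      # pi_theta / pi_theta_old
    surrogate = (ratio * advantages.detach()).mean()      # objective to maximize
    # Sample estimate of KL[pi_old || pi_theta] over the visited state-action pairs.
    approx_kl = (old_log_probs - new_log_probs).mean()
    return surrogate, approx_kl
```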

clipped surrogate objective

Denoting the probability ratio $r_t(\theta)=\frac{\pi_\theta(a_t\vert s_t)}{\pi_{\theta_\text{old}}(a_t\vert s_t)}$, they propose the clipped objective:

$$\hat{\mathbb{E}}_t\Big[\min\big(r_t(\theta)\,\hat{A}_t,\ \text{clip}(r_t(\theta),\, 1-\epsilon,\, 1+\epsilon)\,\hat{A}_t\big)\Big]$$

The first term inside the $\min$ is the (unconstrained) TRPO surrogate objective; the second clips the probability ratio to the interval $[1-\epsilon, 1+\epsilon]$.

We take the minimum to form a pessimistic (lower) bound on the unclipped objective (recall we want to maximize it): the change in the probability ratio is ignored when it would make the objective improve, and included when it would make the objective worse.
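
A sketch of this clipped surrogate, written as a loss to minimize; `eps` plays the role of $\epsilon$ (the paper uses values around 0.2), and the tensor names are assumptions:

```python
# Sketch of the clipped surrogate objective L^CLIP, negated for gradient descent.
import torch

def ppo_clip_loss(new_log_probs, old_log_probs, advantages, eps: float = 0.2):
    adv = advantages.detach()
    ratio = torch.exp(new_log_probs - old_log_probs)          # r_t(theta)
    unclipped = ratio * adv
    clipped = torch.clamp(ratio, 1.0 - eps, 1.0 + eps) * adv
    # Pessimistic bound: elementwise minimum of the two terms, then negate.
    return -torch.min(unclipped, clipped).mean()
```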

PPO algorithm

They run the policy for $T$ time steps and compute an advantage estimate that only looks ahead as far as time step $T$:

$$\hat{A}_t = -V(s_t) + r_t + \gamma r_{t+1} + \dots + \gamma^{T-t-1} r_{T-1} + \gamma^{T-t} V(s_T)$$

(this is because $\hat{A}_t = r_t + \gamma V(s_{t+1}) - V(s_t)$ in the one-step case, and the $V(s_{t+1})$ term can be expanded ("bootstrapped") using the recursion $V(s_t) = r_t + \gamma V(s_{t+1})$ until we reach time step $T$, where $V(s_T)$ is used directly)
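
A small NumPy sketch of this truncated estimator (the paper actually uses a truncated version of generalized advantage estimation, of which the formula above is the $\lambda=1$ special case); `values` is assumed to have length $T+1$ so that `values[T]` supplies $V(s_T)$:

```python
# Sketch of the truncated advantage estimate over a T-step segment (no GAE, lambda = 1).
import numpy as np

def truncated_advantages(rewards: np.ndarray, values: np.ndarray, gamma: float = 0.99):
    T = len(rewards)
    advantages = np.zeros(T)
    for t in range(T):
        # discounted sum of rewards from t to T-1, bootstrapped with gamma^(T-t) * V(s_T)
        ret = sum(gamma ** (k - t) * rewards[k] for k in range(t, T))
        ret += gamma ** (T - t) * values[T]
        advantages[t] = ret - values[t]
    return advantages
```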

Each of $N$ parallel actors collects $T$ time steps of data; the loss is constructed on these $NT$ time steps and optimized with minibatch Adam for several epochs. This resembles truncated back-propagation through time, as used in recurrent neural networks.
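
A sketch of how the $NT$ collected time steps might be split into shuffled minibatches for several epochs of updates; the constants and the indexing scheme here are illustrative, not the paper's settings:

```python
# Sketch of the epoch/minibatch loop over N*T collected time steps.
import numpy as np

N, T, K, minibatch_size = 8, 128, 4, 64      # actors, horizon, epochs, batch size (illustrative)
batch_indices = np.arange(N * T)             # indices into the flattened rollout data

for epoch in range(K):
    np.random.shuffle(batch_indices)
    for start in range(0, N * T, minibatch_size):
        mb = batch_indices[start:start + minibatch_size]
        # Take one Adam step on the clipped loss (e.g. ppo_clip_loss above)
        # evaluated on the minibatch `mb`; the update itself is omitted here.
```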

In their experiments, the policy is parameterized by a fully-connected MLP with 2 hidden layers of 64 units, tanh activations, outputting the mean of a Gaussian distribution, with variable standard deviations. Parameters between the policy and value function are not shared.
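
A PyTorch sketch of such a policy network (not the paper's code); one common reading of "variable standard deviations" is a learned, state-independent log-std, which is assumed here:

```python
# Sketch of a 2x64 tanh MLP Gaussian policy with a learned, state-independent log-std.
import torch
import torch.nn as nn

class GaussianPolicy(nn.Module):
    def __init__(self, obs_dim: int, act_dim: int):
        super().__init__()
        self.mean_net = nn.Sequential(
            nn.Linear(obs_dim, 64), nn.Tanh(),
            nn.Linear(64, 64), nn.Tanh(),
            nn.Linear(64, act_dim),
        )
        self.log_std = nn.Parameter(torch.zeros(act_dim))  # assumed parameterization of the std

    def forward(self, obs: torch.Tensor) -> torch.distributions.Normal:
        mean = self.mean_net(obs)
        return torch.distributions.Normal(mean, self.log_std.exp())
```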

On the MuJoCo benchmarks and the higher-dimensional continuous control (humanoid) tasks, PPO converges faster than the alternatives; on Atari they compare mainly against A2C and ACER (ACER performs well there).

Q: why do they say PPO is actor-critic style?