The leading contenders for learning policies or state-value functions parameterized by a neural network:

- deep Q-learning: fails on simple problems (e.g. continuous control benchmarks)
- "vanilla" policy gradient methods: poor data efficiency and robustness
- trust region policy gradient methods: too complicated, not compatible with architectures that include noise (s.a. dropout) or parameter sharing

To optimize policies, they alternate between sampling data from the policy and performing several epochs of optimization on the sampled data.

To obtain a policy gradient, we differentiate the objective: $\hat{\mathbb{E}_t}[\log\pi_\theta(a_t \vert s_t)\hat{A}_t]$

where $\hat{A}_t$ estimates the advantage function at time step $t$. $\hat{\mathbb{E}_t}$ is the empirical average over a finite batch of samples, in an algorithm that alternates between sampling and optimization. Using the same trajectory to perform multiple steps of optimization leads to destructively large policy updates.

in trust region policy optimization (TRPO), we enforce a constraint on the size of the policy update:

$\max_\theta \hat{\mathbb{E}_t}\big[ \frac{\pi_\theta(a_t\vert s_t)}{\pi_{\theta_\text{old}}(a_t\vert s_t)} \hat{A_t} \big] \text{ subject to } \hat{\mathbb{E}_t}[\text{KL}[\pi_\text{old}(\cdot\vert s_t), \pi_\theta(\cdot\vert s_t)]]\leq \delta$

Denoting the probability ratio $r_t(\theta)=\frac{\pi_\theta(a_t\vert s_t)}{\pi_\text{old}(a_t\vert s_t)}$, they propose the clipped objective:

$\hat{\mathbb{E}_t}[\min(r_t(\theta)\hat{A}_t, \text{clip}(r_t(\theta), 1-\epsilon, 1+\epsilon))\hat{A}_t]$

The first term inside the $\min$ is the TRPO objective, the second term clips the probability ratio to the interval $[1-\epsilon, 1+\epsilon]$.

We take the minimum to form a pessimistic bound (recall we want to maximize the objective), meaning the original objective is ignored when it would make the objective improve.

They run the policy for $T$ time steps and compute the advantage function, looking back $T$ steps:

$\hat{A}_t = -V(s_t) + r_t + \gamma r_{t+1} + \dots + \gamma^{T-t+1}r_{T-1} + \gamma^{T-t}V(s_T)$

(this is because $A_t=V(s_{t+1})-V(s_{t})$ but $V(s_{t+1})$ can be "bootstrapped" by using the definition of the value function $V(s_t)=r_t+\gamma V(s_{t+1})$ until we reach timestep $T$)

Each of $N$ parallel actors collect $T$ time steps, the loss is constructed on these $NT$ time steps of data and is optimized with minibatch Adam. This resembles back-propagation through time, used in recurrent neural networks.

In their experiments, the policy is parameterized by a fully-connected MLP with 2 hidden layers of 64 units, tanh activations, outputting the mean of a Gaussian distribution, with variable standard deviations. Parameters between the policy and value function are not shared.

On MuJoCo and high-dimensional continuous control problems, PPO converges much faster. They mostly compare against A2C and ACER (which performs well, esp. on Atari).

`Q`

: why do they say PPO is actor-critic style?