PPO is one of the leading contenders for learning policies or state-value functions parameterized by a neural network.
To optimize policies, they alternate between sampling data from the policy and performing several epochs of optimization on the sampled data.
To obtain a policy gradient, we differentiate the objective

$$L^{PG}(\theta) = \hat{\mathbb{E}}_t\!\left[\log \pi_\theta(a_t \mid s_t)\, \hat{A}_t\right],$$

where $\hat{A}_t$ estimates the advantage function at time step $t$ and $\hat{\mathbb{E}}_t[\ldots]$ is the empirical average over a finite batch of samples, in an algorithm that alternates between sampling and optimization. Using the same trajectory to perform multiple steps of optimization on this objective leads to destructively large policy updates.
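A minimal PyTorch sketch of this surrogate, assuming `log_probs` and `advantages` are precomputed tensors (the variable names are mine, not the paper's):

```python
import torch

def pg_loss(log_probs: torch.Tensor, advantages: torch.Tensor) -> torch.Tensor:
    """Vanilla policy-gradient surrogate: L^PG = E_t[log pi(a_t|s_t) * A_t].

    Negated so that minimizing with a gradient-descent optimizer maximizes
    the objective; differentiating w.r.t. the policy parameters gives the
    standard policy-gradient estimator.
    """
    return -(log_probs * advantages).mean()
```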
In trust region policy optimization (TRPO), we enforce a constraint on the size of the policy update:

$$\max_\theta \;\hat{\mathbb{E}}_t\!\left[\frac{\pi_\theta(a_t \mid s_t)}{\pi_{\theta_{\text{old}}}(a_t \mid s_t)}\, \hat{A}_t\right] \quad \text{subject to} \quad \hat{\mathbb{E}}_t\!\left[\mathrm{KL}\!\left[\pi_{\theta_{\text{old}}}(\cdot \mid s_t),\, \pi_\theta(\cdot \mid s_t)\right]\right] \le \delta.$$
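A hedged sketch of the two quantities TRPO works with (the ratio-weighted surrogate and the mean KL), assuming Gaussian policies and scalar actions; the constrained solver itself (conjugate gradient plus a line search) is omitted:

```python
import torch
from torch.distributions import Normal, kl_divergence

def trpo_terms(new_dist: Normal, old_dist: Normal,
               actions: torch.Tensor, advantages: torch.Tensor):
    """Surrogate objective and mean-KL constraint term used by TRPO.

    The constrained problem is: maximize `surrogate` subject to `mean_kl` <= delta.
    For vector-valued actions, sum log-probs and KL over the action dimension first.
    """
    ratio = torch.exp(new_dist.log_prob(actions) - old_dist.log_prob(actions))
    surrogate = (ratio * advantages).mean()
    mean_kl = kl_divergence(old_dist, new_dist).mean()
    return surrogate, mean_kl
```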
Denoting the probability ratio $r_t(\theta) = \frac{\pi_\theta(a_t \mid s_t)}{\pi_{\theta_{\text{old}}}(a_t \mid s_t)}$, they propose the clipped objective:

$$L^{CLIP}(\theta) = \hat{\mathbb{E}}_t\!\left[\min\!\big(r_t(\theta)\,\hat{A}_t,\; \mathrm{clip}(r_t(\theta),\, 1-\epsilon,\, 1+\epsilon)\,\hat{A}_t\big)\right].$$

The first term inside the $\min$ is the TRPO surrogate objective; the second term clips the probability ratio to the interval $[1-\epsilon,\, 1+\epsilon]$.
We take the minimum of the unclipped and clipped terms to form a pessimistic bound on the unclipped objective (recall we want to maximize it): the change in the probability ratio is ignored when it would make the objective improve, and included when it would make the objective worse.
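A minimal sketch of the clipped loss in PyTorch (variable names are mine; `old_log_probs` comes from the policy that collected the data and is held fixed during optimization):

```python
import torch

def ppo_clip_loss(new_log_probs: torch.Tensor,
                  old_log_probs: torch.Tensor,
                  advantages: torch.Tensor,
                  clip_eps: float = 0.2) -> torch.Tensor:
    """Clipped surrogate: L^CLIP = E_t[min(r_t * A_t, clip(r_t, 1-eps, 1+eps) * A_t)].

    Negated so that minimizing this loss maximizes the objective.
    """
    ratio = torch.exp(new_log_probs - old_log_probs.detach())
    unclipped = ratio * advantages
    clipped = torch.clamp(ratio, 1.0 - clip_eps, 1.0 + clip_eps) * advantages
    return -torch.min(unclipped, clipped).mean()
```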
They run the policy for $T$ time steps (much shorter than the episode length) and compute an advantage estimate that does not look beyond time step $T$:

$$\hat{A}_t = -V(s_t) + r_t + \gamma r_{t+1} + \cdots + \gamma^{T-t-1} r_{T-1} + \gamma^{T-t} V(s_T)$$

(the return beyond the truncated segment cannot be computed directly, but it can be "bootstrapped" by using the definition of the value function: $V(s_T)$ stands in for the discounted rewards from time step $T$ onward).
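A NumPy sketch of this estimator, assuming a single fixed-length segment with no episode termination inside it:

```python
import numpy as np

def truncated_advantages(rewards, values, last_value, gamma=0.99):
    """A_t = -V(s_t) + r_t + gamma*r_{t+1} + ... + gamma^(T-t) * V(s_T).

    rewards: r_0 .. r_{T-1}; values: V(s_0) .. V(s_{T-1});
    last_value: V(s_T), which bootstraps the return beyond the segment.
    """
    T = len(rewards)
    advantages = np.empty(T)
    ret = last_value  # bootstrapped return from time step T onward
    for t in reversed(range(T)):
        ret = rewards[t] + gamma * ret  # discounted return from t, bootstrapped at T
        advantages[t] = ret - values[t]
    return advantages
```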
Each of $N$ parallel actors collects $T$ time steps of data; the surrogate loss is constructed on these $NT$ time steps of data and optimized with minibatch Adam for several epochs. This fixed-length-segment setup resembles truncated back-propagation through time, as used in recurrent neural networks.
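Under these assumptions, the optimization phase might look like the sketch below; it reuses `ppo_clip_loss` from the earlier sketch, and `policy.log_prob` is an assumed interface (a concrete version appears in the next sketch):

```python
import torch

def ppo_update(policy, optimizer, rollouts, epochs=10, minibatch_size=64):
    """One optimization phase over data gathered by N parallel actors.

    `rollouts` has one entry per actor: a (states, actions, old_log_probs,
    advantages) tuple of tensors covering T time steps, so concatenating
    gives N*T samples. The surrogate is optimized with several epochs of
    minibatch Adam (the `optimizer`) over the full batch.
    """
    states, actions, old_logp, adv = (torch.cat(x) for x in zip(*rollouts))

    n = states.shape[0]  # = N * T
    for _ in range(epochs):
        for idx in torch.randperm(n).split(minibatch_size):
            new_logp = policy.log_prob(states[idx], actions[idx])
            loss = ppo_clip_loss(new_logp, old_logp[idx], adv[idx])
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
```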
In their experiments, the policy is parameterized by a fully-connected MLP with 2 hidden layers of 64 units, tanh activations, outputting the mean of a Gaussian distribution, with variable standard deviations. Parameters between the policy and value function are not shared.
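A PyTorch sketch matching this description; treating the log standard deviation as a state-independent learned parameter is my assumption, since the notes only say the standard deviations are variable:

```python
import torch
import torch.nn as nn
from torch.distributions import Normal

class GaussianMLPPolicy(nn.Module):
    """MLP with two hidden layers of 64 tanh units outputting the Gaussian mean;
    the log standard deviation is a separate learned parameter."""

    def __init__(self, obs_dim: int, act_dim: int):
        super().__init__()
        self.mean_net = nn.Sequential(
            nn.Linear(obs_dim, 64), nn.Tanh(),
            nn.Linear(64, 64), nn.Tanh(),
            nn.Linear(64, act_dim),
        )
        self.log_std = nn.Parameter(torch.zeros(act_dim))

    def forward(self, obs: torch.Tensor) -> Normal:
        return Normal(self.mean_net(obs), self.log_std.exp())

    def log_prob(self, obs: torch.Tensor, actions: torch.Tensor) -> torch.Tensor:
        # Sum over action dimensions to get one log-probability per sample.
        return self(obs).log_prob(actions).sum(-1)
```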
On MuJoCo and high-dimensional continuous control problems, PPO converges much faster than the baselines. They mostly compare against A2C and ACER (ACER is a strong baseline, especially on Atari).
Q: why do they say PPO is actor-critic style?