- High sensitivity to hyperparameters (especially step size)
- Outliers in the data can overwhelm training with noise.
- High variance (baseline approaches are required for variance reduction; otherwise learning degrades badly)
Approaches such as TRPO have been proposed to address these issues, but they have their own problems.
- Easy to implement:
- Surrogate loss. The objective function (the expected reward) is modified to a clipped version, L^CLIP.
Note that r_t(theta) is the ratio between the probability of the current action under the current policy and its probability under the previous policy, and epsilon is a hyperparameter (the paper sets it to 0.2).
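For reference, this is the ratio and the clipped objective as written in the PPO paper (here $\hat{A}_t$ is the advantage estimate at timestep $t$):

$$
r_t(\theta) = \frac{\pi_\theta(a_t \mid s_t)}{\pi_{\theta_\text{old}}(a_t \mid s_t)}, \qquad
L^{\text{CLIP}}(\theta) = \hat{\mathbb{E}}_t\!\left[\min\!\left(r_t(\theta)\,\hat{A}_t,\ \operatorname{clip}(r_t(\theta),\, 1-\epsilon,\, 1+\epsilon)\,\hat{A}_t\right)\right]
$$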
The basic idea behind clipping is to prevent a large probability ratio r_t(theta), and thus a large policy update, which can harm training. Clipping restricts the probability ratio to the range [1 - epsilon, 1 + epsilon].
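As a rough illustration, here is a minimal NumPy sketch of the clipped surrogate loss. The function and variable names are made up for this example, and a real implementation would backpropagate through the new log-probabilities with an autodiff framework rather than use plain NumPy:

```python
import numpy as np

def clipped_surrogate_loss(log_probs_new, log_probs_old, advantages, epsilon=0.2):
    """Negative of L^CLIP averaged over a batch of timesteps.

    log_probs_new:  log pi_theta(a_t | s_t) under the current policy
    log_probs_old:  log pi_theta_old(a_t | s_t) under the previous policy
    advantages:     advantage estimates A_hat_t
    epsilon:        clipping hyperparameter (0.2 in the paper)
    """
    # Probability ratio r_t(theta), computed in log space for stability.
    ratio = np.exp(log_probs_new - log_probs_old)
    # Clip the ratio to [1 - epsilon, 1 + epsilon].
    clipped_ratio = np.clip(ratio, 1.0 - epsilon, 1.0 + epsilon)
    # Elementwise minimum of the unclipped and clipped terms, averaged
    # over the batch; negated so the objective can be minimized.
    return -np.mean(np.minimum(ratio * advantages, clipped_ratio * advantages))

# Toy usage: three timesteps with hypothetical probabilities and advantages.
new_lp = np.log(np.array([0.5, 0.3, 0.9]))
old_lp = np.log(np.array([0.4, 0.6, 0.5]))
adv = np.array([1.0, -0.5, 2.0])
print(clipped_surrogate_loss(new_lp, old_lp, adv))  # scalar loss
```

Note how the min with the clipped term means the objective stops rewarding the policy for pushing the ratio beyond the [1 - epsilon, 1 + epsilon] band, which is exactly what discourages overly large updates.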