
Actor Critic Problems

  • High sensitivity to hyperparameters (especially step size)
  • Outliers in data overwhelm training with noise.
  • High variance (baseline approaches are required for variance reduction; otherwise learning is very poor, see the sketch after this list)
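
As a rough illustration of the baseline idea in the last bullet, here is a minimal sketch of subtracting a learned value-function baseline from the returns (the function and tensor names, and the normalization step, are my own choices, not from this note):

```python
import torch

def advantages_with_baseline(returns: torch.Tensor, values: torch.Tensor) -> torch.Tensor:
    """Subtract a value-function baseline V(s_t) from the returns G_t.

    Using A_t = G_t - V(s_t) instead of the raw return lowers the variance
    of the policy gradient without changing its expectation.
    """
    advantages = returns - values.detach()
    # Normalizing advantages per batch is a common extra variance-reduction trick.
    return (advantages - advantages.mean()) / (advantages.std() + 1e-8)
```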

Some approaches, such as TRPO, have been proposed, but they still have their own problems.

Proximal Policy Optimization

  • Easily implementable:

The algorithm, in its purest form, has a very simple implementation (see the sketch at the end of this section).
  • Surrogate loss. The objective function (the expected reward) is modified to a clipped version, L^CLIP.
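
For reference, the clipped surrogate objective from the PPO paper is

$$L^{CLIP}(\theta) = \hat{\mathbb{E}}_t\left[\min\big(r_t(\theta)\,\hat{A}_t,\ \operatorname{clip}(r_t(\theta),\,1-\epsilon,\,1+\epsilon)\,\hat{A}_t\big)\right]$$

where $\hat{A}_t$ is an estimate of the advantage at timestep $t$.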

Note that r_t(theta) is the ratio of the probability of the taken action under the current policy to its probability under the previous policy, and epsilon is a hyperparameter (the paper sets it to 0.2).

The basic idea behind clipping is to prevent a large probability ratio r_t(theta), and thus a large policy update, which can harm training. Clipping restricts the probability ratio to the range [1 - epsilon, 1 + epsilon].
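
A minimal PyTorch sketch of this clipped loss (the function and variable names are mine; a full PPO implementation would also add value-function and entropy terms):

```python
import torch

def ppo_clip_loss(log_probs_new, log_probs_old, advantages, epsilon=0.2):
    """Clipped surrogate loss L^CLIP, returned as a quantity to minimize."""
    # r_t(theta): ratio of action probabilities under the new and old policies.
    ratio = torch.exp(log_probs_new - log_probs_old)
    # Restrict the ratio to [1 - epsilon, 1 + epsilon] (paper default: 0.2).
    clipped_ratio = torch.clamp(ratio, 1.0 - epsilon, 1.0 + epsilon)
    # Take the pessimistic (minimum) of the unclipped and clipped objectives.
    surrogate = torch.min(ratio * advantages, clipped_ratio * advantages)
    return -surrogate.mean()
```

Gradient ascent on the surrogate objective is then just gradient descent on the returned loss.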