
tanh normalization destabilizes learning with GaussianNetwork #745

Closed
@HenriDeh

Description


As I was testing MPO on the CartPole environment, I noticed the algorithm was pretty unstable and had trouble settling on the 200-return policy. I eventually thought about the tanh normalizer that comes by default with GaussianNetwork (remember, it used to be "mandatory" before #592). To be honest, I did break things in #592 by moving the normalization, but it was incorrect before anyway.

However, applying tanh to the sampled actions makes learning unstable, because the logpdf computed during training is no longer correct, since

$$\mathcal{N}(\tanh(y) \mid \mu, \sigma) \neq \mathcal{N}(y \mid \tanh(\mu), \sigma),$$

where $\mathcal{N}$ denotes the normal distribution.
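
To spell out the change of variables: if $y \sim \mathcal{N}(\mu, \sigma)$ and the executed action is $a = \tanh(y)$, the valid log-density of $a$ is

$$\log p(a) = \log \mathcal{N}(y \mid \mu, \sigma) - \log\left(1 - \tanh(y)^2\right)$$

per action dimension, so evaluating a plain Gaussian logpdf on squashed quantities misses the Jacobian term entirely.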

Here's a comparison of two runs on CartPole with MPO (sorry, the REPL messed up the plots):

[plot: episode returns of the two runs]

The top run is the one where tanh is applied to the actions at sample time; the bottom run is the one where tanh is applied at interaction time (by wrapping the environment in an ActionTransformedEnv). While the difference is not staggering, the latter does improve the stability of convergence. The effect might be more pronounced for tasks more complex than CartPole.

So I'd like to make the case for simply removing the normalizer from GaussianNetwork:

  • It improves stability.
  • You can always recover the old behavior by adding a tanh activation to your neural net's output layer (see the sketch after this list).
  • It's mathematically more correct.
  • You can still normalize at interaction time by wrapping the environment in an ActionTransformedEnv (also sketched below).
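
To make the last two points concrete, here is a minimal sketch of both alternatives. It assumes the keyword constructor of GaussianNetwork used in the SAC experiments (pre/μ/logσ) and the ActionTransformedEnv wrapper; the environment constructor, layer sizes, and variable names are only illustrative:

```julia
using ReinforcementLearning, Flux

env = CartPoleEnv(continuous = true)   # illustrative; any continuous-action env works
ns, na = length(state(env)), 1

# Option 1: recover the old squashing behavior by putting tanh on the mean head
# of the network itself, instead of relying on GaussianNetwork's default normalizer.
approximator = GaussianNetwork(
    pre  = Chain(Dense(ns, 64, relu), Dense(64, 64, relu)),
    μ    = Chain(Dense(64, na, tanh)),   # tanh on the output layer
    logσ = Chain(Dense(64, na)),
)

# Option 2: keep the policy output unbounded and squash only at interaction time
# by wrapping the environment.
wrapped_env = ActionTransformedEnv(env; action_mapping = a -> tanh.(a))
```

With option 2 the actions stored in the trajectory stay unsquashed, so the Gaussian logpdf used during training is evaluated on exactly what was sampled.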

This is technically a "breaking" change, though I'd call it a "repairing" change; either way, I think it's a change that should be made.
If you agree with me, I can incorporate it into #604.
