Description
As I was testing MPO on the cartpole environment, I noticed the algorithm was pretty unstable and had trouble stabilizing at the 200-return policy. I eventually thought about the tanh `normalizer` that comes by default with `GaussianNetwork` (remember, it used to be "mandatory" before #592). To be honest, I did break things in #592 by moving the normalization, but it was incorrect before anyway.
However, doing that makes learning unstable because the logpdf computed during training is no longer correct: the action actually played is $a = \tanh(z)$ with $z \sim \mathcal{N}(\mu, \sigma)$, which is not itself normally distributed, so its density needs the change-of-variables correction

$$\log \pi(a) = \log \mathcal{N}(z; \mu, \sigma) - \log\left(1 - \tanh^2(z)\right),$$

where $\mathcal{N}$ denotes the normal distribution.
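Concretely (a minimal sketch with Distributions.jl, not the actual `GaussianNetwork` code; the variable names and the ϵ are mine), evaluating the normal logpdf at the squashed action is not the same as the logpdf of the squashed action's distribution:

```julia
using Distributions

μ, σ = 0.3f0, 0.5f0
d = Normal(μ, σ)

z = rand(d)   # pre-squash sample, z ~ N(μ, σ)
a = tanh(z)   # squashed action in (-1, 1)

# What effectively gets evaluated when the normal logpdf is applied
# to the already-squashed action:
logp_wrong = logpdf(d, a)

# Density of the squashed action, with the Jacobian correction
# (the small ϵ is only there for numerical stability):
logp_right = logpdf(d, z) - log(1 - a^2 + 1f-6)
```

The second form is the usual squashed-Gaussian correction (the same one SAC-style policies use).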
Here's a comparison of two runs on cartpole with MPO (sorry the REPL messed up the plots):
The top one is when tanh is applied to actions at sample time; the bottom one is when tanh is applied to the actions at interaction time (by wrapping the environment in an `ActionTransformedEnv`). While the difference is not staggering, it does improve the stability of convergence, and the effect might be more pronounced for tasks more complex than cartpole.
So I'd like to make a case to simply remove `normalizer` from `GaussianNetwork`:
- It improves stability.
- You can always recover the old behavior by adding the tanh activation to your neural net's output layer.
- It's mathematically more correct.
- You can normalize using `ActionTransformedEnv` (see the sketch after this list).
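For reference, here's a rough sketch of the two alternatives, assuming the usual Flux / ReinforcementLearning.jl constructors (the layer sizes, the continuous cartpole constructor, and the exact keyword names are illustrative and may differ from the current release):

```julia
using Flux, ReinforcementLearning

ns, na = 4, 1  # cartpole-like dimensions, for illustration only

# Option 1: recover the old squashing inside the network by putting
# tanh on the μ head's output layer.
policy_net = GaussianNetwork(
    pre  = Dense(ns => 64, relu),
    μ    = Dense(64 => na, tanh),   # tanh as the output-layer activation
    logσ = Dense(64 => na),
)

# Option 2: keep the network unbounded and squash at interaction time
# by wrapping the environment.
env = CartPoleEnv(continuous = true)  # assuming the continuous variant
wrapped_env = ActionTransformedEnv(
    env;
    action_mapping = a -> tanh.(a),
)
```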
This is technically a "breaking" change, though I'd call it a "repairing" change, and I think it's one that should be made.
If you agree with me, I can incorporate that in #604.