Precompute log probability in PPO #430
Conversation
Seems like this PR (PPO ParallelLink Precompute) further improves PPO from #427.

Before this PR: [training-curve plot]

After this PR: [training-curve plot]

You can see from the plots that the scores improve.
After the changes in #427 (stopping the scaled weight initialization at the value function and the decaying clip eps), I conducted the comparison again, with 10 random seeds each. I confirmed again that this PR improves performance.
LGTM
Merge #427 before this PR.

Before this PR, for every minibatch (e.g. 64 transitions) in a batch (e.g. 2048 transitions) used for an iteration of PPO, both the current and old models are evaluated to compute the probability ratio. This PR instead pre-computes the outputs from the old model and evaluates only the current model for each minibatch. The reasons behind this PR are:

- `obs_normalizer` is updated after collecting a batch of transitions and before updating the policy. To correctly implement importance sampling, the policy used for collecting transitions should be used to compute importance weights. Thus, it is better to compute the probabilities before updating `obs_normalizer`, which can affect the policy. https://github.com/chainer/chainerrl/blob/master/chainerrl/agents/ppo.py#L225
- OpenAI Baselines' PPO2 similarly precomputes the old log probabilities (`neglogpacs`) when collecting rollouts: https://github.com/openai/baselines/blob/master/baselines/ppo2/runner.py#L33
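To make the change concrete, here is a minimal sketch, assuming a hypothetical interface in which `policy(obs)` returns a distribution exposing `log_prob(actions)` and `obs_normalizer` is callable with an `update()` method (these names are illustrative, not ChainerRL's exact API). The old log probabilities are computed once per batch, before the normalizer is updated, and only the current model is evaluated per minibatch:

```python
import numpy as np

def ppo_update_with_precomputed_logp(policy, batch, obs_normalizer,
                                     epochs=10, minibatch_size=64,
                                     clip_eps=0.2):
    """Illustrative sketch only, not ChainerRL's implementation."""
    obs = np.stack([t['obs'] for t in batch])
    actions = np.stack([t['action'] for t in batch])
    advantages = np.array([t['advantage'] for t in batch])

    # 1. Precompute log pi_old(a|s) once for the whole batch, using the
    #    normalizer state that was in effect when the data was collected.
    old_log_probs = policy(obs_normalizer(obs)).log_prob(actions)

    # 2. Only now update the observation normalizer; it can no longer
    #    distort the importance weights.
    obs_normalizer.update(obs)

    # 3. Per minibatch, evaluate only the current model and reuse the
    #    precomputed old log probabilities.
    n = len(advantages)
    for _ in range(epochs):
        for idx in np.array_split(np.random.permutation(n),
                                  max(n // minibatch_size, 1)):
            log_probs = policy(obs_normalizer(obs[idx])).log_prob(actions[idx])
            ratio = np.exp(log_probs - old_log_probs[idx])
            surrogate = np.minimum(
                ratio * advantages[idx],
                np.clip(ratio, 1 - clip_eps, 1 + clip_eps) * advantages[idx])
            loss = -np.mean(surrogate)
            # ... backprop `loss` and step the optimizer here (framework-specific)
```

Because `old_log_probs` is computed before `obs_normalizer.update`, the importance weights reflect the policy that actually collected the transitions.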
This PR also adds `n_updates` to PPO's statistics to help debugging.
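As a usage note (the exact set of entries returned by PPO's statistics may differ by version), the new counter would appear alongside the agent's other reported statistics:

```python
# Hypothetical inspection of a ChainerRL PPO agent's statistics.
# `agent` is assumed to be a chainerrl.agents.PPO instance that has
# already performed some updates.
for name, value in agent.get_statistics():
    print(name, value)
# After this PR, the output is expected to include an ('n_updates', <int>) entry.
```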
Todos