Precompute log probability in PPO #430
Conversation
Seems like this PR (PPO ParallelLink Precompute) further improves PPO from #427.

Before this PR: [training-curve plot]

After this PR: [training-curve plot]

You can see from the plots that the scores improve.
After the changes in #427 (stopping the scaled weight initialization at the value function and the decaying clip eps), I conducted the comparison again, with 10 random seeds each. I confirmed again that this PR improves performance.
LGTM
Merge #427 before this PR.

Before this PR, for every minibatch (e.g. 64 transitions) in a batch (e.g. 2048 transitions) used for an iteration of PPO, both the current and old models are evaluated to compute the probability ratio. This PR instead pre-computes the outputs from the old model and evaluates only the current model for each minibatch. The reasons behind this PR are:

- `obs_normalizer` is updated after collecting a batch of transitions and before updating the policy. To correctly implement importance sampling, the policy used for collecting transitions should be used to compute importance weights. Thus, it is better to compute the probabilities before updating `obs_normalizer`, which can affect the policy. https://github.com/chainer/chainerrl/blob/master/chainerrl/agents/ppo.py#L225
- OpenAI Baselines' PPO2 similarly precomputes the old log probabilities (`neglogpacs`) when collecting rollouts: https://github.com/openai/baselines/blob/master/baselines/ppo2/runner.py#L33
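To make the change concrete, here is a minimal sketch, assuming a hypothetical interface in which `policy(obs)` returns a distribution exposing `log_prob(actions)` and `obs_normalizer` is callable with an `update()` method (these names are illustrative, not ChainerRL's exact API). The old log probabilities are computed once per batch, before the normalizer is updated, and only the current model is evaluated per minibatch:

```python
import numpy as np

def ppo_update_with_precomputed_logp(policy, batch, obs_normalizer,
                                     epochs=10, minibatch_size=64,
                                     clip_eps=0.2):
    """Illustrative sketch only, not ChainerRL's implementation."""
    obs = np.stack([t['obs'] for t in batch])
    actions = np.stack([t['action'] for t in batch])
    advantages = np.array([t['advantage'] for t in batch])

    # 1. Precompute log pi_old(a|s) once for the whole batch, using the
    #    normalizer state that was in effect when the data was collected.
    old_log_probs = policy(obs_normalizer(obs)).log_prob(actions)

    # 2. Only now update the observation normalizer; it can no longer
    #    distort the importance weights.
    obs_normalizer.update(obs)

    # 3. Per minibatch, evaluate only the current model and reuse the
    #    precomputed old log probabilities.
    n = len(advantages)
    for _ in range(epochs):
        for idx in np.array_split(np.random.permutation(n),
                                  max(n // minibatch_size, 1)):
            log_probs = policy(obs_normalizer(obs[idx])).log_prob(actions[idx])
            ratio = np.exp(log_probs - old_log_probs[idx])
            surrogate = np.minimum(
                ratio * advantages[idx],
                np.clip(ratio, 1 - clip_eps, 1 + clip_eps) * advantages[idx])
            loss = -np.mean(surrogate)
            # ... backprop `loss` and step the optimizer here (framework-specific)
```

Because `old_log_probs` is computed before `obs_normalizer.update`, the importance weights reflect the policy that actually collected the transitions.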
This PR also adds `n_updates` to PPO's statistics to help debugging.
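As a usage note (the exact set of entries returned by PPO's statistics may differ by version), the new counter would appear alongside the agent's other reported statistics:

```python
# Hypothetical inspection of a ChainerRL PPO agent's statistics.
# `agent` is assumed to be a chainerrl.agents.PPO instance that has
# already performed some updates.
for name, value in agent.get_statistics():
    print(name, value)
# After this PR, the output is expected to include an ('n_updates', <int>) entry.
```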
Todos