Given how popular this repo is (and rightly so), I was thinking it might be a good idea to implement some simple tricks that have been shown to improve performance with on-policy RL algorithms. I'm thinking mostly about this paper: https://arxiv.org/pdf/2006.05990.pdf, where they do a large-scale study of all of the little decisions that can make a big difference in performance.
I haven't run extensive experiments, but I've implemented a couple of the things they mention and they do seem to significantly boost performance. In particular, modifying the code so that the advantages are recomputed at every epoch of the update, as they recommend, does seem to improve performance. An even simpler change to the initialisation seems to make an even bigger difference: for continuous control, initialising the action std so that it starts at 0.5 for each dimension, and multiplying the weights of the policy's output layer by 0.01 at the start (there are a lot of other things they discuss in that paper too).
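For reference, a minimal sketch of the initialisation tweak, assuming a Gaussian policy with a state-independent log-std and an MLP that outputs the action mean. The class name, layer sizes, and structure here are my own illustration, not this repo's actual code:

```python
import math
import torch
import torch.nn as nn

class GaussianPolicy(nn.Module):
    """Illustrative continuous-control policy with the two init tweaks."""
    def __init__(self, obs_dim, act_dim, hidden=64):
        super().__init__()
        self.mean_net = nn.Sequential(
            nn.Linear(obs_dim, hidden), nn.Tanh(),
            nn.Linear(hidden, hidden), nn.Tanh(),
            nn.Linear(hidden, act_dim),
        )
        # Tweak 1: start with std = 0.5 in every action dimension.
        self.log_std = nn.Parameter(torch.full((act_dim,), math.log(0.5)))
        # Tweak 2: shrink the output layer so the initial action mean is near zero.
        last = self.mean_net[-1]
        with torch.no_grad():
            last.weight.mul_(0.01)
            last.bias.zero_()

    def forward(self, obs):
        mean = self.mean_net(obs)
        std = self.log_std.exp().expand_as(mean)
        return torch.distributions.Normal(mean, std)
```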
@henrycharlesworth I have tried a number of the suggestions proposed in the paper you mentioned (ablation studies suggest some of them are useful, some are not, at least for now) and implemented the "recompute advantage" strategy, which is indeed helpful. It is included in my MuJoCo benchmark here; check out the details if you are interested.
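To make the "recompute advantage" change concrete, here is a minimal sketch of a PPO update loop that re-runs GAE with the current value network at the start of every epoch, instead of only once after data collection. The function names (`compute_gae`, `ppo_update`), the batch layout, and the single optimizer over both networks are assumptions for illustration, not the API of this repo or of the linked benchmark:

```python
import torch

def compute_gae(rewards, values, dones, gamma=0.99, lam=0.95):
    # rewards, dones: length T; values: length T+1 (bootstrap value appended).
    advantages = torch.zeros_like(rewards)
    gae = 0.0
    for t in reversed(range(len(rewards))):
        not_done = 1.0 - dones[t]
        delta = rewards[t] + gamma * values[t + 1] * not_done - values[t]
        gae = delta + gamma * lam * not_done * gae
        advantages[t] = gae
    return advantages

def ppo_update(policy, value_net, optimizer, batch, epochs=10, clip_eps=0.2):
    # `optimizer` is assumed to cover both policy and value_net parameters.
    obs, next_obs, actions, rewards, dones, old_log_probs = batch
    for _ in range(epochs):
        with torch.no_grad():
            # Re-evaluate values with the *current* value net each epoch,
            # so the advantages track the updated critic.
            values = value_net(obs).squeeze(-1)
            bootstrap = value_net(next_obs[-1:]).squeeze(-1)
            advantages = compute_gae(rewards, torch.cat([values, bootstrap]), dones)
            returns = advantages + values
            advantages = (advantages - advantages.mean()) / (advantages.std() + 1e-8)
        dist = policy(obs)
        log_probs = dist.log_prob(actions).sum(-1)
        ratio = (log_probs - old_log_probs).exp()
        clipped = torch.clamp(ratio, 1 - clip_eps, 1 + clip_eps) * advantages
        policy_loss = -torch.min(ratio * advantages, clipped).mean()
        value_loss = (value_net(obs).squeeze(-1) - returns).pow(2).mean()
        optimizer.zero_grad()
        (policy_loss + 0.5 * value_loss).backward()
        optimizer.step()
```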