Train experts for AIRL environments #68
Conversation
Training right now, will report results when they are available.
Codecov Report
@@            Coverage Diff            @@
##           master      #68     +/-  ##
=========================================
+ Coverage   80.13%   81.33%    +1.2%
=========================================
  Files          46       47       +1
  Lines        2753     3139     +386
=========================================
+ Hits         2206     2553     +347
- Misses        547      586      +39
Continue to review full report at Codecov.
Includes MPI hang fix
Expert results:
We didn't reach expert performance on CartPole (500)! The other tasks look okay AFAIK, without knowing the expert reward for any of them except MountainCar (-110 is considered solved).
Most of these tasks don't have an established threshold for optimal performance, but look in RL papers to see what rewards people typically get; the PPO, SAC and Rainbow papers are good sources. Eyeballing it, Reacher, Hopper, HalfCheetah, Humanoid, Ant and MountainCar all seem to be performing well. CartPole should be able to get to 500, but 450 isn't all that bad. I don't know about the other environments. I'm not familiar with the custom environments, but Justin's paper gives some numbers for a TRPO expert (tables 1 & 2). We seem to do a lot better on
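For reference, here is a minimal sketch of the kind of eyeball check described above, written against the pre-0.26 Gym API that was current at the time; `mean_episode_return` and the random CartPole policy are illustrative only, not the project's actual evaluation code.

```python
import gym
import numpy as np

def mean_episode_return(env_id, policy_fn, n_episodes=10, seed=0):
    """Roll out policy_fn (obs -> action) and return the mean undiscounted episode return."""
    env = gym.make(env_id)
    env.seed(seed)
    returns = []
    for _ in range(n_episodes):
        obs, done, ep_return = env.reset(), False, 0.0
        while not done:
            obs, reward, done, _ = env.step(policy_fn(obs))
            ep_return += reward
        returns.append(ep_return)
    return float(np.mean(returns))

# Example: a random policy on CartPole-v1; a true expert should hit the 500-step cap.
cartpole = gym.make("CartPole-v1")
print(mean_episode_return("CartPole-v1", lambda obs: cartpole.action_space.sample()))
```

Comparing numbers obtained this way against the reward curves in the PPO/SAC/Rainbow papers is the kind of sanity check meant above.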
Eyeballed some numbers and noted version changes. For these MuJoCo environments, the difference between -v1 and -v2 should be insignificant. (See openai/gym#834)
Maybe (I don't have a good prior on how much the MuJoCo change matters). I don't see why the episode length would change if I ported the code correctly. If the disabled_ant reward discrepancies seem important to look over later, I can carefully read through #55 to make sure that I didn't accidentally make a change to the environment when porting it over.
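A quick way to spot an accidental change when porting an environment is to print its registered properties and compare them side by side with what the original codebase reports. This is only a sketch using the classic Gym API; the ported `airl/DisabledAnt-v0` ID would additionally need whatever module registers the custom envs to be imported first, which is not shown here.

```python
import gym

def describe_env(env_id):
    """Print registered properties worth diffing against the original codebase."""
    env = gym.make(env_id)
    spec = gym.spec(env_id)
    print(env_id)
    print("  observation space:", env.observation_space)
    print("  action space:     ", env.action_space)
    print("  time limit:       ", spec.max_episode_steps)

describe_env("Ant-v2")  # the same check would apply to the ported 'airl/DisabledAnt-v0'
```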
I preemptively committed
Have we replicated the disabled ant experiment using Justin's codebase? If not, we shouldn't assume that the environment (and in particular the episode length) in that code and the paper is the same. A 2x difference feels significant to me, so it's worth quickly running the code in Justin's codebase: specifically, https://github.com/AdamGleave/inverse_rl/blob/master/scripts/ant_data_collect.py but with the environment changed to 'airl/DisabledAnt-v0', to see what expert performance we get with TRPO in the original codebase.
This seems reasonable. The differences aren't huge, nor are they always in one direction, so it seems plausibly down to differently tuned hyperparameters and random seeds. If we were going to do a publication-quality benchmark of different IRL methods, we might want to push more on getting the best possible experts, but for testing the correctness of our implementation of IRL methods, I think the current experts are fine.
Thanks, I'd seen this a while ago but forgot about it; it does seem like the MuJoCo version difference shouldn't matter that much.
Mostly looks good, only minor changes needed.
Looking into this now. Edit:
LGTM
DisabledAnt: For the DisabledAnt experiment (legacy TRPO, 1 seed) I get around 390 Average Return. The paper reports 315.5.
CustomAnt: For the CustomAnt experiment (legacy TRPO, 1 seed) I get around 800 Average Return. The paper reports 1537.9 Average Return. My PPO (best of 3) reports 1960 Average Return, more than a 2x difference.
These are substantial differences. It could just be a PPO vs TRPO difference: the PPO paper did report quite substantial improvements on some MuJoCo environments (e.g. HalfCheetah, Reacher) while others were comparable (e.g. Hopper, Swimmer, InvertedPendulum). Do we definitely have a consistent episode length (it should be reported somewhere...)? In https://github.com/shwang/inverse_rl/blob/master/scripts/ant_data_collect.py the
Ah, thanks for catching that! That would definitely make a ~2x change.
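For the record, a quick way to confirm the registered horizons is to read them straight from the env specs; a 500- versus 1000-step horizon alone would roughly account for a 2x gap in undiscounted return. A sketch using the classic Gym API (the custom `airl/*` IDs would additionally need their registration module imported first):

```python
import gym

# Print the registered time limit and Gym's nominal reward threshold (where defined)
# for the standard MuJoCo tasks, to compare against whatever Justin's script uses.
for env_id in ["Ant-v2", "Hopper-v2", "HalfCheetah-v2", "Reacher-v2"]:
    spec = gym.spec(env_id)
    print(env_id, "time limit:", spec.max_episode_steps,
          "reward threshold:", spec.reward_threshold)
```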
For AIRL custom envs that were reported in the paper:
experiments/mujoco_experts.sh