Train experts for AIRL environments #68

Merged: 15 commits merged into master on Aug 15, 2019

Conversation

@shwang shwang commented Jul 31, 2019

For the AIRL custom envs reported in the paper:

  • Add named configs (see the sketch below)
  • Include them in experiments/mujoco_experts.sh
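
(Not the actual config code from this PR, just a rough sketch of what a Sacred named config for one of these envs might look like. The experiment object name `data_collect_ex`, the `airl/CustomAnt-v0` Gym ID, and the timestep budget are illustrative assumptions; `airl/DisabledAnt-v0` is the ID mentioned later in this thread.)

```python
import sacred

# Illustrative stand-in for the experiment defined in
# src/imitation/scripts/config/data_collect.py (the real name may differ).
data_collect_ex = sacred.Experiment("data_collect")


@data_collect_ex.named_config
def custom_ant():
    # Hypothetical Gym ID for the ported AIRL CustomAnt environment.
    env_name = "airl/CustomAnt-v0"
    total_timesteps = int(2e6)  # illustrative training budget


@data_collect_ex.named_config
def disabled_ant():
    env_name = "airl/DisabledAnt-v0"  # ID mentioned later in this thread
    total_timesteps = int(2e6)
```

Selecting an env at the command line would then use Sacred's `with <named_config>` syntax, e.g. `python -m imitation.scripts.data_collect with disabled_ant` (the module path here is again an assumption).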

shwang commented Jul 31, 2019

Training right now, will report results when they are available.

codecov bot commented Jul 31, 2019

Codecov Report

Merging #68 into master will increase coverage by 1.2%.
The diff coverage is 61.29%.

@@            Coverage Diff            @@
##           master      #68     +/-   ##
=========================================
+ Coverage   80.13%   81.33%   +1.2%     
=========================================
  Files          46       47      +1     
  Lines        2753     3139    +386     
=========================================
+ Hits         2206     2553    +347     
- Misses        547      586     +39
Impacted Files Coverage Δ
tests/test_policies.py 100% <100%> (ø) ⬆️
src/imitation/policies/serialize.py 96.66% <100%> (+0.23%) ⬆️
src/imitation/util/registry.py 100% <100%> (ø)
src/imitation/scripts/config/data_collect.py 68.33% <41.17%> (-2.26%) ⬇️
src/imitation/scripts/config/train.py 65.17% <44%> (-6.96%) ⬇️
tests/test_scripts.py 100% <0%> (ø) ⬆️
tests/test_trainer.py 100% <0%> (ø) ⬆️
src/imitation/util/reward_wrapper.py 97.61% <0%> (+0.39%) ⬆️
... and 5 more

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Last update 16fe37e...9a16e3f.

shwang commented Aug 6, 2019

Expert results:

(imit) steven@astar:~/imitation/output/mujoco_experts/2019-07-31T16:41:33-07:00$ find . -name stdout | xargs tail -n 15 | grep -E '(==|ep_reward_mean)'
==> ./parallel/env/acrobot/seed/0/stdout <==
| ep_reward_mean     | -75.1         |
==> ./parallel/env/acrobot/seed/2/stdout <==
| ep_reward_mean     | -71.2         |
==> ./parallel/env/acrobot/seed/1/stdout <==
| ep_reward_mean     | -72.8         |

==> ./parallel/env/reacher/seed/0/stdout <==
| ep_reward_mean     | -6.04        |
==> ./parallel/env/reacher/seed/2/stdout <==
| ep_reward_mean     | -6.27          |
==> ./parallel/env/reacher/seed/1/stdout <==
| ep_reward_mean     | -6.02        |

==> ./parallel/env/swimmer/seed/0/stdout <==
| ep_reward_mean     | 43.2          |
==> ./parallel/env/swimmer/seed/2/stdout <==
| ep_reward_mean     | 48.5          |
==> ./parallel/env/swimmer/seed/1/stdout <==
| ep_reward_mean     | 52.6         |

==> ./parallel/env/cartpole/seed/0/stdout <==
| ep_reward_mean     | 480           |
==> ./parallel/env/cartpole/seed/2/stdout <==
| ep_reward_mean     | 386           |
==> ./parallel/env/cartpole/seed/1/stdout <==
| ep_reward_mean     | 443           |

==> ./parallel/env/hopper/seed/0/stdout <==
| ep_reward_mean     | 1.71e+03       |
==> ./parallel/env/hopper/seed/2/stdout <==
| ep_reward_mean     | 1.87e+03     |
==> ./parallel/env/hopper/seed/1/stdout <==
| ep_reward_mean     | 1.87e+03     |

==> ./parallel/env/disabled_ant/seed/0/stdout <==
| ep_reward_mean     | 884          |
==> ./parallel/env/disabled_ant/seed/2/stdout <==
| ep_reward_mean     | 885           |
==> ./parallel/env/disabled_ant/seed/1/stdout <==
| ep_reward_mean     | 805          |

==> ./parallel/env/humanoid/seed/0/stdout <==
| ep_reward_mean     | 4.22e+03     |
==> ./parallel/env/humanoid/seed/2/stdout <==
| ep_reward_mean     | 3.49e+03     |
==> ./parallel/env/humanoid/seed/1/stdout <==
| ep_reward_mean     | 2.53e+03    |

==> ./parallel/env/mountain_car/seed/0/stdout <==
| ep_reward_mean     | -101          |
==> ./parallel/env/mountain_car/seed/2/stdout <==
| ep_reward_mean     | -100          |
==> ./parallel/env/mountain_car/seed/1/stdout <==
| ep_reward_mean     | -97.6         |

==> ./parallel/env/walker/seed/0/stdout <==
| ep_reward_mean     | 3.7e+03      |
==> ./parallel/env/walker/seed/2/stdout <==
| ep_reward_mean     | 2.89e+03     |
==> ./parallel/env/walker/seed/1/stdout <==
| ep_reward_mean     | 3.94e+03    |

==> ./parallel/env/two_d_maze/seed/0/stdout <==
| ep_reward_mean     | -5.02         |
==> ./parallel/env/two_d_maze/seed/2/stdout <==
| ep_reward_mean     | -40.8       |
==> ./parallel/env/two_d_maze/seed/1/stdout <==
| ep_reward_mean     | -41.7       |

==> ./parallel/env/ant/seed/0/stdout <==
| ep_reward_mean     | 4.09e+03     |
==> ./parallel/env/ant/seed/2/stdout <==
| ep_reward_mean     | 3.22e+03     |
==> ./parallel/env/ant/seed/1/stdout <==
| ep_reward_mean     | 2.53e+03     |

==> ./parallel/env/half_cheetah/seed/0/stdout <==
| ep_reward_mean     | 4.86e+03     |
==> ./parallel/env/half_cheetah/seed/2/stdout <==
| ep_reward_mean     | 2.41e+03     |
==> ./parallel/env/half_cheetah/seed/1/stdout <==
| ep_reward_mean     | 3.82e+03     |

==> ./parallel/env/custom_ant/seed/0/stdout <==
| ep_reward_mean     | 1.83e+03     |
==> ./parallel/env/custom_ant/seed/2/stdout <==
| ep_reward_mean     | 1.94e+03     |
==> ./parallel/env/custom_ant/seed/1/stdout <==
| ep_reward_mean     | 1.96e+03     |

shwang commented Aug 6, 2019

We didn't reach expert on CartPole (500)! The other tasks look okay AFAIK, though I don't know the expert reward for any of them except MountainCar (-110 is considered solved).

@AdamGleave

Most of these tasks don't have an established threshold for optimal performance, but look in RL papers to see what rewards people typically get. The PPO, SAC and Rainbow papers are good sources.

Eyeballing it, Reacher, Hopper, HalfCheetah, Humanoid, Ant and MountainCar all seem to be performing well. CartPole should be able to get to 500, but 450 isn't all that bad. I don't know about the other environments.

I'm not familiar with the custom environments, but Justin's paper gives some numbers for a TRPO expert (tables 1 & 2). We seem to do a lot better on disabled_ant (880 vs 315). custom_ant seems closer: we get ~1.8k and they report ~1.5k. I'd expect PPO to do a little better than TRPO, but the difference on disabled_ant seems extreme. It's possible they had different episode lengths, or that the MuJoCo version change is really significant.

shwang commented Aug 13, 2019

Eyeballed some numbers and noted version changes:

[image: table of eyeballed expert returns and environment version changes]

For these mujoco environments, the difference between -v1 and -v2 should be insignificant. (See openai/gym#834)

edit: Adding this side-by-side table
[image: side-by-side comparison table]

shwang commented Aug 13, 2019

[image: our best-of-three expert returns compared with previously reported PPO means]

Our best results out of 3 do better than some of their reported PPO means but worse on Humanoid and Hopper. I'm not going to worry about our experts not matching previously reported experts for now -- just noting in case we want to think about this later on.

shwang commented Aug 13, 2019

but the difference in disabled_ant seems extreme. Possible they had different length episodes, or that the MuJoCo version change is really significant.

Maybe (I don't have a good prior on how much the MuJoCo change matters). I don't see why the episode length would change if I ported the code correctly.

If the disabled_ant reward discrepancy seems important to look over later, I can carefully read through #55 to make sure that I didn't accidentally change the environment when porting it over.

@shwang shwang changed the title AIRL data collect Train experts for airl environments Aug 13, 2019
@shwang shwang changed the title Train experts for airl environments Train experts for AIRL environments Aug 13, 2019

shwang commented Aug 13, 2019

I preemptively committed registry.load into this PR because it was helpful for handling a PPO1 failure when mpi4py isn't installed.
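
(For context, a minimal sketch of the lazy-loading pattern this enables, not the repo's actual registry.load code; the `load_attr` name and the "module:attribute" string convention here are illustrative. The point is that the import only happens when an entry is resolved, so a missing mpi4py only bites if the PPO1 entry is actually requested.)

```python
import importlib
from typing import Any


def load_attr(name: str) -> Any:
    """Look up an attribute given a "module:attribute" string (illustrative helper)."""
    module_name, attr_name = name.split(":")
    # The import is deferred to call time, so registering an entry whose module
    # needs mpi4py costs nothing unless that entry is actually loaded.
    module = importlib.import_module(module_name)
    return getattr(module, attr_name)


# Usage: entries stay plain strings until resolved, so merely registering
# "stable_baselines:PPO1" is free; any ImportError (e.g. a missing mpi4py)
# only surfaces when that particular entry is loaded.
ppo2_cls = load_attr("stable_baselines:PPO2")
```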

@shwang shwang requested a review from AdamGleave August 13, 2019 23:18
@AdamGleave

Maybe (I don't have a good prior on how much the MuJoCo change matters). I don't see why the episode length would change if I ported the code correctly.

If the disabled_ant reward discrepancy seems important to look over later, I can carefully read through #55 to make sure that I didn't accidentally change the environment when porting it over.

Have we replicated the disabled ant experiment using Justin's codebase? If not, we shouldn't assume that the environment (and in particular episode length) in that code and the paper is the same.

A 2x difference feels significant to me, so it's worth quickly checking against Justin's codebase. Specifically, it seems worth running https://github.com/AdamGleave/inverse_rl/blob/master/scripts/ant_data_collect.py with the environment changed to 'airl/DisabledAnt-v0' to see what expert performance we get with TRPO in the original codebase.

Our best results out of 3 do better than some of their reported PPO means but worse on Humanoid and Hopper. I'm not going to worry about our experts not matching previously reported experts for now -- just noting in case we want to think about this later on.

This seems reasonable. The differences aren't huge, nor are they always in one direction, so this is plausibly down to differently tuned hyperparameters and random seeds. If we were going to do a publication-quality benchmark of different IRL methods we might want to push more on getting the best possible experts, but for testing the correctness of our IRL implementations, I think the current experts are fine.

For these mujoco environments, the difference between -v1 and -v2 should be insignificant. (See openai/gym#834)

Thanks, I'd seen this a while ago but forgot about it; it does seem like the MuJoCo version difference shouldn't matter that much.

@AdamGleave left a comment

Mostly looks good, only minor changes needed.

Review threads on:
  • src/imitation/policies/serialize.py (resolved)
  • src/imitation/scripts/config/data_collect.py (outdated, resolved)
  • src/imitation/scripts/config/train.py (outdated, resolved)
  • tests/test_policies.py (resolved)
@shwang shwang requested a review from AdamGleave August 15, 2019 05:02
@shwang
Copy link
Member Author

shwang commented Aug 15, 2019

Have we replicated the disabled ant experiment using Justin's codebase? If not, we shouldn't assume that the environment (and in particular episode length) in that code and the paper is the same.

Looking into this now

edit: astar is down, and I couldn't figure out how to install rllab+inverse_rl from scratch on perceptron. Will try again tomorrow.

@AdamGleave left a comment

LGTM

@shwang shwang merged commit 98a7929 into master Aug 15, 2019
@shwang shwang deleted the airl_data_collect branch August 15, 2019 19:01

shwang commented Aug 19, 2019

DisabledAnt

For the DisabledAnt experiment (legacy TRPO, 1 seed) I get around 390 Average Return.

2019-08-17 13:25:20.114989 PDT | AverageReturn              380.602
2019-08-17 13:25:33.891566 PDT | AverageReturn              391.266
2019-08-17 13:26:17.628041 PDT | AverageReturn              396.033
2019-08-17 13:26:20.762647 PDT | AverageReturn              390.332
2019-08-17 13:26:29.459419 PDT | AverageReturn              379.702
2019-08-17 13:26:42.913788 PDT | AverageReturn              396.057

The paper reports 315.5.
Our PPO (best of 3) reports 880, more than a 2x change.

CustomAnt

For the CustomAnt experiment (legacy TRPO, 1 seed) I get around 800 Average Return.

The paper reports 1537.9 Average Return.

My PPO (best of 3) reports 1960 Average Return, again more than a 2x change.

@AdamGleave

These are substantial differences. It could just be a PPO vs TRPO difference: the PPO paper did report quite substantial improvements on some MuJoCo environments (e.g. HalfCheetah, Reacher) while others were comparable (e.g. Hopper, Swimmer, InvertedPendulum).

Do we definitely have a consistent episode length (it should be reported somewhere...)? In https://github.com/shwang/inverse_rl/blob/master/scripts/ant_data_collect.py the max_path_length is set to 500, whereas CustomAntEnv in https://github.com/HumanCompatibleAI/imitation/blob/master/src/imitation/envs/examples/airl_envs/ant_env.py#L546 has max_timesteps set to 1000. So if you're not correcting for this, we will get around a 2x better episode reward.
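
(To make the horizon correction concrete, a toy back-of-the-envelope check, assuming per-step reward is roughly constant over an episode; the returns are the CustomAnt numbers from this thread and the horizons are the 500 vs 1000 limits noted above.)

```python
# Legacy TRPO run (max_path_length=500) vs our PPO run (max_timesteps=1000).
trpo_return, trpo_horizon = 800.0, 500
ppo_return, ppo_horizon = 1960.0, 1000

trpo_per_step = trpo_return / trpo_horizon  # ~1.6 reward per timestep
ppo_per_step = ppo_return / ppo_horizon     # ~1.96 reward per timestep

# After normalizing by episode length the gap shrinks to roughly 1.2x,
# which is much easier to attribute to PPO-vs-TRPO and tuning differences.
print(f"per-step ratio: {ppo_per_step / trpo_per_step:.2f}x")
```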

shwang commented Aug 20, 2019

Ah, thanks for catching that! That would definitely make a ~2x change.
