
I couldn't get good results for GAIL in any environment except HalfCheetah. #204

Open
slee01 opened this issue Aug 30, 2019 · 3 comments

@slee01

slee01 commented Aug 30, 2019

Hi, first of all, thank you for sharing your code.

I've been trying to implement GAIL using the expert demonstrations from your Google Drive. I used the hyper-parameters from gail_experts/readme and got a good result on HalfCheetah, but worse results than I expected on the other environments, such as Hopper, Ant, and Walker2d (I couldn't test Reacher; I suspect the expert data, which is only 240 KB, has some problem). I tried again with different hyper-parameters, including the seed, but unfortunately still got the same results. Could you share the parameters you used for the environments where I failed? It would help the comparison tests for my research a lot.

@ikostrikov
Owner

For the moment, the easiest way to fix the problem is to change the reward function and turn normalization off:
https://github.com/ikostrikov/pytorch-a2c-ppo-acktr-gail/blob/master/a2c_ppo_acktr/algo/gail.py#L98

See the comments here:
https://github.com/openai/imitation/blob/99fbccf3e060b6e6c739bdf209758620fcdefd3c/policyopt/imitation.py#L146

You need to use this reward specifically:

rewards_B = -tensor.log(1.-tensor.nnet.sigmoid(scores_B))
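For reference, a minimal PyTorch translation of that line (a sketch only; the name gail_reward and the argument d_logits are illustrative, not the repository's exact API). Note that -log(1 - sigmoid(d)) is mathematically the same as softplus(d):

import torch.nn.functional as F

def gail_reward(d_logits):
    # Suggested reward: r = -log(1 - sigmoid(d)) == softplus(d),
    # where d_logits are the raw (pre-sigmoid) discriminator scores
    # for the policy's state-action pairs. The reward is always
    # positive, so it also acts as a survival bonus.
    return F.softplus(d_logits)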

@slee01
Author

slee01 commented Sep 2, 2019

This was very helpful to me.

I figured out that the standard deviation of the reward from the discriminator is much higher than that of the reward from the MuJoCo simulators.

I also understand that the reward range should differ depending on the episode-termination option.

I finally got good results after modifying the reward function.

But I'm not sure why the value network can be trained without reward normalization.

And I'm wondering whether there is a reason why you normalize the reward from the discriminator even though its standard deviation is so high.

I think clipping is more appropriate than normalization for the discriminator reward.
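
To make what I mean concrete, here is a rough sketch of the clipping option (purely illustrative; the function name, the 1e-8 epsilon, and the clip range are placeholders I picked, not values from your code):

import torch

def clipped_gail_reward(d_logits, max_reward=10.0):
    # d_logits: raw discriminator scores for the policy's samples.
    raw_reward = -torch.log(1.0 - torch.sigmoid(d_logits) + 1e-8)
    # Instead of dividing by a running statistic (which is noisy when
    # the discriminator reward has a very large spread), bound the
    # scale with a fixed clip:
    return torch.clamp(raw_reward, min=0.0, max=max_reward)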

Could you comment on these questions, please?

Thanks!

@wang88256187

Hi, I've run into a similar problem: my GAIL results are always bad. Could you share your experience with this problem in more detail? Thank you very much!
