Use RandomizeAction wrapper instead of Explorer in evaluation #328
Conversation
Resolved the conflict.
@@ -87,6 +88,9 @@ def make_env(process_idx, test):
         episode_life=not test,
         clip_rewards=not test)
     env.seed(int(env_seed))
+    if test:
+        # Randomize actions like epsilon-greedy in evaluation as well
+        env = chainerrl.wrappers.RandomizeAction(env, 0.05)
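For context, here is a minimal sketch of what an epsilon-randomizing action wrapper like this could look like, assuming a Gym-style `ActionWrapper`; the actual `chainerrl.wrappers.RandomizeAction` implementation may differ in its details:

```python
import gym
import numpy as np


class RandomizeAction(gym.ActionWrapper):
    """Replace the agent's action with a uniformly random one with
    probability `random_fraction`, mimicking epsilon-greedy evaluation."""

    def __init__(self, env, random_fraction):
        super().__init__(env)
        assert 0 <= random_fraction <= 1
        self.random_fraction = random_fraction

    def action(self, action):
        # With small probability (e.g. 0.05), ignore the agent's choice
        # and sample uniformly from the action space.
        if np.random.uniform() < self.random_fraction:
            return self.env.action_space.sample()
        return action
```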
Is there a reason that you use 0.05 instead of args.eval_epsilon as in the other algorithms? I'm aware that the DQN paper uses 0.05, but then why not use a raw value for the other domains?
I used 0.05 just because the previous examples/ale/train_nsq_ale.py
did so. I agree it would be better to make it configurable, but since that change is not relevant to this PR, I kept it unchanged.
LGTM. Perhaps we should open an issue or a follow-up PR to make the 0.05 in NSQ configurable.
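If that follow-up happens, a sketch of making the value configurable could look something like this (the `--eval-epsilon` flag name and its wiring are assumptions here, mirroring the `args.eval_epsilon` used in the other examples):

```python
import argparse

parser = argparse.ArgumentParser()
parser.add_argument('--eval-epsilon', type=float, default=0.05,
                    help='Probability of taking a random action in evaluation')
args = parser.parse_args()

# Later, when building the evaluation env (hypothetical wiring):
# if test:
#     env = chainerrl.wrappers.RandomizeAction(env, args.eval_epsilon)
```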
We currently use Explorer to inject randomness into action selection during evaluation, which is common in Atari benchmarks. This use is actually unrelated to exploration in training, so the name is a misnomer in my opinion. I think we should use an env wrapper for this purpose instead, so that the training and evaluation code can be simpler. This is also important for #326, where I would otherwise need to implement another set of code for training and evaluation.
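As an illustration of why this simplifies things: with the wrapper, an evaluation loop needs no epsilon-greedy logic of its own, since the randomization lives entirely in the env. A rough sketch (the loop below is hypothetical, not code from this repository):

```python
def run_evaluation_episode(agent, env):
    """Run one evaluation episode; the env wrapper injects the random
    actions, so the loop just queries the agent greedily."""
    obs = env.reset()
    done = False
    total_reward = 0.0
    while not done:
        action = agent.act(obs)  # greedy action; no explorer needed here
        obs, reward, done, _ = env.step(action)
        total_reward += reward
    return total_reward
```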