[rllib] Some API cleanups and documentation improvements #4409
Conversation
```diff
@@ -70,7 +71,7 @@
     "exploration_final_eps": 0.02,
     # Update the target network every `target_network_update_freq` steps.
     "target_network_update_freq": 500,
-    # Use softmax for sampling actions.
+    # Use softmax for sampling actions. Required for off policy estimation.
```
Why is soft Q required for off-policy estimation?
Otherwise there's not really a meaningful probability distribution over actions (just eps-greedy exploration). Horizon does the same thing with requiring soft-Q for OPE.
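To make that concrete, here is a minimal sketch (plain NumPy; the function names are mine, not RLlib's) of the distributions each scheme induces. Soft-Q assigns every action a proper probability that an importance-weighted off-policy estimator can use, while eps-greedy concentrates almost all of the mass on the argmax action.

```python
import numpy as np

def softmax_action_probs(q_values, temperature=1.0):
    # Soft-Q: a full distribution over actions, usable as pi(a|s) in OPE.
    z = q_values / temperature
    z = z - z.max()  # for numerical stability
    p = np.exp(z)
    return p / p.sum()

def eps_greedy_action_probs(q_values, eps=0.02):
    # Eps-greedy: nearly all probability mass sits on the greedy action,
    # so importance ratios for the other actions are tiny or degenerate.
    n = len(q_values)
    p = np.full(n, eps / n)
    p[np.argmax(q_values)] += 1.0 - eps
    return p

q = np.array([1.0, 0.5, -0.2])
print(softmax_action_probs(q))     # ~[0.52, 0.32, 0.16]: meaningful probabilities
print(eps_greedy_action_probs(q))  # ~[0.99, 0.007, 0.007]: almost deterministic
```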
Looks good. I think passing in the loss explicitly to `gradients()` also fixes some bugs where the wrong loss was being used.

Closes #4411
Fixes #4410
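For illustration, a hedged sketch of the pattern being discussed (the class and method signature below are illustrative, not the exact RLlib policy-graph API from this PR): when `gradients()` receives the loss explicitly, a subclass that builds its own loss cannot accidentally have a stale or parent-class loss differentiated instead.

```python
import tensorflow as tf

class PolicySketch:
    """Illustrative only -- not the exact RLlib PolicyGraph interface."""

    def gradients(self, optimizer, loss):
        # Differentiate exactly the loss that was passed in, rather than a
        # loss captured implicitly elsewhere in the class hierarchy.
        grads_and_vars = optimizer.compute_gradients(loss)
        # Example post-processing: clip gradients before they are applied.
        return [(tf.clip_by_norm(g, 40.0), v)
                for g, v in grads_and_vars
                if g is not None]
```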