[rllib] support autoregressive action distributions #4939
Comments
What would the use cases be? I'm not sure if I understand this correctly, but all previously sampled actions might need to belong to the same episode (and not previous episodes) to avoid possible biases.
I believe the use case here is when you are sampling multiple sub-actions per step. For example, you might have a Tuple action space with several sub-actions. With an autoregressive action distribution the second sub-action can be conditioned on the first, the third on the first and second, and so on. This all fits nicely into the action distribution interface, though implementation-wise there might be some complications, since the autoregressive head can require learnable variables. No current action distributions have internal variables. Let me know if this sounds right, cc @vladfi1 @concretevitamin
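For concreteness, here is a minimal sketch (plain numpy, not RLlib's ActionDistribution API) of the sampling step for a two-element Tuple action, where the logits for the second sub-action are conditioned on the sampled first sub-action. The heads `a1_logits_net` and `a2_logits_net` are hypothetical learnable functions.

```python
import numpy as np

def softmax(x):
    z = x - x.max()
    e = np.exp(z)
    return e / e.sum()

def sample_tuple_action(obs_embedding, a1_logits_net, a2_logits_net, rng):
    # First sub-action depends only on the observation embedding.
    logits1 = a1_logits_net(obs_embedding)
    p1 = softmax(logits1)
    a1 = rng.choice(len(logits1), p=p1)

    # Second sub-action is conditioned on the observation AND the sampled
    # first sub-action -- this is the autoregressive step.
    a1_onehot = np.eye(len(logits1))[a1]
    logits2 = a2_logits_net(np.concatenate([obs_embedding, a1_onehot]))
    p2 = softmax(logits2)
    a2 = rng.choice(len(logits2), p=p2)

    # The joint log-prob factorizes as log p(a1|s) + log p(a2|s, a1).
    logp = np.log(p1[a1]) + np.log(p2[a2])
    return (a1, a2), logp
```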
Another way to deal with sub-actions or a hierarchical action space might be to explicitly allow stacking of ActionDistributions, as in a probabilistic graphical model construction, where one distribution's sample op tensor gets plugged into another ActionDistribution as its parameters (and the second ActionDistribution can return both a sample op based on its input and the original input, if the action space needs both).
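A rough sketch of that stacking construction, using hypothetical objects rather than RLlib's ActionDistribution class: the first distribution's sample is fed into a factory that parameterizes the second distribution, and the joint sample returns both pieces.

```python
# Hypothetical stacked distribution: `first_dist` and whatever
# `make_second_dist` returns are assumed to expose sample() and logp().
class StackedDistribution:
    def __init__(self, first_dist, make_second_dist):
        self.first_dist = first_dist              # e.g. a Categorical over sub-action 1
        self.make_second_dist = make_second_dist  # callable: sample of sub-action 1 -> dist over sub-action 2

    def sample(self):
        s1 = self.first_dist.sample()
        s2 = self.make_second_dist(s1).sample()   # second dist parameterized by s1
        return (s1, s2)                           # return both if the action space needs both

    def logp(self, action):
        s1, s2 = action
        return self.first_dist.logp(s1) + self.make_second_dist(s1).logp(s2)
```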
That sounds right to me.
From some discussion with @mawright, here are some of the challenges of autoregressive action distributions. These distributions have learnable variables, which leads to two complications:
1. The action distribution class needs to be trained as well. This could be resolved by moving the action distribution into the model class itself. That is, a model can declare its own custom action distribution, in which case the model output will be used as actions verbatim. The model would have to implement logp(), entropy(), and so on for its action distribution.
2. There is no longer a small set of "logits" that completely parameterize the distribution. The parameterization depends on the variables of the model. This could cause issues for algorithms that depend on importance sampling such as PPO and IMPALA, unless we changed the logits to also include the learnable parameters of the autoregressive distribution.
Another related use case brought up on the dev list is action-dependent masking out of invalid sub-actions, which is a simpler case with no learnable variables: https://groups.google.com/forum/#!topic/ray-dev/ozeuozWv3PY
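To make complication 1 above concrete, here is a hedged sketch (hypothetical names, written in PyTorch rather than RLlib's TF model API) of a model that owns the learnable autoregressive head and therefore has to provide logp() and entropy() itself.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class AutoregressiveModel(nn.Module):
    """Sketch: the a2 head is a learnable part of the model, so the
    distribution is no longer described by a fixed logits vector."""

    def __init__(self, obs_dim, n_a1, n_a2, hidden=64):
        super().__init__()
        self.body = nn.Sequential(nn.Linear(obs_dim, hidden), nn.ReLU())
        self.a1_head = nn.Linear(hidden, n_a1)
        self.a2_head = nn.Linear(hidden + n_a1, n_a2)  # conditioned on sampled a1
        self.n_a1 = n_a1

    def _logits(self, obs, a1):
        feat = self.body(obs)
        logits1 = self.a1_head(feat)
        a1_onehot = F.one_hot(a1, self.n_a1).float()
        logits2 = self.a2_head(torch.cat([feat, a1_onehot], dim=-1))
        return logits1, logits2

    def logp(self, obs, actions):
        a1, a2 = actions
        logits1, logits2 = self._logits(obs, a1)
        lp1 = F.log_softmax(logits1, -1).gather(-1, a1.unsqueeze(-1)).squeeze(-1)
        lp2 = F.log_softmax(logits2, -1).gather(-1, a2.unsqueeze(-1)).squeeze(-1)
        return lp1 + lp2

    def entropy(self, obs, actions):
        # Exact joint entropy would require summing over all values of a1;
        # using the a2 factor's entropy at the sampled a1 gives a cheap
        # Monte Carlo estimate of the conditional term.
        a1, _ = actions
        logits1, logits2 = self._logits(obs, a1)
        ent = lambda lg: -(F.softmax(lg, -1) * F.log_softmax(lg, -1)).sum(-1)
        return ent(logits1) + ent(logits2)
```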
I for one would welcome this :)
I'm not sure about PPO, but IMPALA only needs the action probabilities, not the full logits. The entropy will have to be estimated from samples (perhaps using the logits at the behavior action).
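As a concrete (purely illustrative, not RLlib code) version of that point: when the full parameterization is not available, the entropy term can be estimated by Monte Carlo from the sampled actions' log-probs, optionally importance-weighted when those actions came from the behavior policy.

```python
import numpy as np

def entropy_from_samples(target_logp, behavior_logp=None):
    # Monte Carlo estimate of H(pi) = -E_{a~pi}[log pi(a|s)].
    # target_logp: log-prob of each sampled action under the policy whose
    #   entropy we want; behavior_logp: log-prob under the policy that
    #   actually sampled the actions (if different), used for importance
    #   weighting.
    target_logp = np.asarray(target_logp)
    if behavior_logp is None:
        return -np.mean(target_logp)
    w = np.exp(target_logp - np.asarray(behavior_logp))  # importance ratios
    return -np.mean(w * target_logp)
```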
@vladfi1 that sounds right, here's my current understanding. Probabilities you might want: p_worker (the probability of the sampled action under the worker's policy at rollout time) and p_learner_old (its probability under the learner's policy prior to SGD).
Currently p_learner_old is defined by the logits. However, with an autoregressive action dist we will also need to retain a copy of the learner params prior to SGD to compute p_learner_old.
Currently p_worker is computed from the logits, but we could instead save the actual log prob of the actions. p_learner_old is no problem to compute as part of the forward pass.
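A small illustrative sketch (dict keys invented here, not RLlib's actual SampleBatch fields) of saving the actual action log-prob at rollout time, so that p_worker never has to be reconstructed from logits and the learner can form importance ratios directly.

```python
import numpy as np

def rollout_step(policy_sample_fn, obs, batch):
    # policy_sample_fn returns the sampled (possibly autoregressive) action
    # together with its log-prob, computed in the same forward pass.
    action, logp = policy_sample_fn(obs)
    batch["obs"].append(obs)
    batch["actions"].append(action)
    batch["action_logp"].append(logp)  # this is p_worker, stored directly
    return action

def importance_ratios(learner_logp, batch):
    # Ratios p_learner / p_worker used by PPO / IMPALA style objectives.
    return np.exp(np.asarray(learner_logp) - np.asarray(batch["action_logp"]))
```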
probabilistic RL is sweet!
Support Tuple action distributions where each action element depends on the previously sampled action elements.
This might require changes like #4895
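For concreteness, the kind of Tuple action space this request is about, and the factorization an autoregressive distribution over it implies (the pi1/pi2 names are illustrative only):

```python
from gym.spaces import Discrete, Tuple

action_space = Tuple([Discrete(5), Discrete(3)])

# Desired factorization over the sub-actions:
#   p(a1, a2 | s) = pi1(a1 | s) * pi2(a2 | s, a1)
# i.e. sampling draws a1 first and feeds it to the head that produces a2,
# and logp()/entropy() are defined over the joint distribution.
```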