
[rllib] support autoregressive action distributions #4939

Closed
ericl opened this issue Jun 6, 2019 · 9 comments · Fixed by #5304

Comments

@ericl
Contributor

ericl commented Jun 6, 2019

Support Tuple action distributions where each action element depends on the previously sampled action elements.

This might require changes like #4895

@federicofontana
Contributor

What would the use cases be?

I'm not sure if I understand this correctly, but all previously sampled actions might need to belong to the same episode (and not previous episodes) to avoid possible biases.

@ericl
Contributor Author

ericl commented Jun 7, 2019

I believe the use case here is when you are sampling multiple sub-actions per step. For example, you might have a Tuple action space with several sub-actions. With an autoregressive action distribution, the second sub-action can be conditioned on the first, the third on the first and second, and so on.

This all fits nicely into the action distribution interface, though implementation-wise there might be some complications, since the autoregressive head can require learnable variables. No current action distributions have internal variables.

Let me know if this sounds right, cc @vladfi1 @concretevitamin
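To make that concrete, here is a minimal sketch in plain PyTorch (hypothetical class and argument names, not RLlib's actual interface) of a two-part Tuple action where the second head is conditioned on the sampled first sub-action:

```python
import torch
import torch.nn as nn


class AutoregressiveTupleHead(nn.Module):
    """Two-part Tuple action: the logits for a2 depend on the sampled a1."""

    def __init__(self, obs_dim, a1_dim, a2_dim, hidden=64):
        super().__init__()
        self.base = nn.Sequential(nn.Linear(obs_dim, hidden), nn.ReLU())
        self.a1_logits = nn.Linear(hidden, a1_dim)
        # The a2 head sees both the shared features and the sampled a1,
        # so the autoregressive part carries its own learnable variables.
        self.a2_logits = nn.Linear(hidden + a1_dim, a2_dim)
        self.a1_dim = a1_dim

    def sample(self, obs):
        feats = self.base(obs)
        d1 = torch.distributions.Categorical(logits=self.a1_logits(feats))
        a1 = d1.sample()
        a1_onehot = nn.functional.one_hot(a1, self.a1_dim).float()
        d2 = torch.distributions.Categorical(
            logits=self.a2_logits(torch.cat([feats, a1_onehot], dim=-1)))
        a2 = d2.sample()
        # The joint log-prob factorizes as log p(a1) + log p(a2 | a1).
        return (a1, a2), d1.log_prob(a1) + d2.log_prob(a2)


head = AutoregressiveTupleHead(obs_dim=8, a1_dim=3, a2_dim=5)
(a1, a2), logp = head.sample(torch.randn(4, 8))  # batch of 4 observations
```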

@mawright
Contributor

mawright commented Jun 7, 2019

Another way to deal with sub-actions or a hierarchical action space might be to explicitly allow stacking of ActionDistributions, as in a probabilistic graphical model: one distribution's sample op tensor gets plugged into another ActionDistribution as its parameters (and the second ActionDistribution can return both a sample op based on its input and the original input, if the action space needs both).
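A rough illustration of that kind of stacking, using plain torch.distributions rather than RLlib's ActionDistribution interface (all names here are illustrative):

```python
import torch
import torch.nn as nn

# Maps the first distribution's sample to the parameters of the second one.
param_net = nn.Linear(1, 4)

d1 = torch.distributions.Normal(loc=torch.zeros(2), scale=torch.ones(2))
x = d1.sample()                         # first "sample op"
d2 = torch.distributions.Categorical(
    logits=param_net(x.unsqueeze(-1)))  # second dist parameterized by x
y = d2.sample()
action = (x, y)                         # return both parts if the space needs both
```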

@vladfi1
Contributor

vladfi1 commented Jun 7, 2019

That sounds right to me.

@ericl
Contributor Author

ericl commented Jun 12, 2019

From some discussion with @mawright, here are some of the challenges of autoregressive action distributions. These distributions have learnable variables, which leads to two complications:

  1. The action distribution class needs to be trained as well. This could be resolved by moving the action distribution into the model class itself. That is, a model can declare its own custom action distribution, in which case the model output will be used as actions verbatim. The model would have to implement logp(), entropy(), and so on for its action distribution (see the sketch after this list).
  2. There is no longer a small set of "logits" that completely parameterize the distribution. The parameterization depends on the variables of the model. This could cause issues for algorithms that depend on importance sampling, such as PPO and IMPALA, unless we change the logits to also include the learnable parameters of the autoregressive distribution.
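A minimal sketch of complication (1), with the distribution methods living on the model so their variables get trained with everything else; a plain Categorical is used for brevity, and none of these names are RLlib's actual interface:

```python
import torch
import torch.nn as nn


class ModelWithOwnDistribution(nn.Module):
    """The model declares its own distribution and exposes logp()/entropy()."""

    def __init__(self, obs_dim, num_actions, hidden=64):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(obs_dim, hidden), nn.ReLU(),
                                 nn.Linear(hidden, num_actions))

    def _dist(self, obs):
        return torch.distributions.Categorical(logits=self.net(obs))

    def sample(self, obs):
        # The model's output is used as the action directly ("verbatim").
        return self._dist(obs).sample()

    def logp(self, obs, actions):
        return self._dist(obs).log_prob(actions)

    def entropy(self, obs):
        return self._dist(obs).entropy()
```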

Another related use case brought up on the dev list is action-dependent masking of invalid sub-actions, which is a simpler case with no learnable variables: https://groups.google.com/forum/#!topic/ray-dev/ozeuozWv3PY
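A minimal sketch of that masking case, assuming a fixed table of valid second sub-actions for each first sub-action (illustrative tensors and names only):

```python
import torch

NEG_INF = torch.finfo(torch.float32).min
# valid_mask[a1] marks which second sub-actions are legal given a1.
valid_mask = torch.tensor([[1, 1, 0],
                           [0, 1, 1]], dtype=torch.bool)

a1_logits = torch.randn(4, 2)  # batch of 4; two choices for the first part
a2_logits = torch.randn(4, 3)  # three choices for the second part

a1 = torch.distributions.Categorical(logits=a1_logits).sample()
masked = torch.where(valid_mask[a1], a2_logits,
                     torch.full_like(a2_logits, NEG_INF))
a2 = torch.distributions.Categorical(logits=masked).sample()
```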

@concretevitamin
Contributor

concretevitamin commented Jun 13, 2019 via email

@vladfi1
Contributor

vladfi1 commented Jun 13, 2019

I'm not sure about PPO, but IMPALA only needs the action probabilities, not the full logits. The entropy will have to be estimated from samples (perhaps using the logits at the behavior action).
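One possible reading of that suggestion is an importance-weighted Monte Carlo estimate of the learner policy's entropy from behavior-policy actions; the tensors below are placeholders:

```python
import torch

# Placeholder log-probs of the behavior actions under each policy.
logp_learner = torch.randn(32).clamp(max=0.0)   # log pi(a_behavior | s)
logp_behavior = torch.randn(32).clamp(max=0.0)  # log mu(a_behavior | s)

rho = torch.exp(logp_learner - logp_behavior)   # importance ratio pi / mu
entropy_estimate = -(rho * logp_learner).mean()
```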

@ericl
Contributor Author

ericl commented Jun 13, 2019

@vladfi1 that sounds right; here's my current understanding.

Probabilities you might want:

- p_worker(behaviour_a)
- p_learner_old(behaviour_a) (before the optimization step)
- p_learner_new(behaviour_a) (after the optimization step)

PPO needs: p_learner_old, p_learner_new (it's also on-policy, so it's assumed that p_worker == p_learner_old).

Currently p_learner_old is defined by the logits. However, with an autoregressive action dist we will also need to retain a copy of the learner params prior to SGD to compute p_learner_old.

IMPALA needs: p_worker, p_learner_old.

Currently p_worker is computed from the logits, but we could instead save the actual log prob of the actions. p_learner_old is no problem to compute as part of the forward pass.
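A rough sketch of the "save the actual log prob" idea, with the worker storing the sampled actions' log-probs in the batch at rollout time (hypothetical names, plain PyTorch):

```python
import torch


def rollout_step(make_dist, obs_batch):
    """make_dist(obs) -> a torch.distributions.Distribution for that batch."""
    d = make_dist(obs_batch)
    actions = d.sample()
    return {
        "obs": obs_batch,
        "actions": actions,
        # Stored once at sampling time; this plays the role of p_worker.
        "action_logp": d.log_prob(actions),
    }


# Example with a simple Categorical policy over 3 actions:
batch = rollout_step(
    lambda obs: torch.distributions.Categorical(logits=obs), torch.randn(4, 3))
# p_learner_old / p_learner_new are then d.log_prob(batch["actions"])
# evaluated before and after the SGD step, as part of the forward pass.
```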

@bionicles

probabilistic rl is sweet!
