
[rllib] support autoregressive action distributions #4939

Closed
ericl opened this issue Jun 6, 2019 · 9 comments · Fixed by #5304

Comments

@ericl
Contributor

ericl commented Jun 6, 2019

Support Tuple action distributions where each action element depends on the previously sampled action elements.

This might require changes like #4895

@federicofontana
Contributor

What would the use cases be?

I'm not sure if I understand this correctly, but all previously sampled actions might need to belong to the same episode (and not previous episodes) to avoid possible biases.

@ericl
Contributor Author

ericl commented Jun 7, 2019

I believe the use case here is when you are sampling multiple sub-actions per step. For example, you might have a Tuple action space with several sub-actions. With an autoregressive action distribution, the second sub-action can be conditioned on the first, the third on the first and second, and so on.

This all fits nicely into the action distribution interface, though implementation-wise there might be some complications, since the autoregressive head can require learnable variables. No current action distributions have internal variables.

Let me know if this sounds right, cc @vladfi1 @concretevitamin
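To make that concrete, here is a minimal sketch in plain PyTorch (hypothetical class and argument names, not RLlib's actual interface) of a two-part Tuple action where the second head is conditioned on the sampled first sub-action:

```python
import torch
import torch.nn as nn


class AutoregressiveTupleHead(nn.Module):
    """Two-part Tuple action: the logits for a2 depend on the sampled a1."""

    def __init__(self, obs_dim, a1_dim, a2_dim, hidden=64):
        super().__init__()
        self.base = nn.Sequential(nn.Linear(obs_dim, hidden), nn.ReLU())
        self.a1_logits = nn.Linear(hidden, a1_dim)
        # The a2 head sees both the shared features and the sampled a1,
        # so the autoregressive part carries its own learnable variables.
        self.a2_logits = nn.Linear(hidden + a1_dim, a2_dim)
        self.a1_dim = a1_dim

    def sample(self, obs):
        feats = self.base(obs)
        d1 = torch.distributions.Categorical(logits=self.a1_logits(feats))
        a1 = d1.sample()
        a1_onehot = nn.functional.one_hot(a1, self.a1_dim).float()
        d2 = torch.distributions.Categorical(
            logits=self.a2_logits(torch.cat([feats, a1_onehot], dim=-1)))
        a2 = d2.sample()
        # The joint log-prob factorizes as log p(a1) + log p(a2 | a1).
        return (a1, a2), d1.log_prob(a1) + d2.log_prob(a2)


head = AutoregressiveTupleHead(obs_dim=8, a1_dim=3, a2_dim=5)
(a1, a2), logp = head.sample(torch.randn(4, 8))  # batch of 4 observations
```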

@mawright
Contributor

mawright commented Jun 7, 2019

Another way to deal with sub-actions or a hierarchical action space might be to explicitly allow stacking of ActionDistributions, as in a probabilistic graphical model: one distribution's sample op tensor gets plugged into another ActionDistribution as its parameters (and the second ActionDistribution can return both a sample op based on its input and the original input, if the action space needs both).
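A rough illustration of that kind of stacking, using plain torch.distributions rather than RLlib's ActionDistribution interface (all names here are illustrative):

```python
import torch
import torch.nn as nn

# Maps the first distribution's sample to the parameters of the second one.
param_net = nn.Linear(1, 4)

d1 = torch.distributions.Normal(loc=torch.zeros(2), scale=torch.ones(2))
x = d1.sample()                         # first "sample op"
d2 = torch.distributions.Categorical(
    logits=param_net(x.unsqueeze(-1)))  # second dist parameterized by x
y = d2.sample()
action = (x, y)                         # return both parts if the space needs both
```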

@vladfi1
Contributor

vladfi1 commented Jun 7, 2019

That sounds right to me.

@ericl
Contributor Author

ericl commented Jun 12, 2019

From some discussion with @mawright, here are some of the challenges of autoregressive action distributions. These distributions have learnable variables, which leads to two complications:

  1. The action distribution class needs to be trained as well. This could be resolved by moving the action distribution into the model class itself. That is, a model can declare its own custom action distribution, in which case the model output will be used as actions verbatim. The model would have to implement logp(), entropy(), and so on for its action distribution (see the sketch after this list).
  2. There is no longer a small set of "logits" that completely parameterize the distribution. The parameterization depends on the variables of the model. This could cause issues for algorithms that depend on importance sampling, such as PPO and IMPALA, unless we change the logits to also include the learnable parameters of the autoregressive distribution.
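A minimal sketch of complication (1), with the distribution methods living on the model so their variables get trained with everything else; a plain Categorical is used for brevity, and none of these names are RLlib's actual interface:

```python
import torch
import torch.nn as nn


class ModelWithOwnDistribution(nn.Module):
    """The model declares its own distribution and exposes logp()/entropy()."""

    def __init__(self, obs_dim, num_actions, hidden=64):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(obs_dim, hidden), nn.ReLU(),
                                 nn.Linear(hidden, num_actions))

    def _dist(self, obs):
        return torch.distributions.Categorical(logits=self.net(obs))

    def sample(self, obs):
        # The model's output is used as the action directly ("verbatim").
        return self._dist(obs).sample()

    def logp(self, obs, actions):
        return self._dist(obs).log_prob(actions)

    def entropy(self, obs):
        return self._dist(obs).entropy()
```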

Another related use case brought up on the dev list is action-dependent masking of invalid sub-actions, which is a simpler case with no learnable variables: https://groups.google.com/forum/#!topic/ray-dev/ozeuozWv3PY
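A minimal sketch of that masking case, assuming a fixed table of valid second sub-actions for each first sub-action (illustrative tensors and names only):

```python
import torch

NEG_INF = torch.finfo(torch.float32).min
# valid_mask[a1] marks which second sub-actions are legal given a1.
valid_mask = torch.tensor([[1, 1, 0],
                           [0, 1, 1]], dtype=torch.bool)

a1_logits = torch.randn(4, 2)  # batch of 4; two choices for the first part
a2_logits = torch.randn(4, 3)  # three choices for the second part

a1 = torch.distributions.Categorical(logits=a1_logits).sample()
masked = torch.where(valid_mask[a1], a2_logits,
                     torch.full_like(a2_logits, NEG_INF))
a2 = torch.distributions.Categorical(logits=masked).sample()
```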

@concretevitamin
Contributor

concretevitamin commented Jun 13, 2019 via email

@vladfi1
Contributor

vladfi1 commented Jun 13, 2019

I'm not sure about PPO, but IMPALA only needs the action probabilities, not the full logits. The entropy will have to be estimated from samples (perhaps using the logits at the behavior action).
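One possible reading of that suggestion is an importance-weighted Monte Carlo estimate of the learner policy's entropy from behavior-policy actions; the tensors below are placeholders:

```python
import torch

# Placeholder log-probs of the behavior actions under each policy.
logp_learner = torch.randn(32).clamp(max=0.0)   # log pi(a_behavior | s)
logp_behavior = torch.randn(32).clamp(max=0.0)  # log mu(a_behavior | s)

rho = torch.exp(logp_learner - logp_behavior)   # importance ratio pi / mu
entropy_estimate = -(rho * logp_learner).mean()
```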

@ericl
Contributor Author

ericl commented Jun 13, 2019

@vladfi1 that sounds right; here's my current understanding.

Probabilities you might want:

- p_worker(behaviour_a)
- p_learner_old(behaviour_a) (before the optimization step)
- p_learner_new(behaviour_a) (after the optimization step)

PPO needs: p_learner_old, p_learner_new (it's also on-policy, so it's assumed that p_worker == p_learner_old).

Currently p_learner_old is defined by the logits. However, with an autoregressive action dist we will also need to retain a copy of the learner params prior to SGD to compute p_learner_old.

IMPALA needs: p_worker, p_learner_old.

Currently p_worker is computed from the logits, but we could instead save the actual log prob of the actions. p_learner_old is no problem to compute as part of the forward pass.
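A rough sketch of the "save the actual log prob" idea, with the worker storing the sampled actions' log-probs in the batch at rollout time (hypothetical names, plain PyTorch):

```python
import torch


def rollout_step(make_dist, obs_batch):
    """make_dist(obs) -> a torch.distributions.Distribution for that batch."""
    d = make_dist(obs_batch)
    actions = d.sample()
    return {
        "obs": obs_batch,
        "actions": actions,
        # Stored once at sampling time; this plays the role of p_worker.
        "action_logp": d.log_prob(actions),
    }


# Example with a simple Categorical policy over 3 actions:
batch = rollout_step(
    lambda obs: torch.distributions.Categorical(logits=obs), torch.randn(4, 3))
# p_learner_old / p_learner_new are then d.log_prob(batch["actions"])
# evaluated before and after the SGD step, as part of the forward pass.
```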

@bionicles

probabilistic rl is sweet!
