[rllib] Support parametric discrete action spaces


### Describe the problem

Allow environments to emit, at each step, a variable-length list of the currently valid actions, each represented by an action embedding.

Implement this in policy gradient algorithms and in DQN.
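
For discussion, here is a rough sketch of one way this could work. It is not RLlib's API, and every name in it (`score_actions`, `masked_softmax`, `greedy_action`, `max_actions`) is hypothetical. It assumes the model produces a state embedding, each valid action is scored by a dot product with its embedding, and the list is padded to a fixed length with `-inf` so that invalid slots are never selected. The same scores could serve as logits for policy gradient methods or as Q-values for DQN.

```python
import math

NEG_INF = float("-inf")


def score_actions(state_embedding, action_embeddings, max_actions):
    """Score each valid action by a dot product with the state embedding.

    Returns a list of length max_actions; slots beyond the number of
    valid actions are filled with -inf so they can never be chosen.
    (Hypothetical helper, not part of RLlib.)
    """
    if len(action_embeddings) > max_actions:
        raise ValueError("more valid actions than max_actions")
    scores = [
        sum(s * a for s, a in zip(state_embedding, emb))
        for emb in action_embeddings
    ]
    return scores + [NEG_INF] * (max_actions - len(scores))


def masked_softmax(scores):
    """Policy gradient case: turn scores into probabilities.

    Padded (-inf) slots receive probability exactly 0.
    """
    finite = [s for s in scores if s != NEG_INF]
    if not finite:
        raise ValueError("at least one valid action is required")
    top = max(finite)  # subtract the max for numerical stability
    exps = [math.exp(s - top) if s != NEG_INF else 0.0 for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def greedy_action(scores):
    """DQN case: treat scores as Q-values and pick the best valid index."""
    return max(range(len(scores)), key=lambda i: scores[i])


# Example step: 3 valid actions, padded to a fixed maximum of 5 slots.
state = [1.0, 0.0]
valid_action_embeddings = [[0.5, 2.0], [2.0, -1.0], [-1.0, 0.0]]

scores = score_actions(state, valid_action_embeddings, max_actions=5)
probs = masked_softmax(scores)
best = greedy_action(scores)
```

Padding to a fixed `max_actions` is only one option; it keeps the action space shape constant across steps, at the cost of requiring an upper bound on the number of valid actions.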