### Describe the problem

Allow environments to emit a variable-length list of valid actions (action embeddings) for each step. Implement this in policy gradient algorithms and in DQN.
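
To make the request concrete, here is a minimal sketch of one way this could work, assuming the environment returns the valid action embeddings for the current step and the algorithm pads them to a fixed `max_actions` with a mask. The helper names (`pad_actions`, `masked_policy`, `masked_greedy_action`) and the dot-product scoring are hypothetical and used only for illustration; they are not an existing API.

```python
import numpy as np


def pad_actions(valid_actions, max_actions):
    """Pad a variable-length list of action embeddings to a fixed shape.

    Returns (padded, mask): padded is (max_actions, embed_dim); mask is
    (max_actions,) with 1.0 for real actions and 0.0 for padding rows.
    """
    valid_actions = np.asarray(valid_actions, dtype=np.float64)
    if valid_actions.ndim != 2:
        raise ValueError("expected a non-empty list of equal-length embeddings")
    n, embed_dim = valid_actions.shape
    if n == 0 or n > max_actions:
        raise ValueError("need between 1 and max_actions valid actions")
    padded = np.zeros((max_actions, embed_dim))
    padded[:n] = valid_actions
    mask = np.zeros(max_actions)
    mask[:n] = 1.0
    return padded, mask


def masked_policy(logits, mask):
    """Policy gradient side: softmax that gives padded slots zero probability."""
    masked = np.where(mask > 0, logits, -np.inf)
    masked = masked - masked.max()  # subtract the max for numerical stability
    exp = np.exp(masked)            # exp(-inf) == 0.0 for padded slots
    return exp / exp.sum()


def masked_greedy_action(q_values, mask):
    """DQN side: argmax over Q-values restricted to valid actions."""
    return int(np.argmax(np.where(mask > 0, q_values, -np.inf)))


# Hypothetical step: the environment offers 3 valid actions out of at most 5.
state = np.array([1.0, 0.0])                      # assumed state embedding
actions = [[1.0, 0.0], [0.0, 1.0], [2.0, 0.0]]    # assumed action embeddings
padded, mask = pad_actions(actions, max_actions=5)

# Assumed scoring: dot product of the state with each action embedding.
scores = padded @ state
probs = masked_policy(scores, mask)
best = masked_greedy_action(scores, mask)
```

The mask matters most when every valid action scores below a padding row: without it, a padded slot with score 0 could be sampled by the policy or selected by the greedy DQN step.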