[RLlib] Multiagent learning: can't combine lockstep replay mode and multiple agents controlled by the same policy. #9295
Description
Hello,
While implementing MADDPG for PyTorch, I noticed that it is not possible to combine the lockstep replay mode configuration with multiple agents controlled by the same policy. This is because MultiAgentBatch is implemented as a dict of PolicyID -> SampleBatch, so the samples of all agents that share a policy end up in the same SampleBatch. In the contrib TensorFlow implementation this issue was circumvented by using parameter sharing. Another solution, which is framework agnostic, would be to group the agents.
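To make the problem concrete, here is a minimal sketch in plain Python. It is not RLlib code: plain lists stand in for SampleBatch, and group_by_policy and lockstep_aligned are hypothetical helpers I made up for illustration. It only shows how keying batches by policy ID merges the rows of agents that share a policy, so the per-policy batches no longer line up timestep by timestep.

```python
from collections import defaultdict

def group_by_policy(agent_rows, policy_of):
    """Concatenate per-agent rows into one list per policy ID.

    Hypothetical stand-in for a dict of PolicyID -> SampleBatch.
    """
    batches = defaultdict(list)
    for agent_id, rows in agent_rows.items():
        batches[policy_of[agent_id]].append(rows)
    # Flatten: agents sharing a policy are merged into a single batch.
    return {pid: [r for rows in chunks for r in rows]
            for pid, chunks in batches.items()}

def lockstep_aligned(policy_batches):
    """True if every per-policy batch has the same number of rows."""
    return len({len(rows) for rows in policy_batches.values()}) == 1

T = 3  # timesteps collected per agent
agent_rows = {a: [(a, t) for t in range(T)]
              for a in ("agent_0", "agent_1", "agent_2")}

# One policy per agent: every per-policy batch has T rows.
one_policy_each = {"agent_0": "p0", "agent_1": "p1", "agent_2": "p2"}
separate = group_by_policy(agent_rows, one_policy_each)

# agent_0 and agent_1 share a policy: that batch has 2 * T rows.
shared_mapping = {"agent_0": "shared", "agent_1": "shared", "agent_2": "p2"}
mixed = group_by_policy(agent_rows, shared_mapping)

print(lockstep_aligned(separate), lockstep_aligned(mixed))
```

Even when all agents share a single policy and the lengths happen to match, the merged batch no longer records which rows belong to which agent, so sampling the same timestep for every agent is still not possible from this structure alone.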
I think this combination should be supported without needing to group the agents. However, I suspect that it would require significant code changes. Maybe another solution exists?
Also, the documentation should be updated, along with the code that checks that the config is valid.
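As a rough idea of what such a config check could look like, here is a sketch under some assumptions. check_lockstep_config is a hypothetical function, not an existing RLlib one, and I am assuming the config is a nested dict with the keys "multiagent", "replay_mode", and "policy_mapping_fn", and that the mapping function takes only an agent ID. The set of agent IDs also has to be known up front, which may not hold for every environment.

```python
from collections import Counter

def check_lockstep_config(config, agent_ids):
    """Raise ValueError if lockstep replay is combined with a shared policy.

    Hypothetical helper; the config keys used here are assumptions.
    """
    multiagent = config.get("multiagent", {})
    if multiagent.get("replay_mode") != "lockstep":
        return  # Nothing to check for other replay modes.

    mapping_fn = multiagent["policy_mapping_fn"]
    # Count how many agents are mapped to each policy ID.
    counts = Counter(mapping_fn(agent_id) for agent_id in agent_ids)
    shared = sorted(pid for pid, n in counts.items() if n > 1)
    if shared:
        raise ValueError(
            "replay_mode='lockstep' cannot be combined with multiple agents "
            f"mapped to the same policy: {shared}"
        )

# Each agent has its own policy: passes silently.
ok_config = {
    "multiagent": {
        "replay_mode": "lockstep",
        "policy_mapping_fn": lambda agent_id: f"policy_{agent_id}",
    }
}
check_lockstep_config(ok_config, ["a0", "a1"])

# Both agents share one policy: raises ValueError.
bad_config = {
    "multiagent": {
        "replay_mode": "lockstep",
        "policy_mapping_fn": lambda agent_id: "shared_policy",
    }
}
try:
    check_lockstep_config(bad_config, ["a0", "a1"])
except ValueError as err:
    print(err)
```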
Thanks
(Amazing work btw)