Description
Hello everyone, this discussion is the starting point for an extension of the TorchRL MARL API.
We hope to get your feedback.
Potential TorchRL MARL API
This API proposes a general structure that multi-agent environments can use in TorchRL to pass their data to the library. It will not be enforced. Its core tenet is that data processed by the same neural network structure should be stacked (grouped) together to leverage tensor batching, while data processed by different neural networks should be kept under different keys.
Data format
Agents have observations, dones, rewards and actions. These values can be processed by the same component or by different components. If some values across agents are processed by the same component, they should be stacked (grouped) together under the same key. Grouping happens within a TensorDict, with an additional dimension representing the group size.
Users can optionally maintain in the env a map from each group to its member agents.
Let's see a few examples.
Case 1: all agents’ data is processed together
In this example, all agents' data will be processed by the same neural network, so it is convenient to stack it, creating a tensordict with an "n_agents" dimension:
TensorDict(
    "agents": TensorDict(
        "obs_a": Tensor,
        "obs_b": Tensor,
        "action": Tensor,
        "done": Tensor,
        "reward": Tensor,
        batch_size=[*B, n_agents]),
    "state": Tensor,
    batch_size=[*B])
In this example "agents" is the group.
This means that each tensor in "agents" has a leading shape [*B, n_agents] and can be passed to the same neural network.
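For instance, a tensordict with this layout can be built as follows (a minimal sketch; all feature shapes here are made up for illustration):

    import torch
    from tensordict import TensorDict

    B, n_agents = 32, 3  # illustrative batch and group sizes

    td = TensorDict(
        {
            "agents": TensorDict(
                {
                    "obs_a": torch.randn(B, n_agents, 8),
                    "obs_b": torch.randn(B, n_agents, 4),
                    "action": torch.randn(B, n_agents, 2),
                    "done": torch.zeros(B, n_agents, 1, dtype=torch.bool),
                    "reward": torch.zeros(B, n_agents, 1),
                },
                batch_size=[B, n_agents],
            ),
            "state": torch.randn(B, 16),
        },
        batch_size=[B],
    )

    print(td["agents", "obs_a"].shape)  # torch.Size([32, 3, 8])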
Optionally, we can maintain a map from groups to agents. Supposing we have 3 agents named "agent_0", "agent_1" and "agent_2", we can see that they are all part of the "agents" group:
env.group_map["agents"] = ["agent_0", "agent_1", "agent_2"]
In the above example, all the keys under the "agents" group have an agent dimension. If, on the other hand, some keys are shared across agents (like "state"), they should be put in the root TensorDict, outside of the group, to highlight that they lack the agent dimension. For example, if done and reward were shared by all agents, we would have:
TensorDict(
    "agents": TensorDict(
        "obs_a": Tensor,
        "obs_b": Tensor,
        "action": Tensor,
        batch_size=[*B, n_agents]),
    "state": Tensor,
    "done": Tensor,
    "reward": Tensor,
    batch_size=[*B])
Example neural network for this case
A policy for this use case can look something like
TensorDictSequential(
    # policy_net is a placeholder nn.Module mapping the stacked observations to actions
    TensorDictModule(policy_net, in_keys=[("agents", "obs_a"), ("agents", "obs_b")], out_keys=[("agents", "action")]),
)
A value network for this use case can look something like
TensorDictSequential(
    # value_net is a placeholder nn.Module; note that "value" has no agent dimension here
    TensorDictModule(value_net, in_keys=[("agents", "obs_a"), ("agents", "obs_b"), "state"], out_keys=["value"]),
)
Note that even if the agents share the same processing, different parameters can be used for each agent via the use of vmap.
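For example (a minimal sketch using plain torch.func rather than any TorchRL-specific module; the per-agent linear policies and all shapes are illustrative):

    import copy
    import torch
    from torch import nn
    from torch.func import functional_call, stack_module_state

    n_agents, obs_dim, act_dim = 3, 8, 2

    # One network per agent: same architecture, independent parameters.
    nets = [nn.Linear(obs_dim, act_dim) for _ in range(n_agents)]
    params, buffers = stack_module_state(nets)

    # A stateless "meta" copy serves as the template for functional_call.
    base = copy.deepcopy(nets[0]).to("meta")

    def forward_one(p, b, obs):
        return functional_call(base, (p, b), (obs,))

    obs = torch.randn(n_agents, obs_dim)  # stacked per-agent observations
    # vmap maps jointly over the agent dimension of parameters and observations.
    action = torch.vmap(forward_one)(params, buffers, obs)
    print(action.shape)  # torch.Size([3, 2])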
This API is currently supported in TorchRL and can be used with VMAS. You can see how in this tutorial.
Case 2: some groups of agents share data processing
Sometimes only some of the agents share data processing. This can be because agents are physically different (heterogeneous) or because different behaviors (neural networks) are associated with them (as in MLAgents). Once again, we use tensordicts to group agents that share data processing:
TensorDict(
    "group_1": TensorDict(
        "obs_a": Tensor,
        "action": Tensor,
        "done": Tensor,
        "reward": Tensor,
        batch_size=[*B, n_group_1]),
    "group_2": TensorDict(
        "obs_a": Tensor,
        "action": Tensor,
        "done": Tensor,
        "reward": Tensor,
        batch_size=[*B, n_group_2]),
    "state": Tensor,
    batch_size=[*B])
Agents can still share "reward" or "done"; in that case, as above, these keys can be moved out of the groups and into the root.
We can check group membership again; in the group map we can optionally keep:
env.group_map["group_1"] = ["agent_0", "agent_1"]
env.group_map["group_2"] = ["agent_2"]
Example neural network for this case
An example policy
TensorDictSequential(
    # policy_net_1 and policy_net_2 are placeholder nn.Modules, one per group
    TensorDictModule(policy_net_1, in_keys=[("group_1", "obs_a")], out_keys=[("group_1", "action")]),
    TensorDictModule(policy_net_2, in_keys=[("group_2", "obs_a")], out_keys=[("group_2", "action")]),
)
An example policy sharing a hidden state:
TensorDictSequential(
    # encoder_1 and encoder_2 are placeholder nn.Modules, one per group
    TensorDictModule(encoder_1, in_keys=[("group_1", "obs_a")], out_keys=[("group_1", "hidden")]),
    TensorDictModule(encoder_2, in_keys=[("group_2", "obs_a")], out_keys=[("group_2", "hidden")]),
    # concatenate the per-group hidden states along the agent dimension
    TensorDictModule(lambda h1, h2: torch.cat([h1, h2], dim=-2), in_keys=[("group_1", "hidden"), ("group_2", "hidden")], out_keys=["hidden"]),
    # shared_net is a placeholder nn.Module processing the joint hidden state
    TensorDictModule(shared_net, in_keys=["hidden"], out_keys=["hidden_processed"]),
    # split the processed hidden state back into per-group actions
    TensorDictModule(lambda h: (h[..., :n_group_1, :], h[..., n_group_1:, :]), in_keys=["hidden_processed"], out_keys=[("group_1", "action"), ("group_2", "action")]),
)
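A runnable version of the above sketch, filling the placeholder modules with toy linear layers (all sizes are illustrative):

    import torch
    from torch import nn
    from tensordict import TensorDict
    from tensordict.nn import TensorDictModule, TensorDictSequential

    n_group_1, n_group_2, obs_dim, hidden_dim, act_dim = 2, 1, 8, 16, 2

    policy = TensorDictSequential(
        TensorDictModule(nn.Linear(obs_dim, hidden_dim), in_keys=[("group_1", "obs_a")], out_keys=[("group_1", "hidden")]),
        TensorDictModule(nn.Linear(obs_dim, hidden_dim), in_keys=[("group_2", "obs_a")], out_keys=[("group_2", "hidden")]),
        TensorDictModule(
            lambda h1, h2: torch.cat([h1, h2], dim=-2),
            in_keys=[("group_1", "hidden"), ("group_2", "hidden")],
            out_keys=["hidden"],
        ),
        TensorDictModule(nn.Linear(hidden_dim, act_dim), in_keys=["hidden"], out_keys=["hidden_processed"]),
        TensorDictModule(
            lambda h: (h[..., :n_group_1, :], h[..., n_group_1:, :]),
            in_keys=["hidden_processed"],
            out_keys=[("group_1", "action"), ("group_2", "action")],
        ),
    )

    td = TensorDict(
        {
            "group_1": TensorDict({"obs_a": torch.randn(n_group_1, obs_dim)}, batch_size=[n_group_1]),
            "group_2": TensorDict({"obs_a": torch.randn(n_group_2, obs_dim)}, batch_size=[n_group_2]),
        },
        batch_size=[],
    )
    policy(td)
    print(td["group_1", "action"].shape)  # torch.Size([2, 2])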
This API is suited for environments whose APIs are organized around behaviors or groups, such as MLAgents.
Case 3: no agents share processing (groups correspond to individual agents)
All agents can also be independent, each with its own group:
TensorDict(
    "agent_0": TensorDict(
        "obs_a": Tensor,
        "action": Tensor,
        "reward": Tensor,
        "done": Tensor,
        batch_size=[*B]),
    "agent_1": TensorDict(
        "obs_a": Tensor,
        "action": Tensor,
        "reward": Tensor,
        "done": Tensor,
        batch_size=[*B]),
    "agent_2": TensorDict(
        "obs_a": Tensor,
        "action": Tensor,
        "reward": Tensor,
        "done": Tensor,
        batch_size=[*B]),
    "state": Tensor,
    batch_size=[*B])
Again, we can check that each agent belongs to its own group:
env.group_map["agent_0"] = ["agent_0"]
env.group_map["agent_1"] = ["agent_1"]
env.group_map["agent_2"] = ["agent_2"]
Example neural network for this case
Exactly as in case 2, with one module per group. A generic version wired from the group map is sketched below.
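For instance (a minimal sketch; the linear heads, the sizes, and the assumption that each group exposes an "obs_a" entry are all illustrative):

    from torch import nn
    from tensordict.nn import TensorDictModule, TensorDictSequential

    obs_dim, act_dim = 8, 2  # illustrative sizes
    # The (optional) group map described above; here, one group per agent.
    group_map = {"agent_0": ["agent_0"], "agent_1": ["agent_1"], "agent_2": ["agent_2"]}

    # One independent placeholder head per group, wired from the group map.
    policy = TensorDictSequential(
        *[
            TensorDictModule(
                nn.Linear(obs_dim, act_dim),
                in_keys=[(group, "obs_a")],
                out_keys=[(group, "action")],
            )
            for group in group_map
        ]
    )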
This API is suited for environments treating agents as completely independent, such as PettingZoo parallel envs or RLlib.
Important notes (suggested)
- A group is a nested tensordict with an "action" key
- The "reward" and "done" keys can be present either in the root tensordict or in each and every group tensordict, never in both
- The sum of the group sizes equals the number of agents
- Each agent belongs to one and only one group
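For concreteness, here is a minimal sketch of what validating these invariants could look like (check_group_spec is a hypothetical helper, not an existing TorchRL function):

    def check_group_spec(td, group_map):
        # Hypothetical helper validating the suggested invariants on an env tensordict.
        agents = [a for members in group_map.values() for a in members]
        # Each agent belongs to one and only one group.
        assert len(agents) == len(set(agents)), "an agent appears in more than one group"
        for group in group_map:
            # A group is a nested tensordict with an "action" key.
            assert "action" in td.get(group).keys(), f"group {group} has no action entry"
        # "reward" and "done" live either at the root or in every group, never both.
        for key in ("reward", "done"):
            at_root = key in td.keys()
            in_all_groups = all(key in td.get(g).keys() for g in group_map)
            assert at_root != in_all_groups, f"{key} must be at the root or in all groups"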
Changes required in the library
- [Feature] Allow multiple (nested) action, reward, done keys in env, vec_env and collectors (#1462)
- Multiple keys will also have to be accounted for in advantages, losses and modules
@hyerra @smorad @Acciorocketships @pseudo-rnd-thoughts @RiqiangGao @btx0424 @mattiasmar @vmoens @janblumenkamp