Description
Hello everyone, this discussion is the starting point for an extension of the TorchRL MARL API.
We hope to get your feedback.
Potential TorchRL MARL API
This API proposes a general structure that multi-agent environments can use in TorchRL to pass their data to the library. It will not be enforced. Its core tenet is that data processed by the same neural network structure should be stacked (grouped) together to leverage tensor batching, while data processed by different neural networks should be kept under different keys.
Data format
Agents have observations, dones, rewards and actions. These values can be processed by the same component or by different components. If some values across agents are processed by the same component, they should be stacked (grouped) together under the same key. Grouping happens within a TensorDict, with an additional dimension representing the group size.
Users can optionally maintain in the env a map from each group to its member agents.
Let's see a few examples.
Case 1: all agents’ data is processed together
In this example, all agents' data will be processed by the same neural network, so it is convenient to stack it, creating a tensordict with an "n_agents" dimension:
TensorDict(
    "agents": TensorDict(
        "obs_a": Tensor,
        "obs_b": Tensor,
        "action": Tensor,
        "done": Tensor,
        "reward": Tensor,
        batch_size=[*B, n_agents]),
    "state": Tensor,
    batch_size=[*B])
In this example "agents" is the group.
This means that each tensor in "agents" has a leading shape [*B, n_agents] and can be passed to the same neural network.
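For instance, a tensordict with this layout can be built as follows (a minimal sketch; all feature shapes here are made up for illustration):

    import torch
    from tensordict import TensorDict

    B, n_agents = 32, 3  # illustrative batch and group sizes

    td = TensorDict(
        {
            "agents": TensorDict(
                {
                    "obs_a": torch.randn(B, n_agents, 8),
                    "obs_b": torch.randn(B, n_agents, 4),
                    "action": torch.randn(B, n_agents, 2),
                    "done": torch.zeros(B, n_agents, 1, dtype=torch.bool),
                    "reward": torch.zeros(B, n_agents, 1),
                },
                batch_size=[B, n_agents],
            ),
            "state": torch.randn(B, 16),
        },
        batch_size=[B],
    )

    print(td["agents", "obs_a"].shape)  # torch.Size([32, 3, 8])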
Optionally, we can maintain a map from groups to agents. Supposing we have 3 agents named "agent_0", "agent_1" and "agent_2", we can see that they are all part of the "agents" group:
env.group_map["agents"] = ["agent_0", "agent_1", "agent_2"]
In the above example, all the keys under the "agents" group have an agent dimension. If, on the other hand, some keys are shared across agents (like "state"), they should be put in the root TensorDict, outside of the group, to highlight that they lack the agent dimension. For example, if done and reward were shared by all agents, we would have:
TensorDict(
    "agents": TensorDict(
        "obs_a": Tensor,
        "obs_b": Tensor,
        "action": Tensor,
        batch_size=[*B, n_agents]),
    "state": Tensor,
    "done": Tensor,
    "reward": Tensor,
    batch_size=[*B])
Example neural network for this case
A policy for this use case can look something like
TensorDictSequential(
    # policy_net is a placeholder nn.Module mapping the stacked observations to actions
    TensorDictModule(policy_net, in_keys=[("agents", "obs_a"), ("agents", "obs_b")], out_keys=[("agents", "action")]),
)
A value network for this use case can look something like
TensorDictSequential(
    # value_net is a placeholder nn.Module; note that "value" has no agent dimension here
    TensorDictModule(value_net, in_keys=[("agents", "obs_a"), ("agents", "obs_b"), "state"], out_keys=["value"]),
)
Note that even if the agents share the same processing, different parameters can be used for each agent via the use of vmap.
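For example (a minimal sketch using plain torch.func rather than any TorchRL-specific module; the per-agent linear policies and all shapes are illustrative):

    import copy
    import torch
    from torch import nn
    from torch.func import functional_call, stack_module_state

    n_agents, obs_dim, act_dim = 3, 8, 2

    # One network per agent: same architecture, independent parameters.
    nets = [nn.Linear(obs_dim, act_dim) for _ in range(n_agents)]
    params, buffers = stack_module_state(nets)

    # A stateless "meta" copy serves as the template for functional_call.
    base = copy.deepcopy(nets[0]).to("meta")

    def forward_one(p, b, obs):
        return functional_call(base, (p, b), (obs,))

    obs = torch.randn(n_agents, obs_dim)  # stacked per-agent observations
    # vmap maps jointly over the agent dimension of parameters and observations.
    action = torch.vmap(forward_one)(params, buffers, obs)
    print(action.shape)  # torch.Size([3, 2])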
This API is currently supported in TorchRL and can be used with VMAS. You can see how in this tutorial.
Case 2: some groups of agents share data processing
Sometimes only some of the agents share data processing. This can be because agents are physically different (heterogeneous) or because different behaviors (neural networks) are associated with them (as in MLAgents). Once again, we use tensordicts to group agents that share data processing:
TensorDict(
    "group_1": TensorDict(
        "obs_a": Tensor,
        "action": Tensor,
        "done": Tensor,
        "reward": Tensor,
        batch_size=[*B, n_group_1]),
    "group_2": TensorDict(
        "obs_a": Tensor,
        "action": Tensor,
        "done": Tensor,
        "reward": Tensor,
        batch_size=[*B, n_group_2]),
    "state": Tensor,
    batch_size=[*B])
Agents can still share "reward" or "done"; in that case, as above, these keys can be moved out of the groups and into the root.
We can check group membership again; in the group map we can optionally keep:
env.group_map["group_1"] = ["agent_0", "agent_1"]
env.group_map["group_2"] = ["agent_2"]
Example neural network for this case
An example policy
TensorDictSequential(
    # policy_net_1 and policy_net_2 are placeholder nn.Modules, one per group
    TensorDictModule(policy_net_1, in_keys=[("group_1", "obs_a")], out_keys=[("group_1", "action")]),
    TensorDictModule(policy_net_2, in_keys=[("group_2", "obs_a")], out_keys=[("group_2", "action")]),
)
An example policy sharing a hidden state:
TensorDictSequential(
    # encoder_1 and encoder_2 are placeholder nn.Modules, one per group
    TensorDictModule(encoder_1, in_keys=[("group_1", "obs_a")], out_keys=[("group_1", "hidden")]),
    TensorDictModule(encoder_2, in_keys=[("group_2", "obs_a")], out_keys=[("group_2", "hidden")]),
    # concatenate the per-group hidden states along the agent dimension
    TensorDictModule(lambda h1, h2: torch.cat([h1, h2], dim=-2), in_keys=[("group_1", "hidden"), ("group_2", "hidden")], out_keys=["hidden"]),
    # shared_net is a placeholder nn.Module processing the joint hidden state
    TensorDictModule(shared_net, in_keys=["hidden"], out_keys=["hidden_processed"]),
    # split the processed hidden state back into per-group actions
    TensorDictModule(lambda h: (h[..., :n_group_1, :], h[..., n_group_1:, :]), in_keys=["hidden_processed"], out_keys=[("group_1", "action"), ("group_2", "action")]),
)
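A runnable version of the above sketch, filling the placeholder modules with toy linear layers (all sizes are illustrative):

    import torch
    from torch import nn
    from tensordict import TensorDict
    from tensordict.nn import TensorDictModule, TensorDictSequential

    n_group_1, n_group_2, obs_dim, hidden_dim, act_dim = 2, 1, 8, 16, 2

    policy = TensorDictSequential(
        TensorDictModule(nn.Linear(obs_dim, hidden_dim), in_keys=[("group_1", "obs_a")], out_keys=[("group_1", "hidden")]),
        TensorDictModule(nn.Linear(obs_dim, hidden_dim), in_keys=[("group_2", "obs_a")], out_keys=[("group_2", "hidden")]),
        TensorDictModule(
            lambda h1, h2: torch.cat([h1, h2], dim=-2),
            in_keys=[("group_1", "hidden"), ("group_2", "hidden")],
            out_keys=["hidden"],
        ),
        TensorDictModule(nn.Linear(hidden_dim, act_dim), in_keys=["hidden"], out_keys=["hidden_processed"]),
        TensorDictModule(
            lambda h: (h[..., :n_group_1, :], h[..., n_group_1:, :]),
            in_keys=["hidden_processed"],
            out_keys=[("group_1", "action"), ("group_2", "action")],
        ),
    )

    td = TensorDict(
        {
            "group_1": TensorDict({"obs_a": torch.randn(n_group_1, obs_dim)}, batch_size=[n_group_1]),
            "group_2": TensorDict({"obs_a": torch.randn(n_group_2, obs_dim)}, batch_size=[n_group_2]),
        },
        batch_size=[],
    )
    policy(td)
    print(td["group_1", "action"].shape)  # torch.Size([2, 2])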
This API is suited for environments whose APIs are organized around behaviors or groups, such as MLAgents.
Case 3: no agents share processing (groups correspond to individual agents)
All agents can also be independent, each with its own group:
TensorDict(
    "agent_0": TensorDict(
        "obs_a": Tensor,
        "action": Tensor,
        "reward": Tensor,
        "done": Tensor,
        batch_size=[*B]),
    "agent_1": TensorDict(
        "obs_a": Tensor,
        "action": Tensor,
        "reward": Tensor,
        "done": Tensor,
        batch_size=[*B]),
    "agent_2": TensorDict(
        "obs_a": Tensor,
        "action": Tensor,
        "reward": Tensor,
        "done": Tensor,
        batch_size=[*B]),
    "state": Tensor,
    batch_size=[*B])
Again, we can check that each agent belongs to its own group:
env.group_map["agent_0"] = ["agent_0"]
env.group_map["agent_1"] = ["agent_1"]
env.group_map["agent_2"] = ["agent_2"]
Example neural network for this case
Exactly as in case 2, with one module per group. A generic version wired from the group map is sketched below.
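For instance (a minimal sketch; the linear heads, the sizes, and the assumption that each group exposes an "obs_a" entry are all illustrative):

    from torch import nn
    from tensordict.nn import TensorDictModule, TensorDictSequential

    obs_dim, act_dim = 8, 2  # illustrative sizes
    # The (optional) group map described above; here, one group per agent.
    group_map = {"agent_0": ["agent_0"], "agent_1": ["agent_1"], "agent_2": ["agent_2"]}

    # One independent placeholder head per group, wired from the group map.
    policy = TensorDictSequential(
        *[
            TensorDictModule(
                nn.Linear(obs_dim, act_dim),
                in_keys=[(group, "obs_a")],
                out_keys=[(group, "action")],
            )
            for group in group_map
        ]
    )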
This API is suited for environments treating agents as completely independent, such as PettingZoo parallel envs or RLlib.
Important notes (suggested)
- A group is a nested tensordict with an "action" key
- The "reward" and "done" keys can be present either in the root tensordict or in each and every group tensordict, never in both
- The sum of the group sizes equals the number of agents
- Each agent belongs to one and only one group
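For concreteness, here is a minimal sketch of what validating these invariants could look like (check_group_spec is a hypothetical helper, not an existing TorchRL function):

    def check_group_spec(td, group_map):
        # Hypothetical helper validating the suggested invariants on an env tensordict.
        agents = [a for members in group_map.values() for a in members]
        # Each agent belongs to one and only one group.
        assert len(agents) == len(set(agents)), "an agent appears in more than one group"
        for group in group_map:
            # A group is a nested tensordict with an "action" key.
            assert "action" in td.get(group).keys(), f"group {group} has no action entry"
        # "reward" and "done" live either at the root or in every group, never both.
        for key in ("reward", "done"):
            at_root = key in td.keys()
            in_all_groups = all(key in td.get(g).keys() for g in group_map)
            assert at_root != in_all_groups, f"{key} must be at the root or in all groups"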
Changes required in the library
- [Feature] Allow multiple (nested) action, reward, done keys in env, vec_env and collectors (#1462)
- Multiple keys will also have to be accounted for in advantages, losses and modules
@hyerra @smorad @Acciorocketships @pseudo-rnd-thoughts @RiqiangGao @btx0424 @mattiasmar @vmoens @janblumenkamp