
[Discussion] TorchRL MARL API #1463

Closed

@matteobettini

Description

Hello everyone, this discussion is the starting point for an extension of the TorchRL MARL API.
I hope to get your feedback.

Potential TorchRL MARL API

This API proposes a general structure that multi-agent environments can use in TorchRL to pass their data to the library. It will not be enforced. Its core tenet is that data processed by the same neural network structure should be stacked (grouped) together to leverage tensor batching, while data processed by different neural networks should be kept under different keys.

Data format

Agents have observations, done flags, rewards, and actions. These values can be processed by the same component or by different components. If some values across agents are processed by the same component, they should be stacked (grouped) together under the same key. Grouping happens within a TensorDict that carries an additional dimension representing the group size.

Users can optionally maintain in the env a map from each group to its members.

Let's see a few examples.

Case 1: all agents’ data is processed together

In this example, all agents' data will be processed by the same neural network, so it is convenient to stack it, creating a tensordict with an "n_agents" dimension:

TensorDict(
    "agents": (
        "obs_a": Tensor,
        "obs_b": Tensor,
        "action": Tensor,
        "done": Tensor,
        "reward": Tensor,
    batch_size=[*B,n_agents]),
    "state": Tensor,
batch_size=[*B])

In this example, "agents" is the group.
This means that each tensor in "agents" will have a leading shape [*B, n_agents] and can be passed to the same neural network.

Optionally, we can maintain a map from group to agents. Supposing we have 3 agents named "agent_0", "agent_1", "agent_2", we can see that they are all part of the "agents" group by doing:

env.group_map["agents"] = ["agent_0", "agent_1", "agent_2"]
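To make this format concrete, here is a minimal, hypothetical sketch of how an env output matching Case 1 could be assembled with the tensordict library (batch and feature sizes are illustrative assumptions, not part of the API):

import torch
from tensordict import TensorDict

B, n_agents = 32, 3  # illustrative sizes

td = TensorDict(
    {
        "agents": TensorDict(
            {
                "obs_a": torch.randn(B, n_agents, 8),
                "obs_b": torch.randn(B, n_agents, 4),
                "action": torch.randn(B, n_agents, 2),
                "done": torch.zeros(B, n_agents, 1, dtype=torch.bool),
                "reward": torch.zeros(B, n_agents, 1),
            },
            batch_size=[B, n_agents],
        ),
        "state": torch.randn(B, 16),
    },
    batch_size=[B],
)
print(td["agents", "obs_a"].shape)  # torch.Size([32, 3, 8])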

In the above example, all the keys under the "agents" group have an agent dimension. If, on the other hand, some keys are shared (like "state"), they should be put in the root TensorDict outside of the group to highlight that they lack the agent dimension. For example, if done and reward were shared by all agents, we would have:

TensorDict(
    "agents": (
        "obs_a": Tensor,
        "obs_b": Tensor,
        "action": Tensor,
    batch_size=[*B,n_agents]),
    "state": Tensor,
    "done": Tensor,
    "reward": Tensor,
batch_size=[*B])

Example neural network for this case

A policy for this use case can look something like

TensorDictSequential(
    TensorDictModule(in_keys=[("agents","obs_a"),("agents","obs_b")], out_keys=[("agents","action")])
)

A value network for this use case can look something like

TensorDictSequential(
      TensorDictModule(in_keys=[("agents","obs_a"),("agents","obs_b"),"state"], out_keys=["value"]),
)
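To make these two modules concrete, here is a hedged, runnable sketch with tensordict.nn; the architectures, layer sizes, and the way the critic broadcasts the shared state and pools over agents are illustrative choices, not prescribed by the API:

import torch
from torch import nn
from tensordict.nn import TensorDictModule, TensorDictSequential

# Illustrative dimensions; obs/state/action sizes are assumptions.
obs_a_dim, obs_b_dim, state_dim, act_dim = 8, 4, 16, 2

class PolicyNet(nn.Module):
    """Maps per-agent observations to per-agent actions (parameters shared across agents)."""
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(obs_a_dim + obs_b_dim, 64), nn.Tanh(), nn.Linear(64, act_dim))

    def forward(self, obs_a, obs_b):
        # obs_a: [*B, n_agents, obs_a_dim], obs_b: [*B, n_agents, obs_b_dim]
        return self.net(torch.cat([obs_a, obs_b], dim=-1))

class CentralValueNet(nn.Module):
    """One possible centralized critic: broadcasts the shared state over agents, then pools."""
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(obs_a_dim + obs_b_dim + state_dim, 64), nn.Tanh(), nn.Linear(64, 1))

    def forward(self, obs_a, obs_b, state):
        # state: [*B, state_dim] -> broadcast to [*B, n_agents, state_dim]
        state = state.unsqueeze(-2).expand(*obs_a.shape[:-1], state.shape[-1])
        per_agent = self.net(torch.cat([obs_a, obs_b, state], dim=-1))
        return per_agent.mean(-2)  # pool over agents -> value of shape [*B, 1]

policy = TensorDictSequential(
    TensorDictModule(PolicyNet(),
                     in_keys=[("agents", "obs_a"), ("agents", "obs_b")],
                     out_keys=[("agents", "action")]),
)

value = TensorDictSequential(
    TensorDictModule(CentralValueNet(),
                     in_keys=[("agents", "obs_a"), ("agents", "obs_b"), "state"],
                     out_keys=["value"]),
)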

Note that even if the agents share the same processing, different parameters can be used for each agent via the use of vmap.
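A minimal sketch of that vmap pattern using torch.func directly (sizes are illustrative; for brevity it only vmaps over the agent dimension, while in practice extra batch dimensions would be handled too):

import copy
import torch
from torch import nn
from torch.func import stack_module_state, functional_call, vmap

n_agents, obs_dim, act_dim = 3, 8, 2  # illustrative sizes

# One network per agent: same architecture, independent parameters.
nets = [nn.Linear(obs_dim, act_dim) for _ in range(n_agents)]
params, buffers = stack_module_state(nets)   # parameters stacked along a new leading agent dim
base = copy.deepcopy(nets[0]).to("meta")     # stateless template used for the functional call

def forward_one(p, b, obs):
    return functional_call(base, (p, b), (obs,))

obs = torch.randn(n_agents, obs_dim)                 # one observation per agent
actions = vmap(forward_one)(params, buffers, obs)    # shape [n_agents, act_dim]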

This API is currently supported in TorchRL and can be used with VMAS. You can see how in this tutorial.

Case 2: some groups of agents share data processing

Sometimes only some of the agents share data processing. This can be because agents are physically different (heterogeneous) or have different behaviors (neural networks) associated with them (as in MLAgents). Once again, we use tensordicts to group the agents that share data processing:

TensorDict(
    "group_1": (
        "obs_a": Tensor,
        "action": Tensor,
        "done": Tensor,
        "reward": Tensor,
    batch_size=[*B, n_group_1]),
    "group_2": (
        "obs_a": Tensor,
        "action": Tensor,
        "done": Tensor,
        "reward": Tensor,
    batch_size=[*B, n_group_2]),
    "state": Tensor,
batch_size=[*B])

Agents can still share "reward" or "done"; in this case, as above, these keys can be moved out of the groups into the root TensorDict.

Again, we can optionally keep track of the group membership in the group map:

env.group_map["group_1"] = ["agent_0", "agent_1"]
env.group_map["group_2"] = ["agent_2"]

Example neural network for this case

An example policy:

TensorDictSequential(
    TensorDictModule(in_keys=[("group_1","obs_a")], out_keys=[("group_1","action")]),
    TensorDictModule(in_keys=[("group_2","obs_a")], out_keys=[("group_2","action")]),
)

An example policy sharing a hidden state:

TensorDictSequential(
    TensorDictModule(in_keys=[("group_1","obs_a")], out_keys=[("group_1","hidden")]),
    TensorDictModule(in_keys=[("group_2","obs_a")], out_keys=[("group_2","hidden")]),
    Module(lambda y1, y2: torch.cat([y1, y2], -2), in_keys=[("group_1","hidden"), ("group_2","hidden")], out_keys=["hidden_groups"]),
    TensorDictModule(in_keys=["hidden_groups"], out_keys=["hidden_processed"]),
    Module(lambda y: (y[..., :n_group_1, :], y[..., n_group_1:, :]), in_keys=["hidden_processed"], out_keys=[("group_1","action"), ("group_2","action")]),
)
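For reference, a hedged, runnable version of this sketch with tensordict.nn; the linear layers, sizes, and the lambda-based concat/split steps are illustrative assumptions:

import torch
from torch import nn
from tensordict import TensorDict
from tensordict.nn import TensorDictModule, TensorDictSequential

B, n_group_1, n_group_2 = 4, 2, 1        # illustrative sizes
obs_dim, hidden_dim, act_dim = 8, 16, 2

policy = TensorDictSequential(
    # Per-group encoders producing a per-agent hidden state
    TensorDictModule(nn.Linear(obs_dim, hidden_dim),
                     in_keys=[("group_1", "obs_a")], out_keys=[("group_1", "hidden")]),
    TensorDictModule(nn.Linear(obs_dim, hidden_dim),
                     in_keys=[("group_2", "obs_a")], out_keys=[("group_2", "hidden")]),
    # Concatenate the hidden states along the agent dimension
    TensorDictModule(lambda h1, h2: torch.cat([h1, h2], dim=-2),
                     in_keys=[("group_1", "hidden"), ("group_2", "hidden")],
                     out_keys=["hidden_groups"]),
    # Joint processing over all agents
    TensorDictModule(nn.Linear(hidden_dim, act_dim),
                     in_keys=["hidden_groups"], out_keys=["hidden_processed"]),
    # Split back into per-group actions
    TensorDictModule(lambda y: (y[..., :n_group_1, :], y[..., n_group_1:, :]),
                     in_keys=["hidden_processed"],
                     out_keys=[("group_1", "action"), ("group_2", "action")]),
)

td = TensorDict({
    "group_1": TensorDict({"obs_a": torch.randn(B, n_group_1, obs_dim)}, batch_size=[B, n_group_1]),
    "group_2": TensorDict({"obs_a": torch.randn(B, n_group_2, obs_dim)}, batch_size=[B, n_group_2]),
}, batch_size=[B])
td = policy(td)  # fills ("group_1","action") and ("group_2","action")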

This API is suited for environments whose APIs use behaviors or groups, such as MLAgents.

Case 3: no agents share processing (groups correspond to individual agents)

All agents can also be independent, each with its own group:

TensorDict(
    "agent_0": (
        "obs_a": Tensor,
        "action": Tensor,
        "reward": Tensor,
        "done": Tensor,
    batch_size=[*B]),
     "agent_1": (
        "obs_a": Tensor,
        "action": Tensor,
        "reward": Tensor,
        "done": Tensor,
    batch_size=[*B]),
    "agent_2": (
        "obs_a": Tensor,
        "action": Tensor,
        "reward": Tensor,
        "done": Tensor,
    batch_size=[*B]),
    "state": Tensor,
batch_size=[*B])

Again, we can check that each agent belongs to its own group:

env.group_map["agent_0"] = ["agent_0"]
env.group_map["agent_1"] = ["agent_1"]
env.group_map["agent_2"] = ["agent_2"]

Example neural network for this case

Exactly as in Case 2.

This API is suited for environments treating agents as completely independent, such as PettingZoo parallel envs or RLlib.

Important notes (suggested)

  • A group is a nested tensordict with an action key.
  • The reward and done keys can be present either in the root tensordict or in each and every group tensordict, but not in both.
  • The sum of the group sizes is the number of agents.
  • Each agent has to belong to one and only one group.
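These conventions could be checked with a small, hypothetical helper like the following (check_marl_spec is not an existing TorchRL function):

from tensordict import TensorDict

def check_marl_spec(td: TensorDict, group_map: dict, n_agents: int) -> None:
    """Hypothetical sanity check for the conventions listed above."""
    groups = list(group_map)
    # Each agent belongs to one and only one group,
    # and the sum of the group sizes is the number of agents.
    members = [agent for group in groups for agent in group_map[group]]
    assert len(members) == len(set(members)) == n_agents
    # A group is a nested tensordict with an action key.
    for group in groups:
        assert "action" in td.get(group).keys()
    # reward/done live either in the root td or in every group td, never in both.
    for key in ("reward", "done"):
        in_root = key in td.keys()
        in_all_groups = all(key in td.get(g).keys() for g in groups)
        assert in_root != in_all_groups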

Changes required in the library

@hyerra @smorad @Acciorocketships @pseudo-rnd-thoughts @RiqiangGao @btx0424 @mattiasmar @vmoens @janblumenkamp
