[docs] Documentation for POCA and cooperative behaviors #5056
@@ -29,7 +29,7 @@
- [Rewards Summary & Best Practices](#rewards-summary--best-practices)
- [Agent Properties](#agent-properties)
- [Destroying an Agent](#destroying-an-agent)
- [Defining Teams for Multi-agent Scenarios](#defining-teams-for-multi-agent-scenarios)
- [Defining Multi-agent Scenarios](#defining-multi-agent-scenarios)
- [Recording Demonstrations](#recording-demonstrations)

An agent is an entity that can observe its environment, decide on the best

@@ -537,7 +537,7 @@ the padded observations. Note that attention layers are invariant to
the order of the entities, so there is no need to properly "order" the
entities before feeding them into the `BufferSensor`.

The the `BufferSensorComponent` Editor inspector have two arguments:
The `BufferSensorComponent` Editor inspector has two arguments:
- `Observation Size` : This is how many floats each entity will be
represented with. This number is fixed and all entities must
have the same representation. For example, if the entities you want to

@@ -900,7 +900,9 @@ is always at least one Agent training at all times by either spawning a new
Agent every time one is destroyed or by re-spawning new Agents when the whole
environment resets.

## Defining Teams for Multi-agent Scenarios
## Defining Multi-agent Scenarios

### Teams for Adversarial Scenarios

Self-play is triggered by including the self-play hyperparameter hierarchy in
the [trainer configuration](Training-ML-Agents.md#training-configurations). To

@@ -927,6 +929,99 @@ provide examples of symmetric games. To train an asymmetric game, specify
trainer configurations for each of your behavior names and include the self-play
hyperparameter hierarchy in both.

### Groups for Cooperative Scenarios

Cooperative behavior in ML-Agents can be enabled by instantiating a `SimpleMultiAgentGroup`,
typically in an environment controller or similar script, and adding agents to it
using the `RegisterAgent(Agent agent)` method. Note that all agents added to the same `SimpleMultiAgentGroup`
must have the same behavior name and Behavior Parameters. Using `SimpleMultiAgentGroup` enables the
agents within a group to learn how to work together to achieve a common goal (i.e.,
maximize a group-given reward), even if one or more of the group members are removed
before the episode ends. You can then use this group to add/set rewards, end or interrupt episodes
at a group level using the `AddGroupReward()`, `SetGroupReward()`, `EndGroupEpisode()`, and
`GroupEpisodeInterrupted()` methods. For example:

```csharp
// Create a Multi Agent Group in Start() or Initialize()
m_AgentGroup = new SimpleMultiAgentGroup();

// Register agents in group at the beginning of an episode
foreach (var agent in AgentList)
{
    m_AgentGroup.RegisterAgent(agent);
}

// If the team scores a goal
m_AgentGroup.AddGroupReward(rewardForGoal);

// If the goal is reached and the episode is over
m_AgentGroup.EndGroupEpisode();
ResetScene();

// If time ran out and we need to interrupt the episode
m_AgentGroup.GroupEpisodeInterrupted();
ResetScene();
```
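
These calls are fragments rather than a complete script. As a rough sketch of how they might fit together (the `CoopEnvController` class, the `AgentList` field, the `OnGoalScored` hook, and the `ResetScene()` helper are hypothetical names, not part of the ML-Agents API):

```csharp
using System.Collections.Generic;
using Unity.MLAgents;
using UnityEngine;

// Hypothetical environment controller showing where the calls above might live.
public class CoopEnvController : MonoBehaviour
{
    public List<Agent> AgentList;    // the group's agents, assigned in the Inspector
    public float rewardForGoal = 1f; // hypothetical group reward for scoring

    SimpleMultiAgentGroup m_AgentGroup;

    void Start()
    {
        // Create the group once and register every agent in it.
        m_AgentGroup = new SimpleMultiAgentGroup();
        foreach (var agent in AgentList)
        {
            m_AgentGroup.RegisterAgent(agent);
        }
    }

    // Hypothetical hook called by game logic when the team scores.
    public void OnGoalScored()
    {
        m_AgentGroup.AddGroupReward(rewardForGoal);
        m_AgentGroup.EndGroupEpisode();
        ResetScene();
    }

    void ResetScene()
    {
        // Hypothetical reset: reposition agents, targets, timers, etc.
    }
}
```
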
Multi Agent Groups should be used with the MA-POCA trainer, which is explicitly designed to train
cooperative environments. This can be enabled by using the `poca` trainer - see the
[training configurations](Training-Configuration-File.md) doc for more information on
configuring MA-POCA. When using MA-POCA, agents which are deactivated or removed from the Scene
during the episode will still learn to contribute to the group's long term rewards, even
if they are not active in the scene to experience them.

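
As a hedged illustration (the behavior name and all hyperparameter values below are placeholders, not recommendations - consult [Training-Configuration-File.md](Training-Configuration-File.md) for the authoritative options), selecting the MA-POCA trainer in a trainer configuration file might look like:

```yaml
behaviors:
  MyCooperativeBehavior:   # placeholder; must match the agents' Behavior Name
    trainer_type: poca     # selects the MA-POCA trainer
    hyperparameters:
      batch_size: 1024
      buffer_size: 10240
      learning_rate: 3.0e-4
    network_settings:
      hidden_units: 128
      num_layers: 2
    reward_signals:
      extrinsic:
        gamma: 0.99
        strength: 1.0
    max_steps: 5000000
    time_horizon: 64
    summary_freq: 10000
```
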
**NOTE**: Groups differ from Teams (for competitive settings) in the following way - Agents
working together should be added to the same Group, while agents playing against each other
should be given different Team Ids. If in the Scene there is one playing field and two teams,
there should be two Groups, one for each team, and each team should be assigned a different
Team Id. If this playing field is duplicated many times in the Scene (e.g. for training
speedup), there should be two Groups _per playing field_, and two unique Team Ids
_for the entire Scene_. In environments with both Groups and Team Ids configured, MA-POCA and
self-play can be used together for training. In the diagram below, there are two agents on each team,
and two playing fields where teams are pitted against each other. All the blue agents should share a Team Id
(and the orange ones a different ID), and there should be four group managers, one per pair of agents.

<p align="center">
  <img src="images/groupmanager_teamid.png"
       alt="Group Manager vs Team Id"
       width="650" border="10" />
</p>
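
To make the Group/Team distinction concrete, here is a minimal sketch for a single playing field. The `FieldController` class and its agent arrays are hypothetical, and Team Ids are normally set on each agent's Behavior Parameters component in the Inspector rather than in code; setting them from script is shown here only to make the pairing explicit:

```csharp
using Unity.MLAgents;
using Unity.MLAgents.Policies;
using UnityEngine;

// Hypothetical per-playing-field controller: one group per team on this field,
// while Team Ids 0 and 1 are shared by all blue/orange agents across the whole Scene.
public class FieldController : MonoBehaviour
{
    public Agent[] BlueAgents;    // hypothetical arrays filled in the Inspector
    public Agent[] OrangeAgents;

    SimpleMultiAgentGroup m_BlueGroup;
    SimpleMultiAgentGroup m_OrangeGroup;

    void Start()
    {
        m_BlueGroup = new SimpleMultiAgentGroup();
        m_OrangeGroup = new SimpleMultiAgentGroup();

        foreach (var agent in BlueAgents)
        {
            agent.GetComponent<BehaviorParameters>().TeamId = 0; // same Id on every field
            m_BlueGroup.RegisterAgent(agent);
        }
        foreach (var agent in OrangeAgents)
        {
            agent.GetComponent<BehaviorParameters>().TeamId = 1; // same Id on every field
            m_OrangeGroup.RegisterAgent(agent);
        }
    }
}
```
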
#### Cooperative Behaviors Notes and Best Practices

* An agent can only be registered to one MultiAgentGroup at a time. If you want to re-assign an
agent from one group to another, you have to unregister it from the current group first.

* Agents with different behavior names in the same group are not supported.

> Review comment: This is the first time this is mentioned (I think). This section is a summary, so it should be called out earlier.

* Agents within groups should always set the `Max Steps` parameter in the Agent script to 0.
Instead, handle Max Steps using the MultiAgentGroup by ending the episode for the entire
Group using `GroupEpisodeInterrupted()`, as shown in the sketch after this list.

* `EndGroupEpisode` and `GroupEpisodeInterrupted` do the same job in the game, but have
slightly different effects on the training. If the episode is completed, you would want to call
`EndGroupEpisode`. But if the episode is not over but it has been running for enough steps, i.e.
reaching max step, you would call `GroupEpisodeInterrupted`.

> Review comment (on lines +1000 to +1003): I think this is not specific to GroupTraining and should be called out in a more general documentation page.

> Review comment: I guess we never explicitly called this out since we handle all the max_step stuff for single agent so users don't need to know about this. The only place that used

* If an agent finishes early (e.g., it completes its task, or is removed or killed in the game), do not call
`EndEpisode()` on the Agent. Instead, disable the agent and re-enable it when the next episode starts,
or destroy the agent entirely. This is because calling `EndEpisode()` will call `OnEpisodeBegin()`, which
will reset the agent immediately. While it is possible to call `EndEpisode()` in this way, it is usually not the
desired behavior when training groups of agents.

* If an agent that was disabled in a scene needs to be re-enabled, it must be re-registered to the MultiAgentGroup.

* Group rewards are meant to reinforce agents to act in the group's best interest instead of
individual ones, and are treated differently than individual agent rewards during
training. So calling `AddGroupReward()` is not equivalent to calling `Agent.AddReward()` on each agent
in the group.

* You can still add incremental rewards to agents using `Agent.AddReward()` if they are
in a Group. These rewards will only be given to those agents and are received when the
Agent is active.

* Environments which use Multi Agent Groups can be trained using PPO or SAC, but agents will
not be able to learn from group rewards after deactivation/removal, nor will they behave as cooperatively.

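As referenced in the `Max Steps` note above, the following hedged sketch ties several of these points together; the controller class, its fields, and the `RemoveAgentFromPlay`/`ReviveAgent` helpers are hypothetical names, not ML-Agents API:

```csharp
using System.Collections.Generic;
using Unity.MLAgents;
using UnityEngine;

// Hypothetical controller illustrating the notes above: group-level max step,
// disabling (not EndEpisode-ing) agents that leave play, and re-registering them later.
public class GroupBestPracticesController : MonoBehaviour
{
    public List<Agent> AgentList;          // each agent's Max Step is set to 0 in the Inspector
    public int MaxEnvironmentSteps = 5000; // hypothetical group-level step limit

    SimpleMultiAgentGroup m_Group;
    int m_StepCount;

    void Start()
    {
        m_Group = new SimpleMultiAgentGroup();
        foreach (var agent in AgentList)
        {
            m_Group.RegisterAgent(agent);
        }
    }

    void FixedUpdate()
    {
        // Handle "max step" for the whole group instead of per agent.
        m_StepCount += 1;
        if (m_StepCount >= MaxEnvironmentSteps)
        {
            m_Group.GroupEpisodeInterrupted();
            m_StepCount = 0;
            // ...reset the scene here...
        }
    }

    // Hypothetical: called when an agent is "killed" or finishes its task early.
    public void RemoveAgentFromPlay(Agent agent)
    {
        // An individual reward still goes only to this agent.
        agent.AddReward(-0.1f);
        // Do NOT call agent.EndEpisode(); just deactivate it until the next episode.
        agent.gameObject.SetActive(false);
    }

    // Hypothetical: called when a disabled agent re-enters play in a new episode.
    public void ReviveAgent(Agent agent)
    {
        agent.gameObject.SetActive(true);
        // A re-enabled agent must be re-registered with the group.
        m_Group.RegisterAgent(agent);
    }
}
```
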
## Recording Demonstrations

In order to record demonstrations from an agent, add the

@@ -553,7 +553,7 @@ In addition to the three environment-agnostic training methods introduced in the
previous section, the ML-Agents Toolkit provides additional methods that can aid
in training behaviors for specific types of environments.

### Training in Multi-Agent Environments with Self-Play
### Training in Competitive Multi-Agent Environments with Self-Play

ML-Agents provides the functionality to train both symmetric and asymmetric
adversarial games with

@@ -588,6 +588,37 @@ our
[blog post on self-play](https://blogs.unity3d.com/2020/02/28/training-intelligent-adversaries-using-self-play-with-ml-agents/)
for additional information.

### Training In Cooperative Multi-Agent Environments with MA-POCA



ML-Agents provides the functionality for training cooperative behaviors - i.e.,
groups of agents working towards a common goal, where the success of the individual
is linked to the success of the whole group. In such a scenario, agents typically receive
rewards as a group. For instance, if a team of agents wins a game against an opposing
team, everyone is rewarded - even agents who did not directly contribute to the win. This
makes learning what to do as an individual difficult - you may get a win
for doing nothing, and a loss for doing your best.

In ML-Agents, we provide MA-POCA (MultiAgent POsthumous Credit Assignment), which
is a novel multi-agent trainer that trains a _centralized critic_, a neural network
that acts as a "coach" for a whole group of agents. You can then give rewards to the team
as a whole, and the agents will learn how best to contribute to achieving that reward.
Agents can _also_ be given rewards individually, and the team will work together to help the
individual achieve those goals. During an episode, agents can be added or removed from the group,
such as when agents spawn or die in a game. If agents are removed mid-episode (e.g., if teammates die
or are removed from the game), they will still learn whether their actions contributed
to the team winning later, enabling agents to take group-beneficial actions even if
they result in the individual being removed from the game (i.e., self-sacrifice).
MA-POCA can also be combined with self-play to train teams of agents to play against each other.

> Review comment: Should we say "paper coming soon" or something?

> Review comment: I think it is fine to not say anything. Although I am worried someone will coin the name.

To learn more about enabling cooperative behaviors for agents in an ML-Agents environment,
check out [this page](Learning-Environment-Design-Agents.md#cooperative-scenarios).

For further reading, MA-POCA builds on previous work in multi-agent cooperative learning
([Lowe et al.](https://arxiv.org/abs/1706.02275), [Foerster et al.](https://arxiv.org/pdf/1705.08926.pdf),
among others) to enable the above use-cases.

### Solving Complex Tasks using Curriculum Learning

Curriculum learning is a way of training a machine learning model where more