[docs] Documentation for POCA and cooperative behaviors #5056


Merged: 300 commits merged on Mar 12, 2021

Changes from all commits (300 commits)
d2e315d
Make comms one-hot
Dec 21, 2020
5cf76e3
Fix S tag
Dec 23, 2020
8708f70
Merge branch 'master' into develop-centralizedcritic-mm
Jan 4, 2021
44fb8b5
Additional changes
Jan 4, 2021
56f9dbf
Some more fixes
Jan 4, 2021
a468075
Self-attention Centralized Critic
Jan 6, 2021
db184d9
separate entity encoder and RSA
andrewcoh Jan 11, 2021
32cbdee
clean up args in mha
andrewcoh Jan 11, 2021
c90472c
more cleanups
andrewcoh Jan 11, 2021
d429b53
fixed tests
andrewcoh Jan 11, 2021
44093f2
Merge branch 'develop-attention-refactor' into develop-centralizedcri…
Jan 11, 2021
1dc0059
Merge branch 'develop-attention-refactor' into develop-centralizedcri…
Jan 11, 2021
2b5b994
entity embeddings work with no max
Jan 11, 2021
cd84fe3
remove group id
Jan 11, 2021
eed2fce
very rough sketch for TeamManager interface
Jan 8, 2021
fe41094
One layer for entity embed
Jan 12, 2021
3822b18
Use 4 heads
Jan 12, 2021
3f4b2b5
add defaults to linear encoder, initialize ent encoders
andrewcoh Jan 12, 2021
c7c7d4c
Merge branch 'master' into develop-centralizedcritic-mm
Jan 12, 2021
f391b35
Merge branch 'develop-lin-enc-def' into develop-centralizedcritic-mm
Jan 12, 2021
f706a91
add team manager id to proto
Jan 12, 2021
cee5466
team manager for hallway
Jan 12, 2021
195978c
add manager to hallway
Jan 12, 2021
10f336e
send and process team manager id
Jan 12, 2021
f0bf657
remove print
Jan 12, 2021
e03c79e
Merge branch 'develop-centralizedcritic-mm' into develop-cc-teammanager
Jan 12, 2021
1118089
small cleanup
Jan 13, 2021
13a90b1
default behavior for baseTeamManager
Jan 13, 2021
36d1b5b
add back statsrecorder
Jan 13, 2021
376d500
update
Jan 13, 2021
dd8b5fb
Team manager prototype (#4850)
Jan 13, 2021
8673820
Remove statsrecorder
Jan 13, 2021
fb86a57
Fix AgentProcessor for TeamManager
Jan 13, 2021
1beea7d
Merge branch 'develop-centralizedcritic-mm' into develop-cc-teammanager
Jan 13, 2021
9e69790
team manager
Jan 13, 2021
3c2b9d1
New buffer layout, TeamObsUtil, pad dead agents
Jan 14, 2021
b4b9d72
Use NaNs to get masks for attention
Jan 14, 2021
7d5f3e3
Add team reward to buffer
Jan 15, 2021
b7c5533
Try subtract marginalized value
Jan 15, 2021
53e1277
Add Q function with attention
Jan 20, 2021
2134004
Some more progress - still broken
Jan 20, 2021
60c6071
use singular entity embedding (#4873)
andrewcoh Jan 20, 2021
47cfae4
I think it's running
Jan 20, 2021
d31da21
Actions added but untested
Jan 21, 2021
541d062
Fix issue with team_actions
Jan 22, 2021
d3c4372
Add next action and next team obs
Jan 22, 2021
3407478
separate forward into q_net and baseline
andrewcoh Jan 22, 2021
f84ca50
Merge branch 'develop-centralizedcritic-counterfact' into develop-coma2
andrewcoh Jan 22, 2021
287c1b9
might be right
andrewcoh Jan 22, 2021
f73ef80
forcing this to work
andrewcoh Jan 22, 2021
10a416a
buffer error
andrewcoh Jan 22, 2021
e716199
COMAA runs
andrewcoh Jan 23, 2021
45349b8
add lambda return and target network
andrewcoh Jan 23, 2021
9a6474e
no target net
andrewcoh Jan 24, 2021
04d9617
remove normalize advantages
andrewcoh Jan 24, 2021
5bbb222
add target network back
andrewcoh Jan 24, 2021
2868694
value estimator
andrewcoh Jan 24, 2021
c9b4e71
update coma config
andrewcoh Jan 24, 2021
a10caaf
add target net
andrewcoh Jan 24, 2021
44c616d
no target, increase lambda
andrewcoh Jan 24, 2021
ef01af4
remove prints
andrewcoh Jan 24, 2021
f329e1d
cloud config
andrewcoh Jan 24, 2021
fbd1749
use v return
andrewcoh Jan 25, 2021
908b1df
use target net
andrewcoh Jan 25, 2021
d4073ce
adding zombie to coma2 brnch
andrewcoh Jan 25, 2021
7d8f2b5
add callbacks
andrewcoh Jan 25, 2021
9452239
cloud run with coma2 of held out zombie test env
andrewcoh Jan 25, 2021
39adec6
target of baseline is returns_v
andrewcoh Jan 26, 2021
14bb6fd
remove target update
andrewcoh Jan 26, 2021
7cb5dbc
Add team dones
Jan 26, 2021
761a206
ntegrate teammate dones
andrewcoh Jan 26, 2021
3afae60
add value clipping
andrewcoh Jan 26, 2021
f0dfada
try again on cloud
andrewcoh Jan 26, 2021
c3d8d8e
clipping values and updated zombie
andrewcoh Jan 27, 2021
c3d84c5
update configs
andrewcoh Jan 27, 2021
f5419aa
remove value head clipping
andrewcoh Jan 27, 2021
d7a2386
update zombie config
andrewcoh Jan 27, 2021
cdc6dde
Add trust region to COMA updates
Jan 29, 2021
4f35048
Remove Q-net for perf
Jan 29, 2021
05c8ea1
Weight decay, regularizaton loss
Jan 29, 2021
a7f2fc2
Use same network
Jan 29, 2021
6d2be2c
add base team manager
Feb 1, 2021
b812da4
Remove reg loss, still stable
Feb 4, 2021
0c3dbff
Black format
Feb 4, 2021
09590ad
add team reward field to agent and proto
Feb 5, 2021
c982c06
set team reward
Feb 5, 2021
7e3d976
add maxstep to teammanager and hook to academy
Feb 5, 2021
c40fec0
check agent by agent.enabled
Feb 8, 2021
ffb3f0b
remove manager from academy when dispose
Feb 9, 2021
f87cfbd
move manager
Feb 9, 2021
8b8e916
put team reward in decision steps
Feb 9, 2021
6b71f5a
use 0 as default manager id
Feb 9, 2021
87e97dd
fix setTeamReward
Feb 9, 2021
d3d1dc1
change method name to GetRegisteredAgents
Feb 9, 2021
2ba09ca
address comments
Feb 9, 2021
5587e48
Merge branch 'develop-base-teammanager' into develop-agentprocessor-t…
Feb 9, 2021
7e51ad1
Merge branch 'develop-base-teammanager' into develop-agentprocessor-t…
Feb 9, 2021
f25b171
Revert C# env changes
Feb 9, 2021
128b09b
Remove a bunch of stuff from envs
Feb 9, 2021
4690c4e
Remove a bunch of extra files
Feb 9, 2021
dbdd045
Remove changes from base-teammanager
Feb 9, 2021
30c846f
Remove remaining files
Feb 9, 2021
dd7f867
Remove some unneeded changes
Feb 9, 2021
f36f696
Make buffer typing neater
Feb 9, 2021
a1b7e75
AgentProcessor fixes
Feb 9, 2021
236f398
Back out trainer changes
Feb 9, 2021
96278d0
Separate Actor/Critic, remove ActorCritics
andrewcoh Feb 9, 2021
5f8cbc5
update policy to not use critic
andrewcoh Feb 9, 2021
293ec08
add critic to optimizer, ppo runs
andrewcoh Feb 9, 2021
7d20bd9
fix precommit errors
andrewcoh Feb 9, 2021
a22c621
use delegate to avoid agent-manager cyclic reference
Feb 9, 2021
c669226
fix test_networks
andrewcoh Feb 9, 2021
2dc90a9
put team reward in decision steps
Feb 9, 2021
70207a3
fix unregister agents
Feb 10, 2021
b22d0ae
Update SAC to use separate policy
Feb 10, 2021
49282f6
add teamreward to decision step
Feb 10, 2021
204b45b
typo
Feb 10, 2021
7eacfba
unregister on disabled
Feb 10, 2021
016ffd8
remove OnTeamEpisodeBegin
Feb 10, 2021
d7e2ca6
make critic a property
andrewcoh Feb 10, 2021
8b9d662
change name TeamManager to MultiAgentGroup
Feb 11, 2021
3fb14b9
more team -> group
Feb 11, 2021
4e4ecad
fix tests
Feb 11, 2021
492fd17
fix tests
Feb 11, 2021
9f6eca7
remove commented code
andrewcoh Feb 11, 2021
7292672
Merge remote-tracking branch 'origin/develop-base-teammanager' into d…
Feb 11, 2021
78e052b
Use attention tests from master
Feb 11, 2021
81d8389
Revert "Use attention tests from master"
Feb 11, 2021
39f92c3
Use attention from master
Feb 11, 2021
1d500d6
Renaming fest
Feb 11, 2021
6418e05
Use NamedTuples instead of attrs classes
Feb 11, 2021
944997a
fix saver test
andrewcoh Feb 11, 2021
527ca06
Move value network for SAC to device
Feb 11, 2021
eb15030
Merge remote-tracking branch 'origin/develop-critic-optimizer' into d…
Feb 11, 2021
4d215cf
add SharedActorCritic
andrewcoh Feb 11, 2021
9fac4b1
test for SharedActorCritic
andrewcoh Feb 11, 2021
d5a30f1
fix agent processor test
andrewcoh Feb 11, 2021
6da8dd3
Bug fixes
Feb 11, 2021
ad4a821
remove GroupMaxStep
Feb 12, 2021
9725aa5
add some doc
Feb 12, 2021
65b5992
fix sac shared
andrewcoh Feb 12, 2021
f5190fe
Fix mock brain
Feb 12, 2021
664ae89
np float32 fixes
Feb 12, 2021
8f696f4
more renaming
Feb 12, 2021
31da276
fix test policy
andrewcoh Feb 12, 2021
77557ca
Test for team obs in agentprocessor
Feb 12, 2021
6464cb6
Test for group and add team reward
Feb 12, 2021
cbfdfb3
doc improve
Feb 12, 2021
6badfb5
Merge branch 'master' into develop-base-teammanager
Feb 13, 2021
ef67f53
Merge branch 'master' into develop-base-teammanager
Feb 13, 2021
8e78dbd
Merge branch 'develop-base-teammanager' of https://github.com/Unity-T…
Feb 13, 2021
31ee1c4
store registered agents in set
Feb 16, 2021
1e4c837
remove unused step counts
Feb 17, 2021
cba26b2
Merge branch 'develop-base-teammanager' into develop-agentprocessor-t…
Feb 17, 2021
2113a43
Global group ids
Feb 17, 2021
ba9896c
coma trainer and optimizer
andrewcoh Feb 17, 2021
5679e2f
MultiInputNetBody
andrewcoh Feb 17, 2021
0e28c07
Fix Trajectory test
Feb 19, 2021
6936004
Merge branch 'master' into develop-agentprocessor-teammanager
Feb 23, 2021
97d1b80
Remove duplicated files
Feb 23, 2021
ce2e7b1
Merge branch 'develop-agentprocessor-teammanager' into develop-coma2-…
Feb 23, 2021
dbcf313
Running COMA (not sure if learning)
Feb 23, 2021
9a00053
Add team methods to AgentAction
Feb 23, 2021
fce4ad3
Right loss function for stability, fix some pypi
Feb 23, 2021
2c03d2b
Buffer fixes
Feb 23, 2021
f70c345
Group reward function
Feb 23, 2021
33a27e0
Add PushBlockCollab config and fix some stuff
Feb 24, 2021
b39e873
Fix Team Cumulative Reward
Feb 24, 2021
f879b61
Buffer fixes
Feb 23, 2021
f86e7b4
clean ups (#5003)
andrewcoh Feb 24, 2021
01ca5df
Merge branch 'master' into develop-coma2-trainer
andrewcoh Feb 24, 2021
6d7a604
Add test for GroupObs
Feb 24, 2021
587e3da
Change AgentAction back to 0 pad and add tests
Feb 24, 2021
fd4aa53
Addressed some comments
Feb 24, 2021
8dbea77
Address some comments
Feb 25, 2021
ec9e5ad
Add more comments
Feb 25, 2021
e1f48db
Rename internal function
Feb 25, 2021
d42896a
Move padding method to AgentBufferField
Feb 25, 2021
b3f2689
Merge branch 'main' into develop-agentprocessor-teammanager
Feb 25, 2021
7085461
checkout ppo/optimizer from main
andrewcoh Feb 25, 2021
7005daa
Merge branch 'develop-agentprocessor-teammanager' into develop-coma2-…
andrewcoh Feb 25, 2021
3658310
Fix slicing typing and string printing in AgentBufferField
Feb 25, 2021
b2100c1
Fix slicing typing and string printing in AgentBufferField
Feb 25, 2021
7b1f805
Fix to-flat and add tests
Feb 25, 2021
8096b11
clean ups
andrewcoh Feb 26, 2021
8359ca3
Merge branch 'develop-agentprocessor-teammanager' into develop-coma2-…
andrewcoh Feb 26, 2021
8c18a80
add inital coma optimizer tests
andrewcoh Feb 26, 2021
cc9f5c0
Faster NaN masking, fix masking for visual obs (#5015)
Feb 26, 2021
c34837c
get value estimate test
andrewcoh Mar 3, 2021
107bb3d
Merge branch 'main' into develop-coma2-trainer
Mar 4, 2021
4e82760
inital evaluate_by_seq, does not run
andrewcoh Mar 4, 2021
4b7db51
finished evaluate_by_seq, does not run
andrewcoh Mar 4, 2021
5905680
ignoring precommit, grabbing baseline/critic mems from buffer in trainer
andrewcoh Mar 4, 2021
d7d622a
lstm almost runs
andrewcoh Mar 4, 2021
7ec4b34
lstm runs with coma
andrewcoh Mar 4, 2021
277d66f
[coma2] Make group extrinsic reward part of extrinsic (#5033)
Mar 5, 2021
2a93ca1
[coma2] Add support for variable length obs in COMA2 (#5038)
Mar 5, 2021
743ede0
Fix pypi issues
Mar 5, 2021
edfdbdc
Action slice (#5047)
andrewcoh Mar 5, 2021
c15abe4
Merge branch 'main' into develop-coma2-trainer
Mar 5, 2021
1a01dd4
add torch no_grad to coma LSTM value computation
andrewcoh Mar 8, 2021
e12744a
torch coma tests: lstm, cur, gail
andrewcoh Mar 8, 2021
71da407
Fix warning message format
Mar 8, 2021
17bcd7f
Fix warning message formatting again
Mar 8, 2021
b42f393
Documentation for COMA2 and cooperative behaviors
Mar 8, 2021
2866d08
Add link to Environment Design
Mar 8, 2021
ede29cc
Remove typo
Mar 8, 2021
c46b601
C# multiagent doc (#5058)
Mar 9, 2021
df9ae94
Remove redundant heading
Mar 9, 2021
83a44ef
Add changelog entries
Mar 9, 2021
6be73e6
Address comments
Mar 9, 2021
bf08535
[coma2] Group reward reporting fix (#5065)
Mar 9, 2021
a238fe4
Copy obs before removing NaNs (#5069)
Mar 10, 2021
1837968
Change comments on COMA2 trainer
Mar 10, 2021
d4fe1f2
Merge branch 'main' into develop-coma2-trainer
Mar 10, 2021
369aa86
Properly restore and test COMA2 optimizer
Mar 10, 2021
2071092
multiinput netbody tests
andrewcoh Mar 10, 2021
0a5092b
Merge branch 'develop-coma2-trainer' of https://github.com/Unity-Tech…
andrewcoh Mar 10, 2021
b6e70ce
cleanup some mypy types (#5072)
Mar 10, 2021
2c83696
coma -> poca
Mar 10, 2021
f243181
Rename folders and files
Mar 10, 2021
2df0296
Fix two more coma references
Mar 10, 2021
ff25fc4
Update docs
Mar 10, 2021
a7d2a65
Multiagent simplerl (#5066)
andrewcoh Mar 10, 2021
2d0ee89
add docstrings to network body
andrewcoh Mar 10, 2021
8046811
ource /Users/ervin/.virtualenvs/mlagents-38/bin/activate
Mar 10, 2021
ab6b1d5
Update tests
Mar 10, 2021
0f4201a
rename to MultiAgentNetwork, docstring
andrewcoh Mar 10, 2021
56548dd
fix references to ppo
andrewcoh Mar 10, 2021
3b91d38
docstrings to poca optimizer
andrewcoh Mar 10, 2021
5acbcd6
Update pushblock screenshot
Mar 11, 2021
52ea4a7
Add description of Dungeon Escape
Mar 11, 2021
20c8759
Move common loss functions for PPO and POCA (#5079)
Mar 11, 2021
2ed7f46
Turn on the SimpleMultiAgentGroup
Mar 11, 2021
bb04d14
Add dungeon escape screenshot
Mar 11, 2021
8511f9f
[poca] Remove add_groupmate_rewards from settings (#5082)
Mar 11, 2021
445c1f0
Merge branch 'main' into develop-coma2-trainer
Mar 11, 2021
f98c615
Untrack PB Collab Config
Mar 11, 2021
65af6ff
Update comment and fix reporting of group dones
Mar 11, 2021
adbe1b2
Merge branch 'develop-coma2-trainer' into develop-coma2-docs
Mar 11, 2021
4c0986d
Address comments
Mar 11, 2021
3ed9702
Fix coma reference
Mar 11, 2021
4134d54
Remove mention of envs
Mar 11, 2021
a500410
Add diagram, correct capitalizations
Mar 11, 2021
ea7914e
correct some more capitalizations
Mar 11, 2021
d972351
Address some comments
Mar 11, 2021
ed462aa
Remove dungeon escape
Mar 11, 2021
92ad505
Clean up docs a bit
Mar 12, 2021
12fdc1d
Fix references to MultiAgentGroup
Mar 12, 2021
0ff8ac8
Merge branch 'main' into develop-coma2-docs
Mar 12, 2021
4 changes: 4 additions & 0 deletions com.unity.ml-agents/CHANGELOG.md
@@ -11,7 +11,11 @@ and this project adheres to
### Major Changes
#### com.unity.ml-agents (C#)
- The `BufferSensor` and `BufferSensorComponent` have been added. They allow the Agent to observe a variable number of entities. (#4909)
- The `SimpleMultiAgentGroup` class and `IMultiAgentGroup` interface have been added. These allow Agents to be given rewards and
end episodes in groups. (#4923)
#### ml-agents / ml-agents-envs / gym-unity (Python)
- The MA-POCA trainer has been added. This is a new trainer that enables Agents to learn how to work together in groups. Configure
`poca` as the trainer in the configuration YAML after instantiating a `SimpleMultiAgentGroup` to use this feature. (#5005)

### Minor Changes
#### com.unity.ml-agents / com.unity.ml-agents.extensions (C#)
101 changes: 98 additions & 3 deletions docs/Learning-Environment-Design-Agents.md
@@ -29,7 +29,7 @@
- [Rewards Summary & Best Practices](#rewards-summary--best-practices)
- [Agent Properties](#agent-properties)
- [Destroying an Agent](#destroying-an-agent)
- [Defining Teams for Multi-agent Scenarios](#defining-teams-for-multi-agent-scenarios)
- [Defining Multi-agent Scenarios](#defining-multi-agent-scenarios)
- [Recording Demonstrations](#recording-demonstrations)

An agent is an entity that can observe its environment, decide on the best
@@ -537,7 +537,7 @@ the padded observations. Note that attention layers are invariant to
the order of the entities, so there is no need to properly "order" the
entities before feeding them into the `BufferSensor`.

The the `BufferSensorComponent` Editor inspector have two arguments:
The `BufferSensorComponent` Editor inspector has two arguments:
- `Observation Size` : This is how many floats each entities will be
represented with. This number is fixed and all entities must
have the same representation. For example, if the entities you want to
@@ -900,7 +900,9 @@ is always at least one Agent training at all times by either spawning a new
Agent every time one is destroyed or by re-spawning new Agents when the whole
environment resets.

## Defining Teams for Multi-agent Scenarios
## Defining Multi-agent Scenarios

### Teams for Adversarial Scenarios

Self-play is triggered by including the self-play hyperparameter hierarchy in
the [trainer configuration](Training-ML-Agents.md#training-configurations). To
@@ -927,6 +929,99 @@ provide examples of symmetric games. To train an asymmetric game, specify
trainer configurations for each of your behavior names and include the self-play
hyperparameter hierarchy in both.

### Groups for Cooperative Scenarios

Cooperative behavior in ML-Agents can be enabled by instantiating a `SimpleMultiAgentGroup`,
typically in an environment controller or similar script, and adding agents to it
using the `RegisterAgent(Agent agent)` method. Note that all agents added to the same `SimpleMultiAgentGroup`
must have the same behavior name and Behavior Parameters. Using `SimpleMultiAgentGroup` enables the
agents within a group to learn how to work together to achieve a common goal (i.e.,
maximize a group-given reward), even if one or more of the group members are removed
before the episode ends. You can then use this group to add/set rewards, end or interrupt episodes
at a group level using the `AddGroupReward()`, `SetGroupReward()`, `EndGroupEpisode()`, and
`GroupEpisodeInterrupted()` methods. For example:

```csharp
// Create a Multi Agent Group in Start() or Initialize()
m_AgentGroup = new SimpleMultiAgentGroup();

// Register agents in group at the beginning of an episode
foreach (var agent in AgentList)
{
m_AgentGroup.RegisterAgent(agent);
}

// if the team scores a goal
m_AgentGroup.AddGroupReward(rewardForGoal);

// If the goal is reached and the episode is over
m_AgentGroup.EndGroupEpisode();
ResetScene();

// If time ran out and we need to interrupt the episode
m_AgentGroup.GroupEpisodeInterrupted();
ResetScene();
```

Multi Agent Groups should be used with the MA-POCA trainer, which is explicitly designed to train
agents in cooperative environments. This can be enabled by selecting the `poca` trainer - see the
[training configurations](Training-Configuration-File.md) doc for more information on
configuring MA-POCA. When using MA-POCA, agents which are deactivated or removed from the Scene
during the episode will still learn to contribute to the group's long-term rewards, even
if they are no longer active in the Scene to experience them.

**NOTE**: Groups differ from Teams (for competitive settings) in the following way: Agents
working together should be added to the same Group, while agents playing against each other
should be given different Team Ids. If the Scene contains one playing field and two teams,
there should be two Groups, one for each team, and each team should be assigned a different
Team Id. If this playing field is duplicated many times in the Scene (e.g. for training
speedup), there should be two Groups _per playing field_, but still only two unique Team Ids
_for the entire Scene_. In environments with both Groups and Team Ids configured, MA-POCA and
self-play can be used together for training. In the diagram below, there are two agents on each team,
and two playing fields where teams are pitted against each other. All the blue agents should share one Team Id
(and the orange ones a different Team Id), and there should be four Group Managers, one per pair of agents.
A code sketch of this setup appears after the diagram.

<p align="center">
<img src="images/groupmanager_teamid.png"
alt="Group Manager vs Team Id"
width="650" border="10" />
</p>
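
A minimal sketch of the setup described above, for one duplicated playing field. The
`PlayingFieldController` class and its agent lists are hypothetical and only illustrate the idea;
`SimpleMultiAgentGroup` and `BehaviorParameters.TeamId` (which can also be set in the Inspector)
are the ML-Agents pieces involved:

```csharp
using System.Collections.Generic;
using Unity.MLAgents;
using Unity.MLAgents.Policies;
using UnityEngine;

// Hypothetical controller attached to each duplicated playing field.
// Team Ids are shared across the whole Scene; Groups are created per playing field.
public class PlayingFieldController : MonoBehaviour
{
    const int k_BlueTeamId = 0;    // the same Id for every blue agent in the Scene
    const int k_OrangeTeamId = 1;  // the same Id for every orange agent in the Scene

    [SerializeField] List<Agent> m_BlueAgents;
    [SerializeField] List<Agent> m_OrangeAgents;

    SimpleMultiAgentGroup m_BlueGroup;
    SimpleMultiAgentGroup m_OrangeGroup;

    void Start()
    {
        // One Group per team *per playing field*.
        m_BlueGroup = new SimpleMultiAgentGroup();
        m_OrangeGroup = new SimpleMultiAgentGroup();

        foreach (var agent in m_BlueAgents)
        {
            agent.GetComponent<BehaviorParameters>().TeamId = k_BlueTeamId;
            m_BlueGroup.RegisterAgent(agent);
        }
        foreach (var agent in m_OrangeAgents)
        {
            agent.GetComponent<BehaviorParameters>().TeamId = k_OrangeTeamId;
            m_OrangeGroup.RegisterAgent(agent);
        }
    }
}
```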

#### Cooperative Behaviors Notes and Best Practices
* An agent can only be registered to one MultiAgentGroup at a time. If you want to re-assign an
agent from one group to another, you have to unregister it from the current group first.

* Agents with different behavior names in the same group are not supported.
> **Review comment (Contributor):** This is the first time this is mentioned (I think). This section is a summary, so it should be called out earlier.

* Agents within groups should always set the `Max Steps` parameter in the Agent script to 0.
Instead, handle Max Steps using the MultiAgentGroup by ending the episode for the entire
Group using `GroupEpisodeInterrupted()`.

* `EndGroupEpisode` and `GroupEpisodeInterrupted` do the same job in the game, but have
slightly different effects on training. If the episode is completed, you would want to call
`EndGroupEpisode`. But if the episode is not over and has simply been running for enough steps (i.e.,
it has reached its max step), you would call `GroupEpisodeInterrupted`.
> **Review comment on lines +1000 to +1003 (Contributor):** I think this is not specific to GroupTraining and should be called out in a more general documentation page.
> **Reply (Contributor):** I guess we never explicitly called this out since we handle all the max_step stuff for single agent so users don't need to know about this. The only place that used Agent.EpisodeInterrupted is in Match3 where different agents will make moves at different frequencies. Maybe we can add a section about manually requesting decision and manually handling max_step_reached (in separate PR)?

* If an agent finishes early, e.g. it completes its task, is removed, or is killed in the game, do not call
`EndEpisode()` on the Agent. Instead, disable the agent and re-enable it when the next episode starts,
or destroy the agent entirely. This is because calling `EndEpisode()` will call `OnEpisodeBegin()`, which
will reset the agent immediately. While it is possible to call `EndEpisode()` in this way, it is usually not the
desired behavior when training groups of agents; see the sketch after this list for the recommended pattern.

* If an agent that was disabled in a scene needs to be re-enabled, it must be re-registered to the MultiAgentGroup.

* Group rewards are meant to encourage agents to act in the group's best interest rather than
their individual one, and are treated differently from individual agent rewards during
training. Calling `AddGroupReward()` is therefore not equivalent to calling `Agent.AddReward()` on each agent
in the group.

* You can still add incremental rewards to agents using `Agent.AddReward()` if they are
in a Group. These rewards will only be given to those agents and are received when the
Agent is active.

* Environments which use Multi Agent Groups can be trained using PPO or SAC, but agents will
not be able to learn from group rewards after deactivation/removal, nor will they behave as cooperatively.
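
As an illustration of the notes above about removing and re-enabling agents, here is a minimal
sketch. The `GroupRewardArea` controller, its callbacks, and the respawn flow are hypothetical;
only `SimpleMultiAgentGroup`, `RegisterAgent()`, `AddGroupReward()`, and `EndGroupEpisode()` come
from the ML-Agents API:

```csharp
using System.Collections.Generic;
using Unity.MLAgents;
using UnityEngine;

// Hypothetical scene controller: a "killed" agent is disabled rather than ended with
// EndEpisode(), and is re-registered to the group when it is re-enabled.
public class GroupRewardArea : MonoBehaviour
{
    [SerializeField] List<Agent> m_Agents;   // assigned in the Inspector
    SimpleMultiAgentGroup m_AgentGroup;

    void Start()
    {
        m_AgentGroup = new SimpleMultiAgentGroup();
        foreach (var agent in m_Agents)
        {
            m_AgentGroup.RegisterAgent(agent);
        }
    }

    // Called by game logic when one agent is "killed" mid-episode.
    public void OnAgentKilled(Agent agent)
    {
        // Do NOT call agent.EndEpisode() here; just deactivate the GameObject.
        // With MA-POCA the agent still gets credit for the group's later rewards.
        agent.gameObject.SetActive(false);
    }

    // Called by game logic when the agent is re-enabled for a new episode.
    public void OnAgentRespawned(Agent agent)
    {
        agent.gameObject.SetActive(true);
        // A disabled agent is unregistered from its group, so register it again.
        m_AgentGroup.RegisterAgent(agent);
    }

    // Called by game logic when the group reaches its goal.
    public void OnGoalReached(float rewardForGoal)
    {
        m_AgentGroup.AddGroupReward(rewardForGoal);
        m_AgentGroup.EndGroupEpisode();
    }
}
```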

## Recording Demonstrations

In order to record demonstrations from an agent, add the
33 changes: 32 additions & 1 deletion docs/ML-Agents-Overview.md
@@ -553,7 +553,7 @@ In addition to the three environment-agnostic training methods introduced in the
previous section, the ML-Agents Toolkit provides additional methods that can aid
in training behaviors for specific types of environments.

### Training in Multi-Agent Environments with Self-Play
### Training in Competitive Multi-Agent Environments with Self-Play

ML-Agents provides the functionality to train both symmetric and asymmetric
adversarial games with
@@ -588,6 +588,37 @@ our
[blog post on self-play](https://blogs.unity3d.com/2020/02/28/training-intelligent-adversaries-using-self-play-with-ml-agents/)
for additional information.

### Training In Cooperative Multi-Agent Environments with MA-POCA

![PushBlock with Agents Working Together](images/cooperative_pushblock.png)

ML-Agents provides the functionality for training cooperative behaviors - i.e.,
groups of agents working towards a common goal, where the success of the individual
is linked to the success of the whole group. In such a scenario, agents typically receive
rewards as a group. For instance, if a team of agents wins a game against an opposing
team, everyone is rewarded - even agents who did not directly contribute to the win. This
makes learning what to do as an individual difficult - you may get a win
for doing nothing, and a loss for doing your best.

In ML-Agents, we provide MA-POCA (MultiAgent POsthumous Credit Assignment), which
is a novel multi-agent trainer that trains a _centralized critic_, a neural network
that acts as a "coach" for a whole group of agents. You can then give rewards to the team
as a whole, and the agents will learn how best to contribute to achieving that reward.
Agents can _also_ be given rewards individually, and the team will work together to help the
individual achieve those goals. During an episode, agents can be added to or removed from the group,
such as when agents spawn or die in a game. If agents are removed mid-episode (e.g., if teammates die
or are removed from the game), they will still learn whether their actions contributed
to the team winning later, enabling agents to take group-beneficial actions even if
they result in the individual being removed from the game (i.e., self-sacrifice).
MA-POCA can also be combined with self-play to train teams of agents to play against each other.

> **Review comment (Contributor):** Should we say "paper coming soon" or something?
> **Reply (Contributor):** I think it is fine to not say anything. Although I am worried someone will coin the name.

To learn more about enabling cooperative behaviors for agents in an ML-Agents environment,
check out [this page](Learning-Environment-Design-Agents.md#groups-for-cooperative-scenarios).

For further reading, MA-POCA builds on previous work in multi-agent cooperative learning
([Lowe et al.](https://arxiv.org/abs/1706.02275), [Foerster et al.](https://arxiv.org/pdf/1705.08926.pdf),
among others) to enable the above use-cases.

### Solving Complex Tasks using Curriculum Learning

Curriculum learning is a way of training a machine learning model where more
10 changes: 8 additions & 2 deletions docs/Training-Configuration-File.md
@@ -21,13 +21,13 @@
## Common Trainer Configurations

One of the first decisions you need to make regarding your training run is which
trainer to use: PPO or SAC. There are some training configurations that are
trainer to use: PPO, SAC, or POCA. There are some training configurations that are
common to both trainers (which we review now) and others that depend on the
choice of the trainer (which we review on subsequent sections).

| **Setting** | **Description** |
| :----------------------- | :----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
| `trainer_type` | (default = `ppo`) The type of trainer to use: `ppo` or `sac` |
| `trainer_type` | (default = `ppo`) The type of trainer to use: `ppo`, `sac`, or `poca`. |
| `summary_freq` | (default = `50000`) Number of experiences that needs to be collected before generating and displaying training statistics. This determines the granularity of the graphs in Tensorboard. |
| `time_horizon` | (default = `64`) How many steps of experience to collect per-agent before adding it to the experience buffer. When this limit is reached before the end of an episode, a value estimate is used to predict the overall expected reward from the agent's current state. As such, this parameter trades off between a less biased, but higher variance estimate (long time horizon) and more biased, but less varied estimate (short time horizon). In cases where there are frequent rewards within an episode, or episodes are prohibitively large, a smaller number can be more ideal. This number should be large enough to capture all the important behavior within a sequence of an agent's actions. <br><br> Typical range: `32` - `2048` |
| `max_steps` | (default = `500000`) Total number of steps (i.e., observation collected and action taken) that must be taken in the environment (or across all environments if using multiple in parallel) before ending the training process. If you have multiple agents with the same behavior name within your environment, all steps taken by those agents will contribute to the same `max_steps` count. <br><br>Typical range: `5e5` - `1e7` |
@@ -72,6 +72,12 @@ the `trainer` setting above).
| `hyperparameters -> steps_per_update` | (default = `1`) Average ratio of agent steps (actions) taken to updates made of the agent's policy. In SAC, a single "update" corresponds to grabbing a batch of size `batch_size` from the experience replay buffer, and using this mini batch to update the models. Note that it is not guaranteed that after exactly `steps_per_update` steps an update will be made, only that the ratio will hold true over many steps. Typically, `steps_per_update` should be greater than or equal to 1. Note that setting `steps_per_update` lower will improve sample efficiency (reduce the number of steps required to train) but increase the CPU time spent performing updates. For most environments where steps are fairly fast (e.g. our example environments) `steps_per_update` equal to the number of agents in the scene is a good balance. For slow environments (steps take 0.1 seconds or more) reducing `steps_per_update` may improve training speed. We can also change `steps_per_update` to lower than 1 to update more often than once per step, though this will usually result in a slowdown unless the environment is very slow. <br><br>Typical range: `1` - `20` |
| `hyperparameters -> reward_signal_num_update` | (default = `steps_per_update`) Number of steps per mini batch sampled and used for updating the reward signals. By default, we update the reward signals once every time the main policy is updated. However, to imitate the training procedure in certain imitation learning papers (e.g. [Kostrikov et. al](http://arxiv.org/abs/1809.02925), [Blondé et. al](http://arxiv.org/abs/1809.02064)), we may want to update the reward signal (GAIL) M times for every update of the policy. We can change `steps_per_update` of SAC to N, as well as `reward_signal_steps_per_update` under `reward_signals` to N / M to accomplish this. By default, `reward_signal_steps_per_update` is set to `steps_per_update`. |

### MA-POCA-specific Configurations
MA-POCA uses the same configurations as PPO, and there are no additional POCA-specific parameters.

**NOTE**: Reward signals other than Extrinsic Rewards have not been extensively tested with MA-POCA,
though they can still be added and used for training on a your-mileage-may-vary basis.
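
For reference, a minimal sketch of a `poca` trainer configuration. The behavior name
`PushBlockCollab` and the hyperparameter values below are illustrative, not prescribed defaults;
the available fields are the same as for PPO:

```yaml
behaviors:
  PushBlockCollab:            # hypothetical behavior name
    trainer_type: poca        # select the MA-POCA trainer
    hyperparameters:          # same hyperparameters as PPO
      batch_size: 1024
      buffer_size: 10240
      learning_rate: 3.0e-4
      beta: 5.0e-3
      epsilon: 0.2
      lambd: 0.95
      num_epoch: 3
    network_settings:
      hidden_units: 256
      num_layers: 2
    reward_signals:
      extrinsic:              # group rewards are folded into the extrinsic signal
        gamma: 0.99
        strength: 1.0
    max_steps: 2.0e6
    time_horizon: 64
    summary_freq: 50000
```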

## Reward Signals

The `reward_signals` section enables the specification of settings for both
Binary file added docs/images/cooperative_pushblock.png
Binary file added docs/images/groupmanager_teamid.png