
Commit 72624f0

Author: Marwan Mattar (committed)
Squashed commit of the following:
commit 3fed09d
Author: Ervin T <ervin@unity3d.com>
Date:   Mon Apr 20 13:21:28 2020 -0700

    [bug-fix] Increase buffer size for SAC tests (#3813)

commit 99ed28e
Author: Ervin T <ervin@unity3d.com>
Date:   Mon Apr 20 13:06:39 2020 -0700

    [refactor] Run Trainers in separate threads (#3690)

commit 52b7d2e
Author: Chris Elion <chris.elion@unity3d.com>
Date:   Mon Apr 20 12:20:45 2020 -0700

    update upm-ci-utils source (#3811)

commit 89e4804
Author: Vincent-Pierre BERGES <vincentpierre@unity3d.com>
Date:   Mon Apr 20 12:06:59 2020 -0700

    Removing done from the llapi doc (#3810)
1 parent 1c85140 commit 72624f0

23 files changed: 265 additions & 126 deletions

.yamato/com.unity.ml-agents-pack.yml

Lines changed: 1 addition & 1 deletion
@@ -5,7 +5,7 @@ pack:
     image: package-ci/ubuntu:stable
     flavor: b1.large
   commands:
-    - npm install upm-ci-utils@stable -g --registry https://api.bintray.com/npm/unity/unity-npm
+    - npm install upm-ci-utils@stable -g --registry https://artifactory.prd.cds.internal.unity3d.com/artifactory/api/npm/upm-npm
     - upm-ci package pack --package-path com.unity.ml-agents
   artifacts:
     packages:

.yamato/com.unity.ml-agents-test.yml

Lines changed: 1 addition & 1 deletion
@@ -33,7 +33,7 @@ test_{{ platform.name }}_{{ editor.version }}:
     image: {{ platform.image }}
     flavor: {{ platform.flavor}}
   commands:
-    - npm install upm-ci-utils@stable -g --registry https://api.bintray.com/npm/unity/unity-npm
+    - npm install upm-ci-utils@stable -g --registry https://artifactory.prd.cds.internal.unity3d.com/artifactory/api/npm/upm-npm
     - upm-ci package test -u {{ editor.version }} --package-path com.unity.ml-agents {{ editor.coverageOptions }}
     - python ml-agents/tests/yamato/check_coverage_percent.py upm-ci~/test-results/ {{ editor.minCoveragePct }}
   artifacts:

com.unity.ml-agents/CHANGELOG.md

Lines changed: 2 additions & 0 deletions
@@ -64,6 +64,8 @@ and this project adheres to
   overwrite the existing files. (#3705)
 - `StackingSensor` was changed from `internal` visibility to `public`
 - Updated Barracuda to 0.6.3-preview.
+- Model updates can now happen asynchronously with environment steps for better performance. (#3690)
+- `num_updates` and `train_interval` for SAC were replaced with `steps_per_update`. (#3690)

 ### Bug Fixes

config/sac_trainer_config.yaml

Lines changed: 10 additions & 11 deletions
@@ -10,8 +10,7 @@ default:
     max_steps: 5.0e5
     memory_size: 128
     normalize: false
-    num_update: 1
-    train_interval: 1
+    steps_per_update: 10
     num_layers: 2
     time_horizon: 64
     sequence_length: 64
@@ -30,11 +29,10 @@ FoodCollector:
     buffer_size: 500000
     max_steps: 2.0e6
     init_entcoef: 0.05
-    train_interval: 1

 Bouncer:
     normalize: true
-    max_steps: 2.0e6
+    max_steps: 1.0e6
     num_layers: 2
     hidden_units: 64
     summary_freq: 20000
@@ -43,7 +41,7 @@ PushBlock:
     max_steps: 2e6
     init_entcoef: 0.05
     hidden_units: 256
-    summary_freq: 60000
+    summary_freq: 100000
     time_horizon: 64
     num_layers: 2
@@ -159,10 +157,10 @@ CrawlerStatic:
     normalize: true
     time_horizon: 1000
     batch_size: 256
-    train_interval: 2
+    steps_per_update: 20
     buffer_size: 500000
     buffer_init_steps: 2000
-    max_steps: 5e6
+    max_steps: 3e6
     summary_freq: 30000
     init_entcoef: 1.0
     num_layers: 3
@@ -178,9 +176,9 @@ CrawlerDynamic:
     batch_size: 256
     buffer_size: 500000
     summary_freq: 30000
-    train_interval: 2
+    steps_per_update: 20
     num_layers: 3
-    max_steps: 1e7
+    max_steps: 5e6
     hidden_units: 512
     reward_signals:
         extrinsic:
@@ -195,7 +193,7 @@ Walker:
     max_steps: 2e7
     summary_freq: 30000
     num_layers: 4
-    train_interval: 2
+    steps_per_update: 30
     hidden_units: 512
     reward_signals:
         extrinsic:
@@ -208,6 +206,7 @@ Reacher:
     batch_size: 128
     buffer_size: 500000
     max_steps: 2e7
+    steps_per_update: 20
     summary_freq: 60000

 Hallway:
@@ -216,7 +215,7 @@ Hallway:
     hidden_units: 128
     memory_size: 128
     init_entcoef: 0.1
-    max_steps: 1.0e7
+    max_steps: 5.0e6
     summary_freq: 10000
     time_horizon: 64
     use_recurrent: true

docs/Migrating.md

Lines changed: 4 additions & 0 deletions
@@ -33,6 +33,8 @@ double-check that the versions are in the same. The versions can be found in
 - The signature of `Agent.Heuristic()` was changed to take a `float[]` as a
   parameter, instead of returning the array. This was done to prevent a common
   source of error where users would return arrays of the wrong size.
+- `num_updates` and `train_interval` for SAC have been replaced with `steps_per_update`.
+

 ### Steps to Migrate

@@ -54,6 +56,8 @@ double-check that the versions are in the same. The versions can be found in
 - If your Agent class overrides `Heuristic()`, change the signature to
   `public override void Heuristic(float[] actionsOut)` and assign values to
   `actionsOut` instead of returning an array.
+- Set `steps_per_update` to be around equal to the number of agents in your environment,
+  times `num_updates` and divided by `train_interval`.

 ## Migrating from 0.14 to 0.15
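
The last migration step amounts to simple arithmetic. A worked example with hypothetical values (not taken from the repository):

```python
# Rule of thumb from the migration note above:
# steps_per_update ~= num_agents * num_updates / train_interval
num_agents = 12      # agents in your environment (hypothetical value)
num_updates = 1      # old SAC `num_update` setting (hypothetical value)
train_interval = 2   # old SAC `train_interval` setting (hypothetical value)

steps_per_update = num_agents * num_updates / train_interval
print(steps_per_update)  # 6.0 -> set `steps_per_update: 6` in the trainer config
```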

docs/Python-API.md

Lines changed: 0 additions & 8 deletions
@@ -149,8 +149,6 @@ A `DecisionSteps` has the following fields :
   `env.step()`).
 - `reward` is a float vector of length batch size. Corresponds to the
   rewards collected by each agent since the last simulation step.
-- `done` is an array of booleans of length batch size. Is true if the
-  associated Agent was terminated during the last simulation step.
 - `agent_id` is an int vector of length batch size containing unique
   identifier for the corresponding Agent. This is used to track Agents
   across simulation steps.
@@ -174,8 +172,6 @@ A `DecisionStep` has the following fields:
   (Each array has one less dimension than the arrays in `DecisionSteps`)
 - `reward` is a float. Corresponds to the rewards collected by the agent
   since the last simulation step.
-- `done` is a bool. Is true if the Agent was terminated during the last
-  simulation step.
 - `agent_id` is an int and an unique identifier for the corresponding Agent.
 - `action_mask` is an optional list of one dimensional array of booleans.
   Only available in multi-discrete action space type.
@@ -197,8 +193,6 @@ A `TerminalSteps` has the following fields :
   `env.step()`).
 - `reward` is a float vector of length batch size. Corresponds to the
   rewards collected by each agent since the last simulation step.
-- `done` is an array of booleans of length batch size. Is true if the
-  associated Agent was terminated during the last simulation step.
 - `agent_id` is an int vector of length batch size containing unique
   identifier for the corresponding Agent. This is used to track Agents
   across simulation steps.
@@ -219,8 +213,6 @@ A `TerminalStep` has the following fields:
   (Each array has one less dimension than the arrays in `TerminalSteps`)
 - `reward` is a float. Corresponds to the rewards collected by the agent
   since the last simulation step.
-- `done` is a bool. Is true if the Agent was terminated during the last
-  simulation step.
 - `agent_id` is an int and an unique identifier for the corresponding Agent.
 - `max_step` is a bool. Is true if the Agent reached its maximum number of
   steps during the last simulation step.
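
With the `done` fields removed by this commit, termination is signalled by which structure an agent appears in. A minimal sketch of reading this through the LLAPI (the environment setup and behavior name below are hypothetical):

```python
from mlagents_envs.environment import UnityEnvironment

env = UnityEnvironment(file_name=None)  # e.g. connect to a running Editor instance
env.reset()
behavior_name = "MyBehavior?team=0"  # hypothetical behavior name

decision_steps, terminal_steps = env.get_steps(behavior_name)

# Agents listed in `terminal_steps` were terminated during the last simulation
# step; agents listed in `decision_steps` are still active and expect an action.
for agent_id in terminal_steps.agent_id:
    print(f"Agent {agent_id} terminated with reward {terminal_steps[agent_id].reward}")
```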

docs/Training-ML-Agents.md

Lines changed: 2 additions & 2 deletions
@@ -158,10 +158,10 @@ Cloning (Imitation), GAIL = Generative Adversarial Imitation Learning
 | tau | How aggressively to update the target network used for bootstrapping value estimation in SAC. | SAC |
 | time_horizon | How many steps of experience to collect per-agent before adding it to the experience buffer. | PPO, SAC |
 | trainer | The type of training to perform: "ppo", "sac", "offline_bc" or "online_bc". | PPO, SAC |
-| train_interval | How often to update the agent. | SAC |
-| num_update | Number of mini-batches to update the agent with during each update. | SAC |
+| steps_per_update | Ratio of agent steps per mini-batch update. | SAC |
 | use_recurrent | Train using a recurrent neural network. See [Using Recurrent Neural Networks](Feature-Memory.md). | PPO, SAC |
 | init_path | Initialize trainer from a previously saved model. | PPO, SAC |
+| threaded | Run the trainer in a parallel thread from the environment steps. (Default: true) | PPO, SAC |

 For specific advice on setting hyperparameters based on the type of training you
 are conducting, see:

docs/Training-PPO.md

Lines changed: 9 additions & 0 deletions
@@ -300,6 +300,15 @@ This option is provided in case you want to initialize different behaviors from
 in most cases, it is sufficient to use the `--initialize-from` CLI parameter to initialize
 all models from the same run.

+### (Optional) Advanced: Disable Threading
+
+By default, PPO model updates can happen while the environment is being stepped. This violates the
+[on-policy](https://spinningup.openai.com/en/latest/user/algorithms.html#the-on-policy-algorithms)
+assumption of PPO slightly in exchange for a 10-20% training speedup. To maintain the
+strict on-policyness of PPO, you can disable parallel updates by setting `threaded` to `false`.
+
+Default Value: `true`
+
 ## Training Statistics

 To view training statistics, use TensorBoard. For information on launching and
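
As a rough mental model of the `threaded` behaviour described in the new section, here is a toy sketch (not the ml-agents implementation; the timings are arbitrary and only demonstrate the overlap, not the quoted 10-20% figure):

```python
import threading
import time

def update():            # stand-in for one gradient update
    time.sleep(0.01)

def step_environment():  # stand-in for stepping the environment
    time.sleep(0.01)

def run(threaded: bool, env_steps: int = 20) -> float:
    """With threaded=True, updates overlap environment steps; with
    threaded=False, every update happens strictly between steps."""
    stop = threading.Event()

    def update_loop():
        while not stop.is_set():
            update()

    start = time.time()
    worker = None
    if threaded:
        worker = threading.Thread(target=update_loop)
        worker.start()
    for _ in range(env_steps):
        step_environment()
        if not threaded:
            update()
    if worker is not None:
        stop.set()
        worker.join()
    return time.time() - start

print(f"threaded: {run(True):.2f}s  serial: {run(False):.2f}s")
```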

docs/Training-SAC.md

Lines changed: 22 additions & 18 deletions
@@ -40,19 +40,18 @@ ML-Agents provides two reward signals by default, the Extrinsic (environment) re
 Curiosity reward, which can be used to encourage exploration in sparse extrinsic reward
 environments.

-#### Number of Updates for Reward Signal (Optional)
+#### Steps Per Update for Reward Signal (Optional)

-`reward_signal_num_update` for the reward signals corresponds to the number of mini batches sampled
-and used for updating the reward signals during each
-update. By default, we update the reward signals once every time the main policy is updated.
+`reward_signal_steps_per_update` for the reward signals corresponds to the number of steps per mini batch sampled
+and used for updating the reward signals. By default, we update the reward signals once every time the main policy is updated.
 However, to imitate the training procedure in certain imitation learning papers (e.g.
 [Kostrikov et. al](http://arxiv.org/abs/1809.02925), [Blondé et. al](http://arxiv.org/abs/1809.02064)),
-we may want to update the policy N times, then update the reward signal (GAIL) M times.
-We can change `train_interval` and `num_update` of SAC to N, as well as `reward_signal_num_update`
-under `reward_signals` to M to accomplish this. By default, `reward_signal_num_update` is set to
-`num_update`.
+we may want to update the reward signal (GAIL) M times for every update of the policy.
+We can change `steps_per_update` of SAC to N, as well as `reward_signal_steps_per_update`
+under `reward_signals` to N / M to accomplish this. By default, `reward_signal_steps_per_update` is set to
+`steps_per_update`.

-Typical Range: `num_update`
+Typical Range: `steps_per_update`

 ### Buffer Size

@@ -106,17 +105,22 @@ there may not be any new interesting information between steps, and `train_inter

 Typical Range: `1` - `5`

-### Number of Updates
+### Steps Per Update

-`num_update` corresponds to the number of mini batches sampled and used for training during each
-training event. In SAC, a single "update" corresponds to grabbing a batch of size `batch_size` from the experience
-replay buffer, and using this mini batch to update the models. Typically, this can be left at 1.
-However, to imitate the training procedure in certain papers (e.g.
-[Kostrikov et. al](http://arxiv.org/abs/1809.02925), [Blondé et. al](http://arxiv.org/abs/1809.02064)),
-we may want to update N times with different mini batches before grabbing additional samples.
-We can change `train_interval` and `num_update` to N to accomplish this.
+`steps_per_update` corresponds to the average ratio of agent steps (actions) taken to updates made of the agent's
+policy. In SAC, a single "update" corresponds to grabbing a batch of size `batch_size` from the experience
+replay buffer, and using this mini batch to update the models. Note that it is not guaranteed that after
+exactly `steps_per_update` steps an update will be made, only that the ratio will hold true over many steps.
+
+Typically, `steps_per_update` should be greater than or equal to 1. Note that setting `steps_per_update` lower will
+improve sample efficiency (reduce the number of steps required to train)
+but increase the CPU time spent performing updates. For most environments where steps are fairly fast (e.g. our example
+environments) `steps_per_update` equal to the number of agents in the scene is a good balance.
+For slow environments (steps take 0.1 seconds or more) reducing `steps_per_update` may improve training speed.
+We can also change `steps_per_update` to lower than 1 to update more often than once per step, though this will
+usually result in a slowdown unless the environment is very slow.

-Typical Range: `1`
+Typical Range: `1` - `20`

 ### Tau
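
A minimal sketch of how a steps-per-update ratio can be enforced on average (an illustration of the scheduling idea described above, not the ml-agents trainer code):

```python
# An update fires whenever the update count falls behind
# steps_taken / steps_per_update, so the ratio holds over many steps even
# though no single step is guaranteed to trigger an update.
steps_per_update = 10
steps_taken = 0
updates_made = 0

for _ in range(1000):                 # stand-in for agent steps (actions)
    steps_taken += 1
    while updates_made < steps_taken / steps_per_update:
        updates_made += 1             # stand-in for one mini-batch update

print(steps_taken, updates_made)      # 1000 100 -> 10 steps per update on average
```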

ml-agents/mlagents/trainers/agent_processor.py

Lines changed: 41 additions & 11 deletions
@@ -1,6 +1,7 @@
 import sys
-from typing import List, Dict, Deque, TypeVar, Generic, Tuple, Any, Union
-from collections import defaultdict, Counter, deque
+from typing import List, Dict, TypeVar, Generic, Tuple, Any, Union
+from collections import defaultdict, Counter
+import queue

 from mlagents_envs.base_env import (
     DecisionSteps,
@@ -229,26 +230,53 @@ class Empty(Exception):

         pass

-    def __init__(self, behavior_id: str, maxlen: int = 1000):
+    def __init__(self, behavior_id: str, maxlen: int = 20):
         """
         Initializes an AgentManagerQueue. Note that we can give it a behavior_id so that it can be identified
         separately from an AgentManager.
         """
-        self.maxlen: int = maxlen
-        self.queue: Deque[T] = deque(maxlen=self.maxlen)
-        self.behavior_id = behavior_id
+        self._maxlen: int = maxlen
+        self._queue: queue.Queue = queue.Queue(maxsize=maxlen)
+        self._behavior_id = behavior_id
+
+    @property
+    def maxlen(self):
+        """
+        The maximum length of the queue.
+        :return: Maximum length of the queue.
+        """
+        return self._maxlen
+
+    @property
+    def behavior_id(self):
+        """
+        The Behavior ID of this queue.
+        :return: Behavior ID associated with the queue.
+        """
+        return self._behavior_id
+
+    def qsize(self) -> int:
+        """
+        Returns the approximate size of the queue. Note that values may differ
+        depending on the underlying queue implementation.
+        """
+        return self._queue.qsize()

     def empty(self) -> bool:
-        return len(self.queue) == 0
+        return self._queue.empty()

     def get_nowait(self) -> T:
+        """
+        Gets the next item from the queue, throwing an AgentManagerQueue.Empty exception
+        if the queue is empty.
+        """
         try:
-            return self.queue.popleft()
-        except IndexError:
+            return self._queue.get_nowait()
+        except queue.Empty:
             raise self.Empty("The AgentManagerQueue is empty.")

     def put(self, item: T) -> None:
-        self.queue.append(item)
+        self._queue.put(item)


 class AgentManager(AgentProcessor):
@@ -268,8 +296,10 @@ def __init__(
         self.trajectory_queue: AgentManagerQueue[Trajectory] = AgentManagerQueue(
             self.behavior_id
         )
+        # NOTE: we make policy queues of infinite length to avoid lockups of the trainers.
+        # In the environment manager, we make sure to empty the policy queue before continuing to produce steps.
         self.policy_queue: AgentManagerQueue[Policy] = AgentManagerQueue(
-            self.behavior_id
+            self.behavior_id, maxlen=0
         )
         self.publish_trajectory_queue(self.trajectory_queue)
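
The refactor above keeps the queue's public surface (`put`, `get_nowait`, `qsize`, `empty`, `behavior_id`, the `Empty` exception) while swapping the backing store from a `deque` to a thread-safe `queue.Queue`. A small usage sketch with toy items in place of real `Trajectory` objects:

```python
from mlagents.trainers.agent_processor import AgentManagerQueue

# Producers call put(); a consumer drains with get_nowait() until
# AgentManagerQueue.Empty is raised.
q = AgentManagerQueue(behavior_id="MyBehavior", maxlen=20)
q.put("trajectory-1")
q.put("trajectory-2")

print(q.qsize())  # 2 (approximate size, per the docstring in the diff)
while True:
    try:
        print(q.get_nowait())
    except AgentManagerQueue.Empty:
        break  # queue drained; the trainer treats this as "no new data yet"
```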

ml-agents/mlagents/trainers/env_manager.py

Lines changed: 8 additions & 4 deletions
@@ -88,13 +88,17 @@ def advance(self):
         if self.first_step_infos is not None:
             self._process_step_infos(self.first_step_infos)
             self.first_step_infos = None
-        # Get new policies if found
+        # Get new policies if found. Always get the latest policy.
         for brain_name in self.external_brains:
+            _policy = None
             try:
-                _policy = self.agent_managers[brain_name].policy_queue.get_nowait()
-                self.set_policy(brain_name, _policy)
+                # We make sure to empty the policy queue before continuing to produce steps.
+                # This halts the trainers until the policy queue is empty.
+                while True:
+                    _policy = self.agent_managers[brain_name].policy_queue.get_nowait()
             except AgentManagerQueue.Empty:
-                pass
+                if _policy is not None:
+                    self.set_policy(brain_name, _policy)
         # Step the environment
         new_step_infos = self._step()
         # Add to AgentProcessor

ml-agents/mlagents/trainers/ghost/trainer.py

Lines changed: 2 additions & 2 deletions
@@ -223,7 +223,7 @@ def advance(self) -> None:
             # We grab at most the maximum length of the queue.
             # This ensures that even if the queue is being filled faster than it is
             # being emptied, the trajectories in the queue are on-policy.
-            for _ in range(trajectory_queue.maxlen):
+            for _ in range(trajectory_queue.qsize()):
                 t = trajectory_queue.get_nowait()
                 # adds to wrapped trainers queue
                 internal_trajectory_queue.put(t)
@@ -233,7 +233,7 @@ def advance(self) -> None:
         else:
             # Dump trajectories from non-learning policy
             try:
-                for _ in range(trajectory_queue.maxlen):
+                for _ in range(trajectory_queue.qsize()):
                     t = trajectory_queue.get_nowait()
                     # count ghost steps
                     self.ghost_step += len(t.steps)

ml-agents/mlagents/trainers/ppo/trainer.py

Lines changed: 1 addition & 0 deletions
@@ -219,6 +219,7 @@ def _update_policy(self):
         for stat, val in update_stats.items():
             self._stats_reporter.add_stat(stat, val)
         self._clear_update_buffer()
+        return True

     def create_policy(
         self, parsed_behavior_id: BehaviorIdentifiers, brain_parameters: BrainParameters
