Split Policy and Optimizer, common Policy for PPO and SAC #3345
Conversation
…splitpolicyoptimizer
Ran cloud training on this branch and as far as I can tell there are no regressions on the example environments.
docs/Training-SAC.md (Outdated)
- used to store the hidden state of the recurrent neural network. This value must
- be a multiple of 4, and should scale with the amount of information you expect
+ used to store the hidden state of the recurrent neural network in the policy.
+ This value must be divisible by 2, and should scale with the amount of information you expect
Suggest "This value must be a multiple of 2", for consistency with PPO.
Changed. Also renamed LearningModel to ModelUtils and made the optimizer methods (and the SACNetwork methods) private.
@@ -30,8 +30,6 @@ class LearningRateSchedule(Enum):

class LearningModel:
What was decided here?
        self.policy.inference_dict["learning_rate"] = self.learning_rate
        self.policy.initialize_or_load()

    def create_cc_critic(
Is it private?
@@ -151,7 +151,6 @@ environment, you can set the following command line options when invoking
 [here](https://docs.unity3d.com/Manual/CommandLineArguments.html) for more
 details.
 * `--debug`: Specify this option to enable debug-level logging for some parts of the code.
-* `--multi-gpu`: Setting this flag enables the use of multiple GPU's (if available) during training.
Should this be added to Migrating.md?
Good call
Hopefully it will be re-added before the next release (1.0)
Added to the changelog. I think the plan is to make the base optimizer(s) do multi-GPU without a separate flag.
            shape=[None, self.act_size[0]], dtype=tf.float32, name="action_holder"
        )

    def _create_dc_actor(
There's a bit of repeated code between this function and _create_cc_actor. Do we need the two functions because we concatenate the previous action? Maybe we could put the repeated code in one place.
Moved some of the repeated code. Per offline discussion: reserve further consolidation for a future PR, since it would require removing the previous actions.
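For what it's worth, here is a rough sketch of the kind of consolidation discussed above, assuming TF 1.x style layers to match the codebase at the time; the helper name `_create_observation_encoder` and the exact signatures are illustrative, not the actual ml-agents code:

```python
import tensorflow as tf  # assumes a TF 1.x style API


def _create_observation_encoder(vector_in, h_size, num_layers):
    # Shared between the continuous and discrete actors: encode observations
    # into a single hidden representation.
    hidden = vector_in
    for i in range(num_layers):
        hidden = tf.layers.dense(
            hidden, h_size, activation=tf.nn.relu, name="encoder_{}".format(i)
        )
    return hidden


def _create_cc_actor(encoded, act_size):
    # Continuous control: a single head producing one mean per action dimension.
    return tf.layers.dense(encoded, act_size[0], activation=None, name="mu")


def _create_dc_actor(encoded, act_size, prev_action_one_hot=None):
    # Discrete control: optionally concatenate the previous action, which is
    # the part that currently keeps the two code paths separate.
    if prev_action_one_hot is not None:
        encoded = tf.concat([encoded, prev_action_one_hot], axis=1)
    return [
        tf.layers.dense(encoded, size, activation=None, name="branch_{}".format(i))
        for i, size in enumerate(act_size)
    ]
```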
class TFOptimizer(Optimizer):  # pylint: disable=W0223
    def __init__(self, policy: TFPolicy, trainer_params: Dict[str, Any]):
        self.sess = policy.sess
        self.policy = policy
Slightly concerned about making our TFOptimizer base single-policy-centric. We might need to build another class on top of Optimizer for multi-agent training.
Conclusion: insert new single-agent and multi-agent classes between TFOptimizer and the concrete optimizers (e.g. PPOOptimizer). This will be done in a future PR.
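To make the idea concrete, a sketch of such a hierarchy; the SingleAgentTFOptimizer and MultiAgentTFOptimizer names and bodies are hypothetical, since that layer is deferred to a future PR:

```python
from abc import ABC, abstractmethod
from typing import Any, Dict, List


class Optimizer(ABC):
    """Framework-agnostic base: anything that can be updated from a batch."""

    @abstractmethod
    def update(self, batch: Dict[str, Any]) -> Dict[str, float]:
        ...


class TFOptimizer(Optimizer):
    """TensorFlow plumbing (session, feed dicts) without policy-count assumptions."""


class SingleAgentTFOptimizer(TFOptimizer):
    """Hypothetical layer: owns exactly one policy, as PPO and SAC do today."""

    def __init__(self, policy: Any):
        self.policy = policy


class MultiAgentTFOptimizer(TFOptimizer):
    """Hypothetical layer: owns several policies for multi-agent training."""

    def __init__(self, policies: List[Any]):
        self.policies = policies


class PPOOptimizer(SingleAgentTFOptimizer):
    """Algorithm-specific losses and update op, built on the single-agent layer."""

    def update(self, batch: Dict[str, Any]) -> Dict[str, float]:
        return {"policy_loss": 0.0, "value_loss": 0.0}
```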
In an effort to make the Trainer codebase more modular, we're moving algorithm-specific code into a new class (the Optimizer), while the Policy simply maps observations to actions. The Optimizer is given a Policy and constructs the networks it needs (e.g. the value network) around that Policy. The Trainer then calls update on the Optimizer, not on the Policy.
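As a rough illustration of this split, here is a minimal sketch of the intended shape of the code; the class and method names below are simplified stand-ins, not the actual ml-agents classes:

```python
from typing import Any, Dict, List


class Policy:
    """Maps observations to actions; holds no algorithm-specific training code."""

    def get_action(self, observation: List[float]) -> List[float]:
        # Placeholder: a real policy would run its network here.
        return [0.0]


class Optimizer:
    """Owns the algorithm-specific parts: extra networks, losses, update op."""

    def __init__(self, policy: Policy, trainer_params: Dict[str, Any]):
        self.policy = policy
        self.trainer_params = trainer_params
        # Networks such as the value head are constructed here, around the
        # policy, rather than inside the Policy itself.

    def update(self, batch: Dict[str, Any]) -> Dict[str, float]:
        # Compute losses from the batch and the policy, apply gradients,
        # and return training statistics.
        return {"policy_loss": 0.0, "value_loss": 0.0}


class Trainer:
    """Collects experience and drives training through the Optimizer."""

    def __init__(self, optimizer: Optimizer):
        self.optimizer = optimizer
        self.policy = optimizer.policy  # actions still come from the Policy

    def train_step(self, batch: Dict[str, Any]) -> Dict[str, float]:
        # The Trainer calls update on the Optimizer, not on the Policy.
        return self.optimizer.update(batch)
```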
There are a few gotchas. For one, we're no longer saving the memories from the value network, so they're zeroed out at the beginning of each trajectory. A burn-in of 10% of the sequence length was added to combat any negative effects of this.
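To illustrate the burn-in, here is a minimal sketch assuming a per-timestep loss mask; the function name and the mask-based mechanism are illustrative, and only the 10%-of-sequence-length figure comes from the description above:

```python
import numpy as np


def burn_in_mask(sequence_length: int, burn_in_ratio: float = 0.1) -> np.ndarray:
    """Return a 0/1 mask that excludes the first `burn_in_ratio` of each sequence.

    Because value-network memories are now zeroed at the start of each
    trajectory, the earliest steps of a sequence run on a "cold" recurrent
    state; masking their loss contribution softens that effect.
    """
    burn_in_steps = int(sequence_length * burn_in_ratio)
    mask = np.ones(sequence_length, dtype=np.float32)
    mask[:burn_in_steps] = 0.0
    return mask


# Example: with a sequence length of 64, the first 6 steps are dropped from the loss.
print(burn_in_mask(64))
```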
Furthermore, multi-GPU support has been temporarily removed. A MultiGPUPPOOptimizer class will take care of this functionality later.
Things not yet addressed in this PR that will be in following PRs:
* Exposing Policy attributes (e.g. `policy.use_recurrent`) as public properties and defining a TFPolicy interface.

Functionality is currently being tested on cloud, but thoughts on the code structure would be much appreciated.
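On that last point, one possible shape for such an interface; only `use_recurrent` and the TFPolicy name come from the discussion above, while `evaluate` is an example member added here for illustration:

```python
from abc import ABC, abstractmethod
from typing import Any, Dict


class TFPolicy(ABC):
    """Sketch of a TFPolicy interface; not the final design."""

    @property
    @abstractmethod
    def use_recurrent(self) -> bool:
        """Whether this policy carries a recurrent (memory) state."""

    @abstractmethod
    def evaluate(self, batched_observations: Dict[str, Any]) -> Dict[str, Any]:
        """Map a batch of observations to actions and auxiliary outputs."""
```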