
Split Policy and Optimizer, common Policy for PPO and SAC #3345


Merged
98 commits merged into master from develop-splitpolicyoptimizer on Feb 25, 2020

Conversation

@ervteng (Contributor) commented Feb 3, 2020

In an effort to make the Trainer codebase more modular, we're moving algorithm-specific code into a new class (the Optimizer), while the Policy simply maps observations to actions. The Optimizer is passed a Policy and constructs needed networks (e.g. value network) around the Policy. The Trainer then calls update on the Optimizer, not the policy.
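
A rough sketch of the resulting shape, in simplified pseudo-Python (class and method names here are illustrative, not the exact ML-Agents API):

```python
class Policy:
    """Maps observations to actions; holds no algorithm-specific logic."""

    def evaluate(self, observations):
        ...  # run the policy network and return actions


class Optimizer:
    """Wraps a Policy and owns the algorithm-specific pieces (e.g. value networks)."""

    def __init__(self, policy, trainer_params):
        self.policy = policy
        # value/critic networks would be constructed around the policy graph here

    def update(self, batch, num_sequences):
        ...  # compute losses and apply gradients


class Trainer:
    """Collects trajectories and calls update on the Optimizer, not the Policy."""

    def __init__(self, optimizer):
        self.optimizer = optimizer

    def _update_policy(self, batch, num_sequences):
        return self.optimizer.update(batch, num_sequences)
```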

There are a small number of gotchas. For one, we're no longer saving the memories from the value network, so they're zeroed out at the beginning of each trajectory. We added a burn-in of 10% of the sequence length to combat any negative effects here (see the sketch below).
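
A minimal sketch of the burn-in, assuming it is applied as a loss mask over each sequence (the helper name is hypothetical):

```python
import numpy as np


def burn_in_mask(sequence_length: int, burn_in_fraction: float = 0.1) -> np.ndarray:
    """Hypothetical helper: zero the loss weight of the first fraction of each
    sequence, since the recurrent memories now start from zeros."""
    burn_in_steps = int(burn_in_fraction * sequence_length)
    mask = np.ones(sequence_length, dtype=np.float32)
    mask[:burn_in_steps] = 0.0
    return mask


# e.g. with a sequence length of 64, the first 6 steps are masked out of the loss
print(burn_in_mask(64)[:8])  # [0. 0. 0. 0. 0. 0. 1. 1.]
```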

Furthermore, multi-GPU support has been temporarily removed. There will be a MultiGPUPPOOptimizer class that will take care of this functionality.

Things not yet addressed in this PR that will be in following PRs:

  • Moving all of the common trainers, policies, and models into the common folder
  • Defining proper input/output specs for the Policy
  • Exposing policy attributes (e.g. policy.use_recurrent) as public properties and defining a TFPolicy interface
  • Re-introducing multi-GPU support

Functionality is currently being tested in the cloud, but thoughts on the code structure would be much appreciated.

@ervteng (Author) commented Feb 20, 2020

Ran cloud training on this branch and as far as I can tell there are no regressions on the example environments.

- used to store the hidden state of the recurrent neural network. This value must
- be a multiple of 4, and should scale with the amount of information you expect
+ used to store the hidden state of the recurrent neural network in the policy.
+ This value must be divisible by 2, and should scale with the amount of information you expect

Contributor: This value must be a multiple of 2 for consistency with PPO

Author (ervteng): Changed. Also renamed LearningModel to ModelUtils and made the optimizer methods (and SACNetwork methods) private.
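
For context, a minimal sketch of where the even-size requirement comes from, assuming the flat memory vector is split into the two halves of the recurrent (LSTM) state, which is how the TF models consume it:

```python
import numpy as np


def split_memory(memory_in: np.ndarray, m_size: int):
    """Illustrative only: interpret the flat memory vector as the two halves of
    the recurrent state, which is why memory_size must be divisible by 2."""
    assert m_size % 2 == 0, "memory_size must be divisible by 2"
    half = m_size // 2
    return memory_in[:, :half], memory_in[:, half:]


# e.g. memory_size = 128 splits into two 64-wide halves for a batch of 1
hidden, cell = split_memory(np.zeros((1, 128), dtype=np.float32), 128)
```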

@@ -30,8 +30,6 @@ class LearningRateSchedule(Enum):


class LearningModel:

Contributor: What was decided here?

self.policy.inference_dict["learning_rate"] = self.learning_rate
self.policy.initialize_or_load()

def create_cc_critic(

Contributor: Is it private?

@@ -151,7 +151,6 @@ environment, you can set the following command line options when invoking
[here](https://docs.unity3d.com/Manual/CommandLineArguments.html) for more
details.
* `--debug`: Specify this option to enable debug-level logging for some parts of the code.
* `--multi-gpu`: Setting this flag enables the use of multiple GPU's (if available) during training.

Contributor: Should this be added to Migrating.md?

Author (ervteng): Good call

Author (ervteng): Hopefully it will be re-added before the next release (1.0).

Author (ervteng): Added to the changelog. I think the plan is to make the base optimizer(s) do multi-GPU without a separate flag.

shape=[None, self.act_size[0]], dtype=tf.float32, name="action_holder"
)

def _create_dc_actor(

Contributor: There's a bit of repeated code between this function and _create_cc_actor. Is the need for the two because we concatenate the previous action? Maybe we could put the repeated code in the same place.

Author (ervteng): Moved some of the repeated code. Per offline discussion, further consolidation is reserved for a future PR, since it would require removing the previous actions.
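
A rough illustration of the kind of factoring discussed, using invented stand-in functions rather than the actual model code:

```python
import numpy as np


def shared_actor_body(obs: np.ndarray) -> np.ndarray:
    """Stand-in for the observation encoding that both actors share."""
    return np.tanh(obs)


def create_cc_actor(obs: np.ndarray) -> np.ndarray:
    hidden = shared_actor_body(obs)
    return hidden  # a continuous-action head would be built on top of this


def create_dc_actor(obs: np.ndarray, prev_action: np.ndarray) -> np.ndarray:
    # the discrete actor additionally concatenates the previous action
    hidden = np.concatenate([shared_actor_body(obs), prev_action], axis=-1)
    return hidden  # a discrete-action head would be built on top of this
```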

class TFOptimizer(Optimizer): # pylint: disable=W0223
def __init__(self, policy: TFPolicy, trainer_params: Dict[str, Any]):
self.sess = policy.sess
self.policy = policy

Contributor: Slightly concerned about making our TFOptimizer base single-policy-centric. We might need to build another class on top of Optimizer for multi-agent training.

Author (ervteng): The conclusion is to insert new single-agent and multi-agent classes between TFOptimizer and e.g. PPOOptimizer. This will be done in a future PR.
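
A sketch of the hierarchy being proposed (class names below are hypothetical and not part of this PR):

```python
class TFOptimizer:
    """Base class: TensorFlow session/graph plumbing shared by all optimizers."""


class SingleAgentTFOptimizer(TFOptimizer):
    """Hypothetical intermediate layer that owns exactly one policy."""

    def __init__(self, policy):
        self.policy = policy


class MultiAgentTFOptimizer(TFOptimizer):
    """Hypothetical intermediate layer that owns several policies, e.g. keyed by agent."""

    def __init__(self, policies):
        self.policies = policies


class PPOOptimizer(SingleAgentTFOptimizer):
    """Algorithm-specific optimizers would then build on the single-agent layer."""
```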

ervteng merged commit 4058e95 into master on Feb 25, 2020, and the develop-splitpolicyoptimizer branch was subsequently deleted.