Split Policy and Optimizer, common Policy for PPO and SAC #3345
Conversation
…splitpolicyoptimizer
Ran cloud training on this branch and as far as I can tell there are no regressions on the example environments.
docs/Training-SAC.md (Outdated)
- used to store the hidden state of the recurrent neural network. This value must
- be a multiple of 4, and should scale with the amount of information you expect
+ used to store the hidden state of the recurrent neural network in the policy.
+ This value must be divisible by 2, and should scale with the amount of information you expect
Suggest "This value must be a multiple of 2", for consistency with PPO.
Changed. Also renamed LearningModel to ModelUtils and made the optimizer methods (and the SACNetwork methods) private.
@@ -30,8 +30,6 @@ class LearningRateSchedule(Enum):

class LearningModel:
What was decided here?
        self.policy.inference_dict["learning_rate"] = self.learning_rate
        self.policy.initialize_or_load()

    def create_cc_critic(
Is it private?
@@ -151,7 +151,6 @@ environment, you can set the following command line options when invoking
 [here](https://docs.unity3d.com/Manual/CommandLineArguments.html) for more
 details.
 * `--debug`: Specify this option to enable debug-level logging for some parts of the code.
-* `--multi-gpu`: Setting this flag enables the use of multiple GPU's (if available) during training.
Should this be added to Migrating.md?
Good call
Hopefully it will be re-added before the next release (1.0)
Added to the changelog. I think the plan is to make the base optimizer(s) do multi-GPU without a separate flag.
            shape=[None, self.act_size[0]], dtype=tf.float32, name="action_holder"
        )

    def _create_dc_actor(
There's a bit of repeated code between this function and _create_cc_actor. Do we need the two functions because we concatenate the previous action? Maybe we could put the repeated code in one place.
Moved some of the repeated code. Per offline discussion: reserve further consolidation for a future PR, since it would require removing the previous actions.
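For what it's worth, here is a rough sketch of the kind of consolidation discussed above, assuming TF 1.x style layers to match the codebase at the time; the helper name `_create_observation_encoder` and the exact signatures are illustrative, not the actual ml-agents code:

```python
import tensorflow as tf  # assumes a TF 1.x style API


def _create_observation_encoder(vector_in, h_size, num_layers):
    # Shared between the continuous and discrete actors: encode observations
    # into a single hidden representation.
    hidden = vector_in
    for i in range(num_layers):
        hidden = tf.layers.dense(
            hidden, h_size, activation=tf.nn.relu, name="encoder_{}".format(i)
        )
    return hidden


def _create_cc_actor(encoded, act_size):
    # Continuous control: a single head producing one mean per action dimension.
    return tf.layers.dense(encoded, act_size[0], activation=None, name="mu")


def _create_dc_actor(encoded, act_size, prev_action_one_hot=None):
    # Discrete control: optionally concatenate the previous action, which is
    # the part that currently keeps the two code paths separate.
    if prev_action_one_hot is not None:
        encoded = tf.concat([encoded, prev_action_one_hot], axis=1)
    return [
        tf.layers.dense(encoded, size, activation=None, name="branch_{}".format(i))
        for i, size in enumerate(act_size)
    ]
```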
class TFOptimizer(Optimizer):  # pylint: disable=W0223
    def __init__(self, policy: TFPolicy, trainer_params: Dict[str, Any]):
        self.sess = policy.sess
        self.policy = policy
Slightly concerned about making our TFOptimizer base single-policy-centric. We might need to build another class on top of Optimizer for multi-agent training.
Conclusion: insert new single-agent and multi-agent classes between TFOptimizer and the concrete optimizers (e.g. PPOOptimizer). This will be done in a future PR.
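To make the idea concrete, a sketch of such a hierarchy; the SingleAgentTFOptimizer and MultiAgentTFOptimizer names and bodies are hypothetical, since that layer is deferred to a future PR:

```python
from abc import ABC, abstractmethod
from typing import Any, Dict, List


class Optimizer(ABC):
    """Framework-agnostic base: anything that can be updated from a batch."""

    @abstractmethod
    def update(self, batch: Dict[str, Any]) -> Dict[str, float]:
        ...


class TFOptimizer(Optimizer):
    """TensorFlow plumbing (session, feed dicts) without policy-count assumptions."""


class SingleAgentTFOptimizer(TFOptimizer):
    """Hypothetical layer: owns exactly one policy, as PPO and SAC do today."""

    def __init__(self, policy: Any):
        self.policy = policy


class MultiAgentTFOptimizer(TFOptimizer):
    """Hypothetical layer: owns several policies for multi-agent training."""

    def __init__(self, policies: List[Any]):
        self.policies = policies


class PPOOptimizer(SingleAgentTFOptimizer):
    """Algorithm-specific losses and update op, built on the single-agent layer."""

    def update(self, batch: Dict[str, Any]) -> Dict[str, float]:
        return {"policy_loss": 0.0, "value_loss": 0.0}
```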
In an effort to make the Trainer codebase more modular, we're moving algorithm-specific code into a new class (the Optimizer), while the Policy simply maps observations to actions. The Optimizer is given a Policy and constructs the networks it needs (e.g. the value network) around that Policy. The Trainer then calls update on the Optimizer, not on the Policy.
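As a rough illustration of this split, here is a minimal sketch of the intended shape of the code; the class and method names below are simplified stand-ins, not the actual ml-agents classes:

```python
from typing import Any, Dict, List


class Policy:
    """Maps observations to actions; holds no algorithm-specific training code."""

    def get_action(self, observation: List[float]) -> List[float]:
        # Placeholder: a real policy would run its network here.
        return [0.0]


class Optimizer:
    """Owns the algorithm-specific parts: extra networks, losses, update op."""

    def __init__(self, policy: Policy, trainer_params: Dict[str, Any]):
        self.policy = policy
        self.trainer_params = trainer_params
        # Networks such as the value head are constructed here, around the
        # policy, rather than inside the Policy itself.

    def update(self, batch: Dict[str, Any]) -> Dict[str, float]:
        # Compute losses from the batch and the policy, apply gradients,
        # and return training statistics.
        return {"policy_loss": 0.0, "value_loss": 0.0}


class Trainer:
    """Collects experience and drives training through the Optimizer."""

    def __init__(self, optimizer: Optimizer):
        self.optimizer = optimizer
        self.policy = optimizer.policy  # actions still come from the Policy

    def train_step(self, batch: Dict[str, Any]) -> Dict[str, float]:
        # The Trainer calls update on the Optimizer, not on the Policy.
        return self.optimizer.update(batch)
```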
There are a few gotchas. For one, we're no longer saving the memories from the value network, so they're zeroed out at the beginning of each trajectory. A burn-in of 10% of the sequence length was added to combat any negative effects of this.
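To illustrate the burn-in, here is a minimal sketch assuming a per-timestep loss mask; the function name and the mask-based mechanism are illustrative, and only the 10%-of-sequence-length figure comes from the description above:

```python
import numpy as np


def burn_in_mask(sequence_length: int, burn_in_ratio: float = 0.1) -> np.ndarray:
    """Return a 0/1 mask that excludes the first `burn_in_ratio` of each sequence.

    Because value-network memories are now zeroed at the start of each
    trajectory, the earliest steps of a sequence run on a "cold" recurrent
    state; masking their loss contribution softens that effect.
    """
    burn_in_steps = int(sequence_length * burn_in_ratio)
    mask = np.ones(sequence_length, dtype=np.float32)
    mask[:burn_in_steps] = 0.0
    return mask


# Example: with a sequence length of 64, the first 6 steps are dropped from the loss.
print(burn_in_mask(64))
```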
Furthermore, multi-GPU support has been temporarily removed. A MultiGPUPPOOptimizer class will take care of this functionality later.
Things not yet addressed in this PR that will be in following PRs:
* Exposing Policy attributes (e.g. `policy.use_recurrent`) as public properties and defining a TFPolicy interface.

Functionality is currently being tested on cloud, but thoughts on the code structure would be much appreciated.
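On that last point, one possible shape for such an interface; only `use_recurrent` and the TFPolicy name come from the discussion above, while `evaluate` is an example member added here for illustration:

```python
from abc import ABC, abstractmethod
from typing import Any, Dict


class TFPolicy(ABC):
    """Sketch of a TFPolicy interface; not the final design."""

    @property
    @abstractmethod
    def use_recurrent(self) -> bool:
        """Whether this policy carries a recurrent (memory) state."""

    @abstractmethod
    def evaluate(self, batched_observations: Dict[str, Any]) -> Dict[str, Any]:
        """Map a batch of observations to actions and auxiliary outputs."""
```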