RFC: Easily Customizable Optimizer.minimize #234
Conversation
In TensorFlow Addons, we have multiple wrappers/hooks that need to work with arbitrary optimizers, so we need to make sure the API covers those use cases.
Thank you in advance. After this RFC is merged and implemented, we should migrate those 4 wrappers to the new API. Link to the nicely formatted document: https://github.com/omalleyt12/community/blob/patch-1/rfcs/2020-04-20-Optimizer-minimize.md
If possible, could you all write some pseudo-code of what each wrapper implementation would look like with this new API? That would allow us to answer this last question.
Please send out an email to tensorflow@developers.org announcing this pull request, and allow at least a two-week review period, per: https://github.com/tensorflow/community/blob/master/governance/TF-RFCs.md
Thanks for the ping @gabrieldemarmiesse! Our API use case:
The challenge here is that the set of variables to be decayed might not be the same as the set of trainable variables (e.g., we might want to exclude batch norm parameters), so the API needs maximum flexibility here. Possible solutions:
Thanks for looping these people in @gabrieldemarmiesse! @PhilJd Thanks for the comments!

Re weight decay: in general you can still wrap `minimize` in a subclass:

```python
class WeightDecayOptimizer(Optimizer):
    def minimize(self, loss, variables, tape, decay_variables=None):
        if decay_variables is None:
            decay_variables = variables
        for v in decay_variables:
            v.assign_sub(self.weight_decay * v)
        super().minimize(loss, variables, tape)
```

This would work, although it's worth noting as well that oftentimes […]. This wouldn't work with the default […]. In general there's probably a question of whether […].
Two comments:
- Some steps are configurable via `Optimizer` constructor arguments. Why not all? We should be consistent.
- The API expects users to override private methods. We should avoid this. Any method that is meant to be overridable should be public.

Recommendation:
- Keep the methods `transform_loss`, `transform_unaggregated_gradients`, and `transform_gradients` private, but make them all configurable via constructor arguments. Since they can be configured in the constructor, they should not be overridden by users (users can override the constructor).
- Make `apply_updates` and `get_gradients` public (they're not meant to be configurable via a constructor argument), since they are meant to be overridden.
- Make method signatures consistent by having `get_gradients` return `grads_and_vars` instead of just gradients (since all subsequent methods expect `grads_and_vars`).
- Consider renaming `transform_unaggregated_gradients` and `transform_gradients` to highlight the difference between the two (e.g. in a non-distributed setting, is it `transform_unaggregated_gradients` that is applied to the gradients, or `transform_gradients`, or both?)
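A minimal sketch of the recommended shape (hypothetical plain Python, not the actual Keras class; the toy gradient and learning rate are assumptions for illustration): private hooks are configured through the constructor, the public methods are the ones meant for overriding, and `get_gradients` returns `grads_and_vars` so every pipeline stage shares one signature.

```python
class Optimizer:
    def __init__(self, transform_loss=None, transform_gradients=None):
        # Private hooks: configured here, not overridden by subclasses.
        self._transform_loss_fn = transform_loss or (lambda loss: loss)
        self._transform_gradients_fn = transform_gradients or (lambda gv: gv)

    def get_gradients(self, loss, variables):
        # Public, meant for overriding. Returns grads_and_vars so its
        # signature matches the later pipeline stages.
        return [(2.0 * v, v) for v in variables]  # toy gradient of loss = v^2

    def apply_updates(self, grads_and_vars):
        # Public, meant for overriding. Toy SGD step with lr = 0.25.
        return [v - 0.25 * g for g, v in grads_and_vars]

    def minimize(self, loss, variables):
        loss = self._transform_loss_fn(loss)
        grads_and_vars = self.get_gradients(loss, variables)
        grads_and_vars = self._transform_gradients_fn(grads_and_vars)
        return self.apply_updates(grads_and_vars)

result = Optimizer().minimize(loss=None, variables=[1.0])  # [0.5]
```

Simple use cases pass the hook functions to `__init__`; advanced use cases subclass and override the two public methods.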
```python
    grads_and_vars = self._transform_gradients(grads_and_vars)
    return grads_and_vars

def apply_gradients(self, grads_and_vars, aggregate=True):
```
The difference between `apply_updates` and `apply_gradients` may be confusing. We should consider renaming `apply_updates` to highlight the difference (since we can't rename `apply_gradients`).
```python
        self.transform_gradients_fns = transform_gradients_fns

    def _transform_loss(self, loss):
        # Can be overridden in subclasses
```
Can subclassers rely on underscored method names? If these are backwards-compatible, we should make that clear by dropping the underscore. If they are not, we should not encourage subclassers to rely on them.
I think there is an interesting naming question around this that pops up elsewhere too: how should we name methods intended for overriding that users shouldn't call directly? We run into this with, e.g., `train_step` / `make_train_function` overriding too.
That's why I prefer that these things are constructor arguments instead of methods to be overridden.
@alextp how about we separate the TF/Keras logic from the "user" logic? Some pseudo-code below:

```python
# Abstract class: keeps the framework-facing inner workings private.
class _WrappingOptimizer:
    def _pre_grad_aggregation(self, *a, **kw):
        # do something
        self.transform_loss(*a, **kw)
        # do something

    def _aggregate_grads(self, *a, **kw):
        pass
    ...

# Public class: only the methods users are expected to overwrite.
class WrappingOptimizer(_WrappingOptimizer):
    def transform_loss(self, *a, **kw):
        pass

    def aggregate_grads(self, *a, **kw):
        pass
```

Not sure if it is clear ;) We could separate how Keras will use and call the different methods from what we expect the user to overwrite and modify. That way we can keep a stable public API and change how we use the exposed API as often as required. Going that way would probably also be a little more readable: passing functions around makes the code quite hard to read.
I changed the methods to public for now and added a discussion item for whether they should be private or public.

Re methods vs. init arguments, my opinion is this: init arguments are great for the simple use cases we expect most users to have (e.g. gradient clipping, custom weight decay, aggregating by mean rather than by sum). In these examples, each init argument is self-contained.

However, for more advanced use cases like loss scaling and differential-privacy optimizers, the transformations needed at each step of the process (loss transform, unaggregated gradient transform, gradient aggregation, aggregated gradient transform) are tightly coupled. In these cases, having a subclass that contains all of this tightly coupled logic seems to make the most sense to me.

I think the current design, where the two most common transforms (aggregation and post-aggregation transformation) can be passed as `__init__` arguments, and every discrete step of the `minimize` process has its own overridable method, achieves the best of both worlds: simple use cases don't require subclassing, and advanced users have maximum flexibility.
@reedwm re related comment
```python
        # Can be overridden in subclasses
        return loss

    def _get_gradients(self, loss, variables, tape):
```
Can you add docstrings to this proposal? It's not obvious from reading through what the contract for each of these methods is. What can/must this return? What are each of these args? As a subclasser, those are important details to understand.
/CC @DEKHTIARJonathan, @tgaddair. This API will be relevant to Horovod's distributed optimizer.
Thanks @reedwm, much appreciated to be informed :) I actually truly like the idea of this RFC. I have defended many of the design points made in this RFC in this PR: tensorflow/tensorflow#36398. About:
I would be in favor of this. Having `OptimizerWrapper` would greatly improve the maintainability of AMP and Horovod, and would give anyone thinking of plugging into TensorFlow "a standard" way to do it. On the core API side, it also simplifies API support, because you can state that the public API of the […]. Overall, I'm really enthusiastic about this RFC. Sounds like a great improvement 👍
```python
def apply_gradients(self, grads_and_vars, aggregate=True):
    if aggregate:
        grads_and_vars = self._aggregate_gradients(grads_and_vars)
    grads_and_vars = self._transform_gradients(grads_and_vars)  # No-op by default
```
Again for this check: shouldn't this transformation happen regardless of aggregation? Are we guarding it because of the dynamic where `aggregate_gradients` must be called exactly once to avoid double-aggregation?
Since `transform_gradients` is a method that applies only to aggregated gradients, in this case it can't be performed without aggregation.
@hyang0129 could you take a look at this RFC to see if it would make tensorflow/addons#969 easier to implement? Maybe add some comments if you have ideas. @Squadrick @CyberZHG friendly ping to see if this RFC can simplify the implementation of stochastic weight averaging, the moving-average wrapper, and the lookahead mechanism in TensorFlow Addons.
@omalleyt12 gradient transformations may have unintended consequences when paired with moment-based optimizers such as Adam. These optimizers use the first and second moments of the gradient to compute adaptive learning rates for each parameter. Scaling a gradient by a factor X will not scale the parameter update by a factor of X, because the moments of the gradient do not scale linearly with the gradient.
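The point about moments can be seen with a few lines of plain Python (a hand-rolled, single-parameter Adam step with the usual hyperparameter names; not TF code): uniformly scaling the gradients leaves the Adam update essentially unchanged rather than scaling it.

```python
def adam_update(grads, lr=0.001, b1=0.9, b2=0.999, eps=1e-8):
    """Final Adam update for one parameter after seeing `grads` in sequence."""
    m = v = 0.0
    for t, g in enumerate(grads, start=1):
        m = b1 * m + (1 - b1) * g          # first moment (running mean of gradients)
        v = b2 * v + (1 - b2) * g * g      # second moment (running mean of squares)
        m_hat = m / (1 - b1 ** t)          # bias correction
        v_hat = v / (1 - b2 ** t)
    return lr * m_hat / (v_hat ** 0.5 + eps)

u1 = adam_update([0.1, 0.2, 0.1])
u2 = adam_update([100 * g for g in [0.1, 0.2, 0.1]])  # gradients scaled 100x
# u2 is almost exactly u1, not 100 * u1: the sqrt(v_hat) in the denominator
# cancels a uniform scale factor on the gradients.
```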
@hyang0129 Agreed, the loss-scaling optimizer example doesn't actually scale the gradients. What it does is temporarily scale up the loss to compute gradients in a numerically stable way, and then unscale them before applying variable updates. For other, user-written gradient transformations such as gradient clipping, ensuring that the math does what they want is up to them.
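A minimal sketch of that flow (hypothetical plain Python, not the actual loss-scaling optimizer; `compute_grads` stands in for `tape.gradient`): the loss is scaled before differentiation and the gradients are unscaled afterwards, so the applied updates are mathematically unchanged.

```python
LOSS_SCALE = 1024.0  # assumed fixed scale; real loss scaling adjusts it dynamically

def gradients_with_loss_scaling(compute_grads, loss):
    """compute_grads(loss_value) -> gradients, linear in its input."""
    scaled_grads = compute_grads(loss * LOSS_SCALE)   # keeps small fp16 grads representable
    return [g / LOSS_SCALE for g in scaled_grads]     # unscale before apply_gradients

# Toy check: gradients are linear in the loss, so the scale cancels exactly.
compute_grads = lambda l: [2.0 * l, -0.5 * l]
grads = gradients_with_loss_scaling(compute_grads, 3.0)  # [6.0, -1.5]
```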
@gabrieldemarmiesse Sorry about the delay. Moving […]. Once […].
Or, if backwards compatibility is not needed, simply:

```python
optimizer = tf.keras.optimizers.Adam(1e-3, aggregate_gradients=horovod.aggregate)
```

## `OptimizerWrapper`

With this proposal, we should also release an `OptimizerWrapper` class. This class will make it easier for developers to create subclasses that wrap an `Optimizer` while providing additional functionality, such as mixed-precision, Horovod, or differential-privacy use cases.
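A rough sketch of what such a wrapper could look like (hypothetical plain Python; the class shape, hook names, and toy SGD inner optimizer are all assumptions, not the proposed API): delegate to the inner optimizer and override only the hook you need.

```python
class OptimizerWrapper:
    def __init__(self, inner_optimizer):
        self.inner = inner_optimizer

    def transform_gradients(self, grads_and_vars):
        # Pass-through by default; mixed-precision, Horovod, or
        # differential-privacy wrappers would override hooks like this one.
        return grads_and_vars

    def apply_gradients(self, grads_and_vars):
        return self.inner.apply_gradients(self.transform_gradients(grads_and_vars))

class ToySGD:
    """Stand-in for a real inner optimizer."""
    def __init__(self, lr):
        self.lr = lr
    def apply_gradients(self, grads_and_vars):
        return [v - self.lr * g for g, v in grads_and_vars]

class ClippingWrapper(OptimizerWrapper):
    def transform_gradients(self, grads_and_vars):
        return [(max(min(g, 1.0), -1.0), v) for g, v in grads_and_vars]

opt = ClippingWrapper(ToySGD(lr=0.5))
# gradient 5.0 is clipped to 1.0 before the inner SGD step: 2.0 - 0.5 * 1.0
updated = opt.apply_gradients([(5.0, 2.0)])  # [1.5]
```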
Will the wrapper go through separate review? What APIs will it provide?
```python
if aggregate_gradients is None:
    aggregate_gradients = all_reduce_sum
self.aggregate_gradients_fn = aggregate_gradients
self.transform_gradients_fns = transform_gradients
```
It's unexpected that `aggregate_gradients` is a single fn while `transform_gradients` is a list of fns. The naming with `fn` and `fns` at least makes this clear, though one wonders whether it's also worth making the two args consistent, e.g. allowing list-wise composition for aggregate gradients or, better yet, requiring one fn for transform and expecting users to compose the fn themselves. That is:

```python
# Current input:
transform_gradients_fns = [fn_1, fn_2]

# Composition on the user side:
def transform_gradients_composed(grads_and_vars):
    return fn_2(fn_1(grads_and_vars))

Optimizer(transform_gradients=transform_gradients_composed)
```

Would that work, or is there some reason that's less preferable?
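If the single-fn contract were chosen, composition is cheap to provide or to do by hand. A small sketch (hypothetical helper, not an existing TF utility; the two transforms are toy stand-ins operating on a list of gradient values):

```python
from functools import reduce

def compose(*fns):
    """Left-to-right composition: compose(f, g)(x) == g(f(x))."""
    return lambda x: reduce(lambda acc, fn: fn(acc), fns, x)

# Toy stand-ins for gradient transforms:
clip = lambda grads: [max(min(g, 1.0), -1.0) for g in grads]
halve = lambda grads: [0.5 * g for g in grads]

transform_gradients = compose(clip, halve)
result = transform_gradients([0.5, 2.0])  # clip -> [0.5, 1.0], then halve -> [0.25, 0.5]
```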
## Questions and Discussion Topics

(1) Should these new hooks be private or public methods?

(2) Should we create an initializer argument for each hook, or only for the ones we expect most users to need (`aggregate_gradients` and `transform_gradients`)?
(2.1) Or no constructor arguments and no overridables at all?

(3) What are the contracts for the constructor-arg functions? Single fns, lists, or something more formal that we can better error-check and type-check?
@omalleyt12 Please post the notes from the design review here.
@omalleyt12 A gentle ping so we can finish this up.
@omalleyt12 told me that this design will likely need another round of review at some point, as there were differing opinions.
Hello, do you have any information about when optimizers will be allowed to perform global gradient clipping across several GPUs? Some of my work quite critically depends on this. Thank you!
@adriendoerig if you use Horovod, that is already quite easy to do :)
This section discusses the experience of supporting mixed precision and Horovod in Keras's built-in training logic (hereafter called `Model.fit`).

Keras now allows users to write custom training logic for their `Model`s by overriding `Model.train_step`: [code](https://github.com/tensorflow/tensorflow/blob/master/tensorflow/python/keras/engine/training.py#L538). The default implementation of this method is 8 lines long and fully supports all types of `Model`s, `loss`es, `metric`s, etc. that Keras supports. It attempts to serve as a reference that users can copy/paste to start writing their own training logic.
I think the two line references here should probably not point at the master branch: they no longer point to the code you want (as far as I can tell).
Just wondering: why is clip-by-global-norm also not easily supported under a distribution strategy? It would be really helpful, as it keeps the gradient direction unchanged.
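For reference, the math in question (a pure-Python analogue written from the documented behavior of `tf.clip_by_global_norm`, not TF source): every gradient is scaled by one common factor, so the overall direction is preserved; the catch under a distribution strategy is that the global norm must be computed over all of the gradients together, not per replica.

```python
import math

def clip_by_global_norm(grads, clip_norm):
    """Scale all gradients by a single factor so their joint norm <= clip_norm."""
    global_norm = math.sqrt(sum(g * g for g in grads))
    if global_norm <= clip_norm:
        return grads, global_norm
    scale = clip_norm / global_norm          # one factor for every gradient
    return [g * scale for g in grads], global_norm

clipped, norm = clip_by_global_norm([3.0, 4.0], clip_norm=2.5)
# norm == 5.0; clipped == [1.5, 2.0]: same direction, global norm now 2.5
```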
@omalleyt12, @fchollet Are you still pursuing this, or should this be closed out?
## Objective

Create an `Optimizer` API that gives `Optimizer` subclasses full control of gradient updates. The API should ensure `Optimizer`s can be accessed via a unified API and will not leak abstractions. Training loops should not be required to know the internal details of how the `Optimizer` chooses to:

- Scale losses and gradients
- Aggregate gradients
- Clip gradients
- etc.

We also need to maintain endpoints with maximum flexibility for those users who do want control over these items.

This API will enable users to write training loops that are interoperable with a wide range of `Optimizer`s.

Specific use cases considered:

- Gradient clipping
- Mixed precision
- Horovod
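The interoperability goal can be illustrated with a toy loop (hypothetical plain Python; `ToySGD` and the `minimize` signature are assumptions for illustration): the loop never touches scaling, aggregation, or clipping, because those live behind the optimizer's `minimize`.

```python
class ToySGD:
    """Stand-in optimizer: any scaling/aggregation/clipping would live in here."""
    def minimize(self, grads, variables):
        return [v - 0.25 * g for g, v in zip(grads, variables)]

def train(optimizer, batches, variables, grad_fn):
    # The training loop is optimizer-agnostic: it only calls minimize().
    for batch in batches:
        variables = optimizer.minimize(grad_fn(batch, variables), variables)
    return variables

# Minimize v^2 (gradient 2v); the batch is unused in this toy grad_fn.
final = train(ToySGD(), range(3), [1.0], lambda batch, vs: [2.0 * v for v in vs])
# v halves each step: 1.0 -> 0.5 -> 0.25 -> 0.125
```

Any optimizer exposing the same `minimize` contract, from gradient clipping to mixed precision to Horovod, could be dropped into `train` unchanged.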