
RFC: Easily Customizable Optimizer.minimize #234

Open
wants to merge 9 commits into master

Conversation

@omalleyt12 (Contributor) commented Apr 20, 2020

Status: In Revision
RFC #: 234
Author(s): omalleyt12@
Sponsors: apassos@, fchollet@, karmel@
Updated: 2020-04-20

Objective

Create an Optimizer API that gives Optimizer subclasses full control of gradient updates. The API should provide a unified interface to Optimizers and should not leak abstractions. Training loops should not be required to know the internal details of how the Optimizer chooses to:

  • Scale losses and gradients

  • Aggregate gradients

  • Clip gradients

  • etc.

We also need to maintain endpoints that give maximum flexibility to those users who do want control over these items.

This API will enable users to write training loops that are interoperable with a wide range of Optimizers.

Specific use cases considered:

  • Gradient clipping

  • Mixed precision

  • Horovod
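For orientation, a minimal sketch of the decomposition the proposal implies, reconstructed from the hook names discussed in this thread (the exact signatures, and whether the hooks are public or private, are debated below):

```python
class Optimizer:

  def minimize(self, loss, variables, tape=None):
    # Every step below is a separately overridable/configurable hook.
    loss = self.transform_loss(loss)                            # e.g. loss scaling
    grads_and_vars = self.get_gradients(loss, variables, tape)
    grads_and_vars = self.transform_unaggregated_gradients(grads_and_vars)
    grads_and_vars = self.aggregate_gradients(grads_and_vars)   # e.g. all-reduce
    grads_and_vars = self.transform_gradients(grads_and_vars)   # e.g. clipping
    return self.apply_updates(grads_and_vars)
```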

@gabrieldemarmiesse (Member) commented Apr 20, 2020

In TensorFlow Addons, we have multiple wrappers/hooks that need to work with arbitrary optimizers, so we need to make sure the API covers those use cases.

Thank you in advance. After this RFC is merged and implemented, we should migrate those 4 wrappers to the new API.

Link to the nicely formatted document: https://github.com/omalleyt12/community/blob/patch-1/rfcs/2020-04-20-Optimizer-minimize.md

Should we create a utility class to help with wrapping an Optimizer? I.e. OptimizerWrapper?

If possible, could you all write some pseudocode of what each wrapper implementation would look like with this new API? That would let us answer this last question.

@ematejska (Contributor)

Please send out an email to tensorflow@developers.org with the announcement of this pull request and give at least a two week review period per: https://github.com/tensorflow/community/blob/master/governance/TF-RFCs.md

@ematejska added the "RFC: Proposed RFC Design Document" label Apr 20, 2020
@PhilJd commented Apr 21, 2020

Thanks for the ping @gabrieldemarmiesse!
Unfortunately, I've never used Keras, so I'm missing a lot of context and may be overlooking an existing solution. I'd be very happy to get comments on the proposals below :)

Our API use case:
In general, the idea of decoupled weight decay is to decouple the decaying variable update from the optimizer's variable update. While weight decay is equivalent to an L2 penalty on the weights for some optimizers, this is not true for adaptive optimizers such as Adam. So the more general solution would be something like (pseudocode):

optimizer.apply_gradients(grads_and_vars)
# Decay step, decoupled from the gradient-based update:
for var in vars_to_decay:
    var.assign_sub(weight_decay * var)

The challenge here is that the set of variables to be decayed might not be the same as the set of trainable variables (e.g., we might want to exclude batch norm parameters). For maximum flexibility, we've been wrapping the optimizer's minimize method and allowing users to pass a list of the variables to decay, as minimize(loss, var_list, ..., decay_var_list=None).

Possible solutions:

  • Make the decay_var_list a member variable of the optimizer. This might be annoying if the variables to minimize differ between train steps.
  • Create a second optimizer that only decays variables and chain the optimizers (I don't know if chaining optimizers in Keras is possible).
  • Always decay all variables. To exclude some variables from decay, one would need to create separate optimizers for those. Again, I don't know if using two or more different optimizers is possible in Keras (especially with model.fit).

@omalleyt12 (Contributor Author)

Thanks for looping these people in @gabrieldemarmiesse!

@PhilJd Thanks for the comments!

Re weight decay: In general you can still wrap Optimizer.minimize to perform this:

class WeightDecayOptimizer(Optimizer):

  def minimize(self, loss, variables, tape, decay_variables=None):
    if decay_variables is None:
      decay_variables = variables
    for var in decay_variables:
      var.assign_sub(self.weight_decay * var)
    super(WeightDecayOptimizer, self).minimize(loss, variables, tape)

The decay_variables argument would not be passed in the default Model.fit, so only the use case where the variables and decay_variables are the same would be supported. This is the case for Model.fit today though, so that wouldn't be a regression. Users could also override Model.train_step (which allows users to provide their own custom training step logic) to support this use case if they want.

Make the decay_var_list a member variable of the optimizer. This might be annoying if the variables to minimize differ between train steps.

This would work, although it's worth noting that Variables are often created when the Model is first called on data, so for most workflows the Model's Variables aren't guaranteed to exist when the Optimizer is created.

Create a second optimizer that only decays variables and chain the optimizers (I don't know if chaining optimizers in keras is possible)

This wouldn't work with the default Model.fit logic, whereas the decay_variables argument pattern at least works as long as the decay_variables and variables are the same.

In general there's probably a question of whether Model.fit should have a separate configuration for how to decay variables, since it seems orthogonal to the Optimizer (in that, IIUC, it never needs to know about the gradients).

@fchollet (Member) left a comment:

Two comments:

  • Some steps are configurable via Optimizer constructor arguments. Why not all? We should be consistent.
  • The API expects users to override private methods. We should avoid this. Any method that is meant to be overridable should be public.

Recommendation:

  • Keep the methods transform_loss, transform_unaggregated_gradients, transform_gradients private, but make them all configurable via constructor arguments. Since they can be configured in the constructor, they should not be overridden by users (users can override the constructor).
  • Make apply_updates & get_gradients public (they're not meant to be configurable via a constructor argument), since they are meant to be overridden.
  • Make method signatures consistent by having get_gradients return grads_and_vars instead of just gradients (since all subsequent methods expect grads_and_vars)
  • Consider renaming transform_unaggregated_gradients and transform_gradients to highlight the difference between the two (e.g. in a non-distributed setting, is it transform_unaggregated_gradients that is applied to the gradients, or transform_gradients, or both?)
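For the signature-consistency point, a sketch of a grads_and_vars-returning get_gradients (hypothetical code, not taken from the RFC text):

```python
def get_gradients(self, loss, variables, tape):
  grads = tape.gradient(loss, variables)
  # Return (gradient, variable) pairs so the output feeds directly into
  # the subsequent transform / aggregate / apply steps.
  return list(zip(grads, variables))
```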

    grads_and_vars = self._transform_gradients(grads_and_vars)
    return grads_and_vars

  def apply_gradients(self, grads_and_vars, aggregate=True):
Member:

The difference between apply_updates and apply_gradients may be confusing. We should consider renaming apply_updates to highlight the difference (since we can't rename apply_gradients).

    self.transform_gradients_fns = transform_gradients_fns

  def _transform_loss(self, loss):
    # Can be overridden in subclasses
Comment:

Can subclassers rely on underscored method names? If these are backwards-compatible, we should make that clear by removing the underscore. If they are not, we should not encourage subclassers to rely on them.

Contributor:

I think there is an interesting naming question around this that pops up elsewhere too: How should we name methods intended for overriding that users shouldn't call directly? We run into this w/ e.g. training step/ make_train_function overriding too.

Contributor:

That's why I prefer that these things are constructor arguments instead of methods to be overridden.

@DEKHTIARJonathan commented May 4, 2020

@alextp how about we separate the TF/Keras logic from the "user" logic?
Some pseudocode below:

# Abstract class: keeps the framework-facing inner workings.
class _WrappingOptimizer:

    def _pre_grad_aggregation(self, *args, **kwargs):
        # do something
        self.transform_loss(*args, **kwargs)
        # do something

    def _aggregate_grads(self, *args, **kwargs):
        pass

    ...

# Public class: the hooks users are expected to override.
class WrappingOptimizer(_WrappingOptimizer):

    def transform_loss(self, *args, **kwargs):
        pass

    def aggregate_grads(self, *args, **kwargs):
        pass

Not sure if this is clear ;)
The idea is to separate how Keras will use and call the different methods from what we expect the user to override and modify.

That way we can keep a stable public API while changing how we use it internally as often as required.

Going that way would probably also be a little more readable. Passing functions around makes the code quite hard to read...

@omalleyt12 (Contributor Author):

I changed the methods to public for now and added a discussion item on whether they should be private or public.

Re methods vs init arguments, my opinion is this: init arguments are great for the simple use cases we expect most users to have (e.g. gradient clipping, custom weight decay, aggregating by mean rather than by sum, etc). In these examples, each init argument is self-contained.

However, for more advanced use cases like LossScaling and differential privacy optimizers, the transformations needed at each step of the process (loss transform, unaggregated gradient transform, gradient aggregation, aggregated gradient transform) are tightly coupled. In these cases, having a subclass that contains all of this tightly coupled logic seems to make the most sense to me.

I think the current design, where the two most common transforms (aggregation and post-aggregation transformation) can be passed as __init__ arguments, and every discrete step of the minimize process has its own overridable method, achieves the best of both worlds: simple use cases don't require subclassing, and advanced users have maximum flexibility.
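To illustrate the simple end of that spectrum, a hedged sketch of gradient clipping configured purely through __init__ arguments (the argument name and its list-of-functions contract follow this thread; the exact spelling is one of the open questions below):

```python
import tensorflow as tf

def clip_gradients(grads_and_vars):
  # Post-aggregation transform: clip each gradient to a maximum norm.
  return [(tf.clip_by_norm(g, 1.0), v) for g, v in grads_and_vars]

# The common case needs no subclassing.
optimizer = tf.keras.optimizers.Adam(1e-3, transform_gradients=[clip_gradients])
```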

@omalleyt12 (Contributor Author):

@reedwm re related comment

    # Can be overridden in subclasses
    return loss

  def _get_gradients(self, loss, variables, tape):
Comment:

Can you add docstrings to this proposal? It's not obvious from reading through what the contract for each of these methods is. What can/must this return? What are each of these args? As a subclasser, those are important details to understand.

@ematejska changed the title from "Easily Customizable Optimizer.minimize RFC" to "RFC: Easily Customizable Optimizer.minimize" Apr 23, 2020
@reedwm (Member) commented Apr 24, 2020

/CC @DEKHTIARJonathan, @tgaddair. This API will be relevant to Horovod's distributed optimizer.

@DEKHTIARJonathan commented Apr 24, 2020

Thanks @reedwm, I appreciate being kept informed :)

I actually truly like the idea of this RFC. I have defended many of its design points in this PR: tensorflow/tensorflow#36398.

About:

Should we create a utility class to help with wrapping an Optimizer? I.e. OptimizerWrapper? @omalleyt12

I would be in favor of this. Having OptimizerWrapper would greatly improve the maintainability of AMP and Horovod, and would let us provide "a standard" way to plug into TensorFlow for anyone thinking of doing so. On the core API side, it also simplifies support: you can state that the public API of the OptimizerWrapper class is the only one supported long-term, and interface with it in the background however needed, fully abstracted from Horovod or AMP optimizers.

Overall, I'm really enthusiastic about this RFC. Sounds like a great improvement 👍

  def apply_gradients(self, grads_and_vars, aggregate=True):
    if aggregate:
      grads_and_vars = self._aggregate_gradients(grads_and_vars)
      grads_and_vars = self._transform_gradients(grads_and_vars)  # No-op by default
Contributor:

Again, for this check: shouldn't this transformation happen regardless of aggregation?
Are we guarding it because aggregate_gradients must be called exactly once to avoid double aggregation?

@omalleyt12 (Contributor Author):

Since transform_gradients is a method that applies only to aggregated gradients, in this case it can't be performed without aggregation.

@gabrieldemarmiesse (Member)

@hyang0129 could you take a look at this RFC to see if it would make tensorflow/addons#969 easier to implement? Maybe add some comments if you have ideas.

@Squadrick @CyberZHG friendly ping to see if this RFC can make the implementations of Stochastic Weight Averaging, the moving average wrapper, and the lookahead mechanism simpler in TensorFlow Addons.

@hyang0129
@omalleyt12 gradient transformations may have unintended consequences when paired with moment based optimizers such as Adam. These optimizers usually take the first and second moment of the gradient to compute adaptive learning rates for each parameter. Scaling a gradient by a factor X will not scale the parameter update by a factor of X, because the moments of the gradient do not scale linearly with the gradient.
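A quick sanity check of this point for Adam (a sketch, ignoring bias correction and epsilon): scaling every gradient by a factor X scales the moment estimates as

```
m → X·m,   v → X²·v,   so   m/√v → (X·m)/√(X²·v) = m/√v
```

i.e. the per-parameter update is left essentially unchanged rather than scaled by X.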

@omalleyt12 (Contributor Author)

@omalleyt12 gradient transformations may have unintended consequences when paired with moment based optimizers such as Adam. These optimizers usually take the first and second moment of the gradient to compute adaptive learning rates for each parameter. Scaling a gradient by a factor X will not scale the parameter update by a factor of X, because the moments of the gradient do not scale linearly with the gradient.

@hyang0129 Agreed; the loss scaling optimizer example doesn't actually scale the gradients. It temporarily scales up the loss so the gradients are computed in a numerically stable way, and then unscales them before applying Variable updates.

For other, user-written gradient transformations such as gradient clipping, ensuring that the math is doing what they want is up to them
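A hedged sketch of that coupling in terms of the proposed hooks (illustrative only; the real mixed-precision optimizer also handles dynamic scaling and non-finite gradients):

```python
class LossScalingOptimizer(Optimizer):
  """Couples a loss transform with its inverse gradient transform."""

  def __init__(self, *args, scale=2.0 ** 15, **kwargs):
    super().__init__(*args, **kwargs)
    self.scale = scale

  def transform_loss(self, loss):
    # Scale the loss up so small float16 gradients don't flush to zero.
    return loss * self.scale

  def transform_unaggregated_gradients(self, grads_and_vars):
    # Undo the scaling before gradients are aggregated and applied.
    return [(g / self.scale, v) for g, v in grads_and_vars]
```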

@Squadrick (Member)

@gabrieldemarmiesse Sorry about the delay. Migrating average_wrapper would be fairly easy; we only need to override transform_unaggregated_gradients.

Once average_wrapper is migrated, migrating the moving_average and stochastic_weight_average optimizers would be very simple.

@@ -276,6 +284,9 @@ Or, if backwards compatibility is not needed, simply:
optimizer = tf.keras.optimizers.Adam(1e-3, aggregate_gradients=horovod.aggregate)
```

## `OptimizerWrapper`

With this proposal, we should also release an `OptimizerWrapper` class. This class will make it easier for developers to create subclasses that wrap an `Optimizer` while providing additional functionality, such as mixed-precision, Horovod, or differential privacy use cases.
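One hypothetical shape for it (a sketch only; the RFC leaves the wrapper's exact API open, as the comment below asks): delegate every hook to the wrapped optimizer, so subclasses override only the hooks they care about.

```python
class OptimizerWrapper(Optimizer):

  def __init__(self, inner_optimizer, **kwargs):
    super().__init__(**kwargs)
    self.inner_optimizer = inner_optimizer

  # Default behavior: forward each hook to the wrapped optimizer.
  def transform_loss(self, loss):
    return self.inner_optimizer.transform_loss(loss)

  def aggregate_gradients(self, grads_and_vars):
    return self.inner_optimizer.aggregate_gradients(grads_and_vars)

  def apply_updates(self, grads_and_vars):
    return self.inner_optimizer.apply_updates(grads_and_vars)
```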
Comment:

Will the wrapper go through separate review? What APIs will it provide?

    if aggregate_gradients is None:
      aggregate_gradients = all_reduce_sum
    self.aggregate_gradients_fn = aggregate_gradients
    self.transform_gradients_fns = transform_gradients
Comment:

It's unexpected that aggregate_gradients is a single fn while transform_gradients is a list of fns. The naming with fn and fns at least makes this clear, though one wonders if it's worth making the two args consistent, e.g. allowing list-wise composition for aggregate_gradients, or, better yet, requiring one fn for transform and expecting users to compose the fn themselves. That is:

# Current input:
transform_gradients_fns = [fn_1, fn_2]

# Composition on the user side
def transform_gradients_composed(grads_and_vars):
  return fn_2(fn_1(grads_and_vars))

Optimizer(transform_gradients=transform_gradients_composed)

Would that work, or is there some reason that's less preferable?

## Questions and Discussion Topics

(1) Should these new hooks be private or public methods?
(2) Should we create an initializer argument for each hook, or only for the ones we expect most users to need (`aggregate_gradients` and `transform_gradients`)?
Comment:

(2.1) Or no constructor arguments, or no overridables at all?
(3) What are the contracts for the constructor-arg functions? Single fns, lists, or something more formal that we can better error/type-check?

@ematejska (Contributor)

@omalleyt12 Please post the notes from the design review here.

@ematejska (Contributor)

@omalleyt12 A gentle ping so we can finish this up.

@ematejska (Contributor)

@omalleyt12 told me that this design will likely need another round of review at some point as there were differing opinions.

@ematejska removed the "RFC: Proposed RFC Design Document" label Jul 7, 2020
@adriendoerig
Hello, do you have any information about when optimizers will be allowed to perform global gradient clipping across several GPUs? Some of my work quite critically depends on this. Thank you!

@byronyi (Contributor) commented Aug 19, 2020

Hello, do you have any information about when optimizers will be allowed to perform global gradient clipping across several GPUs? Some of my work quite critically depends on this. Thank you!

tensorflow/tensorflow@5357606

@DEKHTIARJonathan commented Aug 19, 2020

@adriendoerig if you use Horovod, that would already be quite easy to do :)


This section discusses the experience of supporting mixed-precision and Horovod in Keras’s built-in training logic (hereafter called Model.fit).

Keras now allows users to write custom training logic for their `Model`s via overriding `Model.train_step`: [code](https://github.com/tensorflow/tensorflow/blob/master/tensorflow/python/keras/engine/training.py#L538). The default implementation of this method is 8 lines long, and fully supports all types of `Model`s, `loss`es, `metric`s, etc that Keras supports. It attempts to serve as a reference that users can copy / paste to start writing their own training logic.
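For context, a custom train_step of roughly that shape, written against the proposed minimize(loss, variables, tape) signature (a sketch; compiled_loss and compiled_metrics follow the Keras API of that era):

```python
import tensorflow as tf

class MyModel(tf.keras.Model):

  def train_step(self, data):
    x, y = data
    with tf.GradientTape() as tape:
      y_pred = self(x, training=True)
      loss = self.compiled_loss(y, y_pred)
    # The optimizer internally handles scaling, aggregation, clipping, etc.
    self.optimizer.minimize(loss, self.trainable_variables, tape=tape)
    self.compiled_metrics.update_state(y, y_pred)
    return {m.name: m.result() for m in self.metrics}
```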
Comment:

I think the two line references here should probably not reference the master branch: as far as I can tell, they no longer point to the code you intend.

@DepenM commented Dec 28, 2021

Just wondering: why is clip by global norm not easily supported under a distribution strategy as well? It would be really helpful, as it keeps the gradient direction unchanged.

@ematejska (Contributor)

@omalleyt12, @fchollet Are you still pursuing this or should this be closed out?
