# Easily Customizable `Optimizer.minimize`

| Status        | Proposed                                                 |
| :------------ | :------------------------------------------------------- |
| **RFC #**     | [234](https://github.com/tensorflow/community/pull/234)  |
| **Author(s)** | [omalleyt12@](https://github.com/omalleyt12)              |
| **Sponsor**   | apassos@, fchollet@, karmel@                              |
| **Updated**   | 2020-04-20                                                |

## Objective

Create an `Optimizer` API that gives `Optimizer` subclasses full control of gradient updates. The API should ensure that all `Optimizer`s can be used through a unified interface, without leaking abstractions. Training loops should not need to know the internal details of how an `Optimizer` chooses to:

* Scale losses and gradients

* Aggregate gradients

* Clip gradients

* etc.

We also need to maintain endpoints with maximum flexibility for those users who do want control over these items.

This API will enable users to write training loops that are interoperable with a wide range of `Optimizer`s.

Specific use cases considered:

* Gradient clipping

* Mixed precision

* `Horovod`

## Background

During backpropagation, there are six possible actions that can be taken when starting from a loss Tensor and ending with a Variable update:

(1) Transform the loss

(2) Calculate the gradients

(3) Transform the unaggregated (per-device) gradients

(4) Aggregate the gradients (across devices)

(5) Transform the aggregated gradients

(6) Apply a variable update based on the gradients

We currently have three `Optimizer` endpoints that start at different points in this process (rough call patterns are sketched below):

* `Optimizer.minimize` - handles 1-6

* `Optimizer.apply_gradients(..., experimental_aggregate_gradients=True)` - handles 4-6

* `Optimizer.apply_gradients(..., experimental_aggregate_gradients=False)` - handles 6
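For concreteness, a rough sketch of how these existing endpoints are invoked today (the `model`, `loss_fn`, and gradient names are illustrative, not part of the proposal):

```python
# Handles steps 1-6: pass a callable loss; the Optimizer computes,
# aggregates, and applies the gradients.
optimizer.minimize(lambda: loss_fn(model(x), y), model.trainable_variables)

# Handles steps 4-6: the caller computes the gradients; the Optimizer
# aggregates them across devices and applies the updates.
optimizer.apply_gradients(zip(gradients, model.trainable_variables),
                          experimental_aggregate_gradients=True)

# Handles step 6 only: the caller has already aggregated the gradients.
optimizer.apply_gradients(zip(aggregated_gradients, model.trainable_variables),
                          experimental_aggregate_gradients=False)
```
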
However, there is no easy way for `Optimizer` subclasses to support custom logic in these steps. This proposal suggests a refactoring of the `Optimizer` class to achieve these goals.

## Motivation

This section discusses the experience of supporting mixed precision and Horovod in Keras's built-in training logic (hereafter called `Model.fit`).

Keras now allows users to write custom training logic for their `Model`s by overriding `Model.train_step`: [code](https://github.com/tensorflow/tensorflow/blob/master/tensorflow/python/keras/engine/training.py#L538). The default implementation of this method is 8 lines long and fully supports all types of `Model`s, `loss`es, `metric`s, etc. that Keras supports. It is intended to serve as a reference that users can copy / paste to start writing their own training logic.

The only remaining pain point is the call to `_minimize` here: [code](https://github.com/tensorflow/tensorflow/blob/master/tensorflow/python/keras/engine/training.py#L1873). This logic is necessary because details of whether an `Optimizer` needs to transform the loss, clip the gradients, perform custom aggregation, etc. have leaked into the main training loop code.

Despite the complexity of `_minimize`, it covers only a small subset of possible optimization logic. Keras continues to receive valid requests to support more custom optimization logic (including hooks for different aggregation methods, different methods of loss reduction, etc.). To continue expanding support for these items, Keras needs a unified API that keeps `Optimizer` implementation details from leaking into the main training loop code.

The proposal below shows how this can be accomplished, and the Examples section shows how it applies to three use cases: gradient clipping, mixed precision, and `Horovod`.

### Custom training loops

The logic above also applies to custom training loops. The design should allow custom training loops to be written so that they work with any `Optimizer`.

## User Benefit

This design will allow users to write full-featured training loops that work with all `Optimizer`s, and to easily perform custom gradient clipping and other transformations.

## Design Proposal

`Optimizer` class:

```python
class Optimizer(object):

  def __init__(self,
               transform_gradients=None,
               aggregate_gradients=all_reduce_sum):
    self.aggregate_gradients_fn = aggregate_gradients
    self.transform_gradients_fns = transform_gradients

  def _transform_loss(self, loss):
    # Can be overridden in subclasses.
    return loss

  def _get_gradients(self, loss, variables, tape):
    # Can be overridden to use jacobian, etc.
    return tape.gradient(loss, variables)

  def _transform_unaggregated_gradients(self, grads_and_vars):
    # Can be overridden in subclasses.
    return grads_and_vars

  def _aggregate_gradients(self, grads_and_vars):
    # Can still be overridden in subclasses if needed.
    if self.aggregate_gradients_fn:
      grads_and_vars = self.aggregate_gradients_fn(grads_and_vars)
    return grads_and_vars

  def _transform_gradients(self, grads_and_vars):
    # Can still be overridden in subclasses if needed.
    if self.transform_gradients_fns:
      for fn in self.transform_gradients_fns:
        grads_and_vars = fn(grads_and_vars)
    return grads_and_vars

  def _apply_updates(self, distribution, grads_and_vars, ...):
    # Calls _resource_apply_{dense | sparse}.
    # Variable updating math is still in _resource_apply_{dense | sparse}.

  def minimize(self, loss, variables, tape=None):
    grads_and_vars = self.compute_gradients(loss, variables, tape)
    self.apply_gradients(grads_and_vars)

  def compute_gradients(self,
                        loss,
                        variables,
                        tape=None,
                        all_reduce_sum_gradients=False):
    if is_tensor(loss) and not tape:
      raise ValueError('Must provide tape with tensor loss.')

    tape = tape or GradientTape()
    with tape:
      if callable(loss):
        loss = loss()
      loss = self._transform_loss(loss)  # A no-op in our built-in optimizers.
    gradients = self._get_gradients(loss, variables, tape)
    grads_and_vars = zip(gradients, variables)
    grads_and_vars = self._transform_unaggregated_gradients(grads_and_vars)
    if all_reduce_sum_gradients:
      grads_and_vars = self._aggregate_gradients(grads_and_vars)
      grads_and_vars = self._transform_gradients(grads_and_vars)
    return grads_and_vars

  def apply_gradients(self, grads_and_vars, aggregate=True):
    if aggregate:
      grads_and_vars = self._aggregate_gradients(grads_and_vars)
      grads_and_vars = self._transform_gradients(grads_and_vars)  # No-op by default.
    # When aggregate=False is passed, only the Variable updates are run.
    # This gives users complete control, in the case that they don't want
    # to use the hooks provided.
    self._apply_updates(grads_and_vars)
```

Use of `Optimizer.minimize` in `Model.train_step`:

```python
class Model:

  def train_step(self, data):
    data = expand_1d(data)
    x, y, sample_weight = unpack_x_y_sample_weight(data)
    with tf.GradientTape() as tape:
      y_pred = self(x, training=True)
      loss = self.compiled_loss(y, y_pred, sample_weight, self.losses)
    self.optimizer.minimize(loss, self.trainable_variables, tape=tape)
    self.compiled_metrics.update_state(y, y_pred, sample_weight)
    return {m.name: m.result() for m in self.metrics}
```
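
The same API makes custom training loops optimizer-agnostic. A minimal sketch (the `dataset`, `model`, and `loss_fn` names are placeholders, not part of the proposal):

```python
for x, y in dataset:
  with tf.GradientTape() as tape:
    loss = loss_fn(model(x, training=True), y)
  # Any loss transforms, gradient aggregation, and gradient transforms the
  # Optimizer defines happen behind this single call.
  optimizer.minimize(loss, model.trainable_variables, tape=tape)
```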

Details of the proposal:

* Adds the ability to accept a loss Tensor and a `GradientTape` in `Optimizer.minimize`.

* Maintains full backwards compatibility. When a callable loss is passed, simply create a `GradientTape` and call the loss inside it, as is currently done.

* Adds public `Optimizer` methods that can be overridden to support custom functionality for the steps outlined in the Background section:

  (1) `Optimizer._transform_loss`

  (2) `Optimizer._get_gradients`

  (3) `Optimizer._transform_unaggregated_gradients`

  (4) `Optimizer._aggregate_gradients`

  (5) `Optimizer._transform_gradients` (aggregated gradients)

  (6) `Optimizer._apply_updates` (calls the existing `_resource_apply_{dense|sparse}`)

  (a) Item (6) mirrors `Sonnet`'s `apply` method (i.e. it is "just the math")

* Uses the `Optimizer.minimize` API in `Model.fit`.

* The `Optimizer.apply_gradients` method is kept. Users who want to control all loss and gradient manipulation themselves, and want the `Optimizer` to simply apply the Variable updates, can call `Optimizer.apply_gradients(..., aggregate=False)` (a sketch of this path follows the list).
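
For users who take the full-control path described in the last bullet, the call pattern might look like the following sketch (`my_aggregate` and `my_clip` are hypothetical user-supplied functions):

```python
grads_and_vars = optimizer.compute_gradients(loss, model.trainable_variables,
                                             tape=tape)
grads_and_vars = my_aggregate(grads_and_vars)  # User-controlled aggregation.
grads_and_vars = my_clip(grads_and_vars)       # User-controlled transform.
# aggregate=False: the Optimizer only runs the Variable updates.
optimizer.apply_gradients(grads_and_vars, aggregate=False)
```
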
## Examples

(1) Custom gradient clipping:

```python
optimizer = tf.keras.optimizers.Adam(0.1, transform_gradients=[my_gradient_clipping])
```
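
For illustration, `my_gradient_clipping` can be any callable mapping a list of `(gradient, variable)` pairs to a transformed list. A minimal sketch (the clip-norm value is arbitrary and not part of the proposal):

```python
import tensorflow as tf

def my_gradient_clipping(grads_and_vars):
  # Clip all gradients to a global norm of 1.0; variables pass through.
  gradients, variables = zip(*grads_and_vars)
  clipped_gradients, _ = tf.clip_by_global_norm(list(gradients), clip_norm=1.0)
  return list(zip(clipped_gradients, variables))
```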

(2) Mixed precision (most complicated example):

```python
class LossScaleOptimizer(Optimizer):

  def __init__(self, optimizer):
    self.optimizer = optimizer

  def _get_hyper(self, name):
    # Optional. Allows access to the wrapped Optimizer's
    # hyperparameters (e.g. learning_rate).
    return self.optimizer._get_hyper(name)

  def _transform_loss(self, loss):
    loss = self.optimizer._transform_loss(loss)
    # Mixed precision needs to scale the loss before calculating gradients.
    return self.scale_loss(loss)

  def _transform_unaggregated_gradients(self, grads_and_vars):
    # Note: For performance, we could add a check here to see if
    # self.optimizer._transform_unaggregated_gradients is not implemented,
    # and if so skip these scalings / unscalings. Or Grappler could
    # optimize them out.
    gradients, variables = unpack(grads_and_vars)
    gradients = self.unscale_gradients(gradients)
    gradients = self.optimizer._transform_unaggregated_gradients(gradients)
    # Mixed precision needs to all-reduce on scaled gradients.
    gradients = self.scale_gradients(gradients)
    return zip(gradients, variables)

  def _aggregate_gradients(self, grads_and_vars):
    return aggregate_in_fp16(grads_and_vars)

  def _transform_gradients(self, grads_and_vars):
    gradients, variables = unpack(grads_and_vars)
    gradients = self.unscale_gradients(gradients)
    gradients = self.optimizer._transform_gradients(gradients)
    return zip(gradients, variables)

  def _apply_updates(self, grads_and_vars):
    return self.optimizer._apply_updates(grads_and_vars)
```
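
Usage would then be a drop-in wrap of any inner optimizer, e.g. `optimizer = LossScaleOptimizer(tf.keras.optimizers.Adam(1e-3))`, with no changes needed in `Model.fit` or a custom training loop.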

(3) Horovod (only needs custom aggregation):

To support backwards compatibility for Horovod:

```python
class HorovodOptimizer(Optimizer):

  def __init__(self, optimizer):
    self.optimizer = optimizer

  def _get_hyper(self, name):
    # See previous example.
    return self.optimizer._get_hyper(name)

  def _aggregate_gradients(self, grads_and_vars):
    return horovod_aggregate_gradients(grads_and_vars)

  # All other methods described in this proposal simply delegate to
  # `self.optimizer`.
```
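
For illustration, `horovod_aggregate_gradients` only needs to all-reduce each gradient across workers. A minimal sketch using Horovod's `allreduce` (this helper is hypothetical; the RFC does not prescribe its implementation):

```python
import horovod.tensorflow as hvd

def horovod_aggregate_gradients(grads_and_vars):
  # Sum each gradient across all Horovod workers; variables pass through.
  return [(hvd.allreduce(grad, op=hvd.Sum), var)
          for grad, var in grads_and_vars]
```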

Or, if backwards compatibility is not needed, simply:

```python
optimizer = tf.keras.optimizers.Adam(1e-3, aggregate_gradients=horovod.aggregate)
```

## Alternatives considered

#### Handle this only in Model.fit, via custom hooks exposed on the Model class

Why rejected:

This shifts the responsibility for implementing and calling these hooks onto each user rather than onto the writer of the `Optimizer` subclass (many users will write custom training logic; far fewer will write `Optimizer` subclasses).

The solution is also too Keras-specific and doesn't solve the general problem.

## Questions and Discussion Topics

Should we create a utility class to help with wrapping an `Optimizer`, i.e. an `OptimizerWrapper`?
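
One possible shape for such a utility, sketched here purely for discussion (the class body is an assumption, not part of the proposal):

```python
class OptimizerWrapper(Optimizer):
  """Delegates every hook to a wrapped Optimizer unless overridden."""

  def __init__(self, optimizer):
    self.optimizer = optimizer

  def _transform_loss(self, loss):
    return self.optimizer._transform_loss(loss)

  def _get_gradients(self, loss, variables, tape):
    return self.optimizer._get_gradients(loss, variables, tape)

  def _transform_unaggregated_gradients(self, grads_and_vars):
    return self.optimizer._transform_unaggregated_gradients(grads_and_vars)

  def _aggregate_gradients(self, grads_and_vars):
    return self.optimizer._aggregate_gradients(grads_and_vars)

  def _transform_gradients(self, grads_and_vars):
    return self.optimizer._transform_gradients(grads_and_vars)

  def _apply_updates(self, grads_and_vars):
    return self.optimizer._apply_updates(grads_and_vars)
```

With such a base class, wrappers like `HorovodOptimizer` above would only need to override the single hook they care about.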