Conversation
This change is described in the RFC: tensorflow/community#293. If an alternative approach ends up being used in the RFC, I will revert this change and implement the other approach. PiperOrigin-RevId: 335074869 Change-Id: I07aca34b8a475500107944498b4769d57cbd1bac
This has no functional effect except that the string 'USE_DEFAULT' can no longer be passed to the Policy constructor, but I don't think anyone does that. The default value is changed from 'USE_DEFAULT' to 'auto', but the default value has the same effect as before. This change is described in the RFC: tensorflow/community#293. If an alternative approach ends up being used in the RFC, I will revert this change and implement the other approach. PiperOrigin-RevId: 336012719 Change-Id: I339a29305276bf1e52af555df9e84090a96db6b8
@reedwm Thank you very much for the in-depth writeup. I made a few minor comments below; I am curious about what you think.
* `"mixed_float16"`: The compute dtype is float16. The variable dtype is float32. The default loss scale is "dynamic". | ||
* `"mixed_bfloat16"`: The compute dtype is bfloat16. The variable dtype is float32. There is no default loss scale, as loss scaling is only useful when float16 is used. | ||
|
||
Unlike most TensorFlow functions with a `name` argument, the Policy `name` argument has a semantic impact on the TensorFlow program, and is not just used to uniquely identify an op or layer. The word "name" is chosen for lack of a better word to call the argument. |
How do you feel about calling it `dtype` instead of `name` to avoid confusion? `"mixed_float16"` and `"mixed_bfloat16"` could be interpreted as string representations that are convertible to a custom DType. `policy_name` would be an alternative option, although a bit too verbose for my taste.
The issue is that the argument is also exposed as a property, so if it were called `dtype` instead of `name`, Policy would have a `dtype` field, a `compute_dtype` field, and a `variable_dtype` field. It would be nonobvious that the `dtype` field was mostly unused and only used to determine the `compute_dtype` and `variable_dtype` fields.
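For illustration, a minimal sketch of that relationship, assuming the `tf.keras.mixed_precision.Policy` API proposed here:

```python
import tensorflow as tf

# The policy name determines both derived dtypes; it is not an op/layer name.
policy = tf.keras.mixed_precision.Policy("mixed_float16")
print(policy.name)            # "mixed_float16"
print(policy.compute_dtype)   # "float16"  - dtype of layer computations
print(policy.variable_dtype)  # "float32"  - dtype of layer variables
```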
Thanks for the explanation, that makes sense 👍
```python
# Use mixed precision, for Volta+ GPUs
tf.keras.mixed_precision.set_global_policy("mixed_float16")

# Use mixed precision, for Cooper Lake CPUs, Ampere GPUs, or Google TPUs
tf.keras.mixed_precision.set_global_policy("mixed_bfloat16")
```
The deprecated mixed precision graph rewrite opted out of using mixed precision on non-optimal hardware.
I expect users will now need to handle environment detection manually or write custom code to detect the current hardware? I think this might lead to confusion when running in heterogeneous computing environments, since it requires users to be aware of which policy is optimal on which hardware platform.
How do you feel about adding an `auto` policy (alternatively called `optimal` or `mixed`) that would autoselect between either `float32`, `mixed_float16`, or `mixed_bfloat16` depending on the hardware support? This would mean that there are no code changes required when switching between CPU-only, Volta GPU, or Ampere GPU systems.
I strongly considered this earlier. My main concern was that I believed users should be aware of whether they were using float16 or bfloat16. Users need to explicitly use a LossScaleOptimizer if using a custom training loop with float16 (when using `Model.fit`, a LossScaleOptimizer will automatically be used, but otherwise the user must explicitly wrap their optimizer). Additionally, the SavedModel and checkpoint format will save the loss scale if mixed_float16 is used, and I don't think the SavedModel/checkpoint format should depend solely on the device that is used.
I added a paragraph summarizing why we don't do this.
Also, the batch size you should run at depends on the device as well, as different GPUs have different amounts of memory, so users must already be somewhat aware of their devices.
In practice, I expect most models supporting multiple devices to have a flag to choose between float32, mixed_float16, and mixed_bfloat16, and that flag is directly passed to `set_global_policy`. Within Google, we do this but also have a per-device, per-model config file specifying which flags to pass, so running the GPU config file causes mixed_float16 to be passed to the flag and running the TPU config file causes mixed_bfloat16 to be passed.
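As a rough sketch of that flag-based pattern (the flag name and parsing below are illustrative, not part of the RFC):

```python
import argparse

import tensorflow as tf

# Hypothetical flag; a per-device config file would set it appropriately.
parser = argparse.ArgumentParser()
parser.add_argument("--dtype_policy", default="float32",
                    choices=["float32", "mixed_float16", "mixed_bfloat16"])
args = parser.parse_args()

# The chosen policy is passed straight through to set_global_policy.
tf.keras.mixed_precision.set_global_policy(args.dtype_policy)
```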
I see your point that users might need to customize other training parameters depending on the device anyway, which is indeed very common.
I think the main reason why I was missing this autoselection is that I often end up having model configs which enable mixed precision by default. However, sometimes it can be useful to run a few steps locally on a CPU-only machine for debugging, which now requires users to remember to disable mixed precision, as it would lead to greatly decreased performance on CPU (at least that was the case a few months ago when I last tested this API).
I don't think this is a huge issue, since one can easily check for the existence of a GPU in user code and disable mixed precision if needed, but if this becomes a common source of confusion it might be interesting to rethink this behaviour.
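A minimal sketch of that kind of user-side check (ordinary user code, not an API from the RFC; note it does not inspect GPU compute capability, so it may still enable float16 on pre-Volta GPUs):

```python
import tensorflow as tf

# Enable mixed precision only when a GPU is present; fall back to float32
# on CPU-only machines, e.g. when debugging locally.
if tf.config.list_physical_devices("GPU"):
    tf.keras.mixed_precision.set_global_policy("mixed_float16")
else:
    tf.keras.mixed_precision.set_global_policy("float32")
```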
1. Multiply the loss by the loss scale.
2. Divide the gradients by the loss scale.
3. For a DynamicLossScale, update the loss scale. This means increasing or decreasing the loss scale and updating `num_good_steps` in accordance with the dynamic loss scaling algorithm.
4. For a DynamicLossScale, skip applying gradients if they are not finite. Gradients are not finite if they have an Inf, -Inf, or NaN value. If any gradient of any replica has a nonfinite value, all gradients across all replicas are skipped for that step. For a FixedLossScale, gradients are unconditionally applied, just like when loss scaling is not used.
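To illustrate how the first two steps surface in a custom training loop, here is a sketch assuming the LossScaleOptimizer API; `model`, `loss_fn`, `x`, and `y` are placeholders:

```python
import tensorflow as tf

opt = tf.keras.mixed_precision.LossScaleOptimizer(tf.keras.optimizers.SGD())

@tf.function
def train_step(x, y):
    with tf.GradientTape() as tape:
        loss = loss_fn(y, model(x, training=True))
        scaled_loss = opt.get_scaled_loss(loss)       # 1. multiply loss by the loss scale
    scaled_grads = tape.gradient(scaled_loss, model.trainable_variables)
    grads = opt.get_unscaled_gradients(scaled_grads)  # 2. divide gradients by the loss scale
    # 3 and 4 happen inside apply_gradients for a dynamic loss scale: it
    # updates the loss scale and skips the step if any gradient is nonfinite.
    opt.apply_gradients(zip(grads, model.trainable_variables))
    return loss
```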
What is the benefit of skipping update steps instead of recomputing the gradients with a scaled-down loss until the gradients become finite (or until a fixed number of retries is reached)? Intuitively this would enable slightly higher performance as the forward pass could be reused, or am I missing something?
This is a good idea, but the issue is that we cannot run the loop if the user calls `apply_gradients` instead of `minimize`, because `apply_gradients` does not compute the gradients. For example:
```python
vars = ...  # the model's trainable variables
opt = tf.keras.mixed_precision.LossScaleOptimizer(...)
with tf.GradientTape() as tape:
    loss = get_model_loss()
# The gradients are computed by the user, outside the optimizer.
grads = tape.gradient(loss, vars)
opt.apply_gradients(zip(grads, vars))
```
Since the gradients are computed outside the LossScaleOptimizer, LossScaleOptimizer has no way of repeatedly recomputing the gradients.
However, we can do this in `minimize`. I added a paragraph suggesting we should add an option to do this in the future (starting with "Instead of skipping steps when there are NaNs"). The reason this won't be done initially is that `minimize` is rarely used, but I anticipate it will be used more frequently due to #234 allowing users to pass a tensor to `minimize` once TF 2.4 is released.
I think the performance gain will be negligible, due to steps only being skipped 1/2000 of the time on average, but the behavior will be more intuitive. The skipping behavior is confusing because for the first few steps, the variables are not actually updated since the loss scale starts out so high.
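For reference, a sketch of the `minimize` path under discussion, assuming the TF 2.4 behavior where a loss tensor can be passed together with a `tape` argument (`model`, `loss_fn`, `x`, and `y` are placeholders):

```python
import tensorflow as tf

opt = tf.keras.mixed_precision.LossScaleOptimizer(tf.keras.optimizers.SGD())

with tf.GradientTape() as tape:
    loss = loss_fn(y, model(x, training=True))
# minimize both computes and applies the gradients, so it could in principle
# retry with a smaller loss scale instead of skipping the step.
opt.minimize(loss, var_list=model.trainable_variables, tape=tape)
```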
Yes, sorry, I should've been clearer. I was mainly suggesting this for use in `keras.Model.fit()`, which indeed calls `.minimize()`. But since the skipping behaviour still needs to be supported for custom training loops anyway, I agree it makes sense to keep it for now.
# Alternatives Considered

## Op-based autocasting API
Op-based autocasting with down-graph type inheritance is important for providing robust accuracy in mixed precision training. This is particularly true in experimental and research applications. For example, early implementations of batch- and layer-norm used a sequence of low-level operations such as the following.
`bn(x) = gamma * (x - mean(x)) / sqrt(var(x) + epsilon) + beta`
Without computing the mean and variance ops in fp32, training will either diverge or achieve reduced accuracy for many important networks such as InceptionV3, ResNet50, and Xception.
Of course, once the utility of an experimental layer is proven, it may be replaced with a “fused” custom kernel that provides optimized performance and handles intermediate precisions internally. However, relatively few layers will merit this effort. Novel model architectures are an important area of innovation, and mixed precision needs to be robust so that researchers and early adopters can benefit from the full performance of modern architectures.
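To make the concern concrete, here is a sketch (not code from the RFC) of such a hand-written normalization that keeps its statistics in float32 even when its inputs are float16:

```python
import tensorflow as tf

def simple_bn(x, gamma, beta, epsilon=1e-3):
    # Compute the mean and variance in float32 to avoid overflow and
    # precision loss when x is float16.
    x32 = tf.cast(x, tf.float32)
    mean = tf.reduce_mean(x32, axis=0)
    var = tf.math.reduce_variance(x32, axis=0)
    y = gamma * (x32 - mean) / tf.sqrt(var + epsilon) + beta
    # Cast back to the layer's compute dtype (e.g. float16).
    return tf.cast(y, x.dtype)
```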
LossScale and its subclasses are deprecated and will be removed from the TF 2 namespace in TensorFlow 2.5. It will still be accessible under the tf.compat.v1 namespace, and this change makes it non-experimental under the tf.compat.v1 namespace, exporting it as `tf.compat.v1.mixed_precision.LossScale`. LossScale cannot be removed from the tf.compat.v1 namespace since it is used by the V1-only class tf.compat.v1.train.experimental.MixedPrecisionLossScaleOptimizer.

LossScaleOptimizer previously used a LossScale, but now it directly performs loss scaling within the class itself. Additionally, a new non-experimental `tf.keras.mixed_precision.LossScaleOptimizer` has been introduced. Unlike the experimental LossScaleOptimizer, the non-experimental LossScaleOptimizer does not accept a LossScale but instead has different constructor arguments to specify the type of loss scaling to be done. The old experimental LossScaleOptimizer will be removed in TensorFlow 2.5, at which point a LossScale cannot be used with any Keras LossScaleOptimizer. Internally, LossScaleOptimizer uses a fork of DynamicLossScale called _DynamicLossScaleState, but this is not exposed to the user. In the future, _DynamicLossScaleState will be merged into LossScaleOptimizer.

LossScaleOptimizer now exposes some attributes that DynamicLossScale previously did. "increment_period" is renamed to "dynamic_growth_steps" for consistency with `ExponentialDecay.decay_steps`. `num_good_steps` is replaced by `dynamic_counter`. LossScaleOptimizer.loss_scale is now a tensor, not a LossScale. This means the previous way of getting the loss scale as a tensor (calling `optimizer.loss_scale()`) will raise an error instead. I don't know of any users who do this, so I do not anticipate any breakages.

Policy previously had an instance of a LossScale, and optionally took a LossScale in the constructor. By default, the "mixed_float16" policy had a DynamicLossScale, while all other policies had no loss scale. Now, Policy no longer has a loss scale or takes an instance of a loss scale. To temporarily preserve backwards compatibility with the old API, the symbol `tf.keras.mixed_precision.experimental.Policy` still takes and holds a LossScale, as it did before. A new non-experimental symbol, `tf.keras.mixed_precision.Policy`, removes the use of the LossScale. The old experimental symbol will be removed in the future.

When deserializing a layer or model with an old experimental policy, it will be restored as the new policy and the loss scale will be silently dropped. This is to preserve SavedModel compatibility with models saved in TensorFlow 2.3 and restored in future versions of TensorFlow once the old experimental Policy is removed. Luckily, dropping the loss scale is unlikely to break anyone, as a bug in the mixed precision API causes models to not save their dtype policies at all when being serialized. Similarly, when deserializing a model with the old experimental LossScaleOptimizer, it will be restored as the new LossScaleOptimizer, but unlike the policy case, nothing is silently dropped.

This change is different from what is described in the mixed precision RFC (tensorflow/community#293), but I think this API is a lot clearer and simpler than the API in the RFC. The RFC forked the LossScale classes into Keras, but I now think it's better to simply not use them and make LossScale exposed under tf.compat.v1 only. This new API was designed based on feedback from @fchollet and @omalleyt12. I will retroactively update the RFC to reflect this API.
PiperOrigin-RevId: 337938270 Change-Id: Id7bb3bb89eb2143e5fadabeb2f57d1f8267379b3
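A sketch of the new constructor arguments and attributes described above (the values shown match the documented defaults and are included only for illustration):

```python
import tensorflow as tf

inner = tf.keras.optimizers.SGD()
# dynamic=True enables dynamic loss scaling; initial_scale and
# dynamic_growth_steps replace the old DynamicLossScale arguments.
opt = tf.keras.mixed_precision.LossScaleOptimizer(
    inner, dynamic=True, initial_scale=2 ** 15, dynamic_growth_steps=2000)

print(opt.loss_scale)            # now a tensor, not a LossScale object
print(opt.dynamic_growth_steps)  # renamed from "increment_period"
print(opt.dynamic_counter)       # replaces "num_good_steps"
```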
The RFC does have this property (tensorflow/community#293), but I don't think it is very useful, and there are no uses of it within Google outside Keras, so it should be removed. PiperOrigin-RevId: 337950640 Change-Id: I64c27589e87e4bf8f3f9c7fe38150703d914e804
Additionally, the following attributes are added to Layer: `dtype_policy`, `compute_dtype`, `variable_dtype`. The `inner_optimizer` attribute is added to LossScaleOptimizer. This change follows the mixed precision RFC: tensorflow/community#293. I'll move the mixed_precision folder out of the experimental folder in a subsequent change. That change will have no functional impact. I also removed the "About the layer's `dtype` attribute" section from the base Layer docstring since it didn't properly describe mixed precision. I added some of the information to the Arguments section, which links to the Policy docstring for a complete description of layer dtypes. In a future change, I'll add a paragraph which better describes how layers use dtypes. PiperOrigin-RevId: 337968442 Change-Id: I2738862faaabec14fe6675ea9f34075a5e56426a
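A brief sketch of the newly exposed attributes, assuming the global policy has been set to "mixed_float16":

```python
import tensorflow as tf

tf.keras.mixed_precision.set_global_policy("mixed_float16")

layer = tf.keras.layers.Dense(8)
print(layer.dtype_policy)    # the "mixed_float16" Policy object
print(layer.compute_dtype)   # "float16"
print(layer.variable_dtype)  # "float32"

opt = tf.keras.mixed_precision.LossScaleOptimizer(tf.keras.optimizers.SGD())
print(opt.inner_optimizer)   # the wrapped SGD optimizer
```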
We did the internal design review for this API today. Since there were no outstanding issues, it was very quick. Notes:
Status: Design approved after the minor changes described during the meeting.
This RFC will be open for comment until October 14, 2020.
# Objective
Make mixed precision easy to use in Keras.