Conversation
This change is described in the RFC: tensorflow/community#293. If an alternative approach ends up being used in the RFC, I will revert this change and implement the other approach. PiperOrigin-RevId: 335074869 Change-Id: I07aca34b8a475500107944498b4769d57cbd1bac
This has no functional effect except that the string 'USE_DEFAULT' can no longer be passed to the Policy constructor, but I don't think anyone does that. The default value is changed from 'USE_DEFAULT' to 'auto', but the default value has the same effect as before. This change is described in the RFC: tensorflow/community#293. If an alternative approach ends up being used in the RFC, I will revert this change and implement the other approach. PiperOrigin-RevId: 336012719 Change-Id: I339a29305276bf1e52af555df9e84090a96db6b8
@reedwm Thank you very much for the in-depth writeup. I made a few minor comments below; I am curious about what you think.
* `"mixed_float16"`: The compute dtype is float16. The variable dtype is float32. The default loss scale is "dynamic". | ||
* `"mixed_bfloat16"`: The compute dtype is bfloat16. The variable dtype is float32. There is no default loss scale, as loss scaling is only useful when float16 is used. | ||
|
||
Unlike most TensorFlow functions with a `name` argument, the Policy `name` argument has a semantic impact on the TensorFlow program, and is not just used to uniquely identify an op or layer. The word "name" is chosen for lack of a better word to call the argument. |
How do you feel about calling it `dtype` instead of `name` to avoid confusion? `"mixed_float16"` and `"mixed_bfloat16"` could be interpreted as string representations that are convertible to a custom DType. `policy_name` would be an alternative option, although a bit too verbose for my taste.
The issue is that the argument is also exposed as a property, so if it were called `dtype` instead of `name`, Policy would have a `dtype` field, a `compute_dtype` field, and a `variable_dtype` field. It would be nonobvious that the `dtype` field was mostly unused and only used to determine the `compute_dtype` and `variable_dtype` fields.
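For illustration, a minimal sketch of that relationship, assuming the `tf.keras.mixed_precision.Policy` API proposed here:

```python
import tensorflow as tf

# The policy name determines both derived dtypes; it is not an op/layer name.
policy = tf.keras.mixed_precision.Policy("mixed_float16")
print(policy.name)            # "mixed_float16"
print(policy.compute_dtype)   # "float16"  - dtype of layer computations
print(policy.variable_dtype)  # "float32"  - dtype of layer variables
```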
Thanks for the explanation, that makes sense 👍
```python
# Use mixed precision, for Volta+ GPUs
tf.keras.mixed_precision.set_global_policy("mixed_float16")

# Use mixed precision, for Cooper Lake CPUs, Ampere GPUs, or Google TPUs
tf.keras.mixed_precision.set_global_policy("mixed_bfloat16")
```
The deprecated mixed precision graph rewrite opted out of using mixed precision on non-optimal hardware.
I expect users will now need to handle environment detection manually or write custom code to detect the current hardware? I think this might lead to confusion when running in heterogeneous computing environments, since it requires users to be aware of which policy is optimal on which hardware platform.
How do you feel about adding an `auto` policy (alternatively called `optimal` or `mixed`) that would autoselect between either `float32`, `mixed_float16`, or `mixed_bfloat16` depending on the hardware support? This would mean that there are no code changes required when switching between CPU-only, Volta GPU, or Ampere GPU systems.
I strongly considered this earlier. My main concern was that I believed users should be aware of whether they were using float16 or bfloat16. Users need to explicitly use a LossScaleOptimizer if using a custom training loop with float16 (when using `Model.fit`, a LossScaleOptimizer will automatically be used, but otherwise the user must explicitly wrap their optimizer). Additionally, the SavedModel and checkpoint format will save the loss scale if mixed_float16 is used, and I don't think the SavedModel/checkpoint format should depend solely on the device that is used.
I added a paragraph summarizing why we don't do this.
Also, the batch size you should run at depends on the device as well, as different GPUs have different amounts of memory, so users must already be somewhat aware of their devices.
In practice, I expect most models supporting multiple devices to have a flag to choose between float32, mixed_float16, and mixed_bfloat16, and that flag is directly passed to `set_global_policy`. Within Google, we do this but also have a per-device, per-model config file specifying which flags to pass, so running the GPU config file causes mixed_float16 to be passed to the flag and running the TPU config file causes mixed_bfloat16 to be passed.
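As a rough sketch of that flag-based pattern (the flag name and parsing below are illustrative, not part of the RFC):

```python
import argparse

import tensorflow as tf

# Hypothetical flag; a per-device config file would set it appropriately.
parser = argparse.ArgumentParser()
parser.add_argument("--dtype_policy", default="float32",
                    choices=["float32", "mixed_float16", "mixed_bfloat16"])
args = parser.parse_args()

# The chosen policy is passed straight through to set_global_policy.
tf.keras.mixed_precision.set_global_policy(args.dtype_policy)
```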
I see your point that users might need to customize other training parameters depending on the device anyway, which is indeed very common.
I think the main reason why I was missing this autoselection is that I often end up having model configs which enable mixed precision by default. However, sometimes it can be useful to run a few steps locally on a CPU-only machine for debugging, which now requires users to remember to disable mixed precision, as it would lead to greatly decreased performance on CPU (at least that was the case a few months ago when I last tested this API).
I don't think this is a huge issue, since one can easily check for the existence of a GPU in user code and disable mixed precision if needed, but if this becomes a common source of confusion it might be interesting to rethink this behaviour.
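A minimal sketch of that kind of user-side check (ordinary user code, not an API from the RFC; note it does not inspect GPU compute capability, so it may still enable float16 on pre-Volta GPUs):

```python
import tensorflow as tf

# Enable mixed precision only when a GPU is present; fall back to float32
# on CPU-only machines, e.g. when debugging locally.
if tf.config.list_physical_devices("GPU"):
    tf.keras.mixed_precision.set_global_policy("mixed_float16")
else:
    tf.keras.mixed_precision.set_global_policy("float32")
```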
1. Multiply the loss by the loss scale.
2. Divide the gradients by the loss scale.
3. For a DynamicLossScale, update the loss scale. This means increasing or decreasing the loss scale and updating `num_good_steps` in accordance with the dynamic loss scaling algorithm.
4. For a DynamicLossScale, skip applying gradients if they are not finite. Gradients are not finite if they have an Inf, -Inf, or NaN value. If any gradient of any replica has a nonfinite value, all gradients across all replicas are skipped for that step. For a FixedLossScale, gradients are unconditionally applied, just like when loss scaling is not used.
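To illustrate how the first two steps surface in a custom training loop, here is a sketch assuming the LossScaleOptimizer API; `model`, `loss_fn`, `x`, and `y` are placeholders:

```python
import tensorflow as tf

opt = tf.keras.mixed_precision.LossScaleOptimizer(tf.keras.optimizers.SGD())

@tf.function
def train_step(x, y):
    with tf.GradientTape() as tape:
        loss = loss_fn(y, model(x, training=True))
        scaled_loss = opt.get_scaled_loss(loss)       # 1. multiply loss by the loss scale
    scaled_grads = tape.gradient(scaled_loss, model.trainable_variables)
    grads = opt.get_unscaled_gradients(scaled_grads)  # 2. divide gradients by the loss scale
    # 3 and 4 happen inside apply_gradients for a dynamic loss scale: it
    # updates the loss scale and skips the step if any gradient is nonfinite.
    opt.apply_gradients(zip(grads, model.trainable_variables))
    return loss
```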
What is the benefit of skipping update steps instead of recomputing the gradients with a scaled-down loss until the gradients become finite (or until a fixed number of retries is reached)? Intuitively this would enable slightly higher performance as the forward pass could be reused, or am I missing something?
This is a good idea, but the issue is that we cannot run the loop if the user calls `apply_gradients` instead of `minimize`, because `apply_gradients` does not compute the gradients. For example:
```python
vars = ...  # the model's trainable variables
opt = tf.keras.mixed_precision.LossScaleOptimizer(...)
with tf.GradientTape() as tape:
    loss = get_model_loss()
# The gradients are computed by the user, outside the optimizer.
grads = tape.gradient(loss, vars)
opt.apply_gradients(zip(grads, vars))
```
Since the gradients are computed outside the LossScaleOptimizer, LossScaleOptimizer has no way of repeatedly recomputing the gradients.
However, we can do this in `minimize`. I added a paragraph suggesting we should add an option to do this in the future (starting with "Instead of skipping steps when there are NaNs"). The reason this won't be done initially is that `minimize` is rarely used, but I anticipate it will be used more frequently due to #234 allowing users to pass a tensor to `minimize` once TF 2.4 is released.
I think the performance gain will be negligible, due to steps only being skipped 1/2000 of the time on average, but the behavior will be more intuitive. The skipping behavior is confusing because for the first few steps, the variables are not actually updated since the loss scale starts out so high.
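For reference, a sketch of the `minimize` path under discussion, assuming the TF 2.4 behavior where a loss tensor can be passed together with a `tape` argument (`model`, `loss_fn`, `x`, and `y` are placeholders):

```python
import tensorflow as tf

opt = tf.keras.mixed_precision.LossScaleOptimizer(tf.keras.optimizers.SGD())

with tf.GradientTape() as tape:
    loss = loss_fn(y, model(x, training=True))
# minimize both computes and applies the gradients, so it could in principle
# retry with a smaller loss scale instead of skipping the step.
opt.minimize(loss, var_list=model.trainable_variables, tape=tape)
```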
Yes, sorry, I should've been clearer. I was mainly suggesting this for use in `keras.Model.fit()`, which indeed calls `.minimize()`. But since the skipping behaviour still needs to be supported for custom training loops anyway, I agree it makes sense to keep it for now.
# Alternatives Considered

## Op-based autocasting API
Op-based autocasting with down-graph type inheritance is important for providing robust accuracy in mixed precision training. This is particularly true in experimental and research applications. For example, early implementations of batch- and layer-norm used a sequence of low-level operations such as the following.
`bn(x) = gamma * (x - mean(x)) / sqrt(var(x) + epsilon) + beta`
Without computing the mean and variance ops in fp32, training will either diverge or achieve reduced accuracy for many important networks such as InceptionV3, ResNet50, and Xception.
Of course, once the utility of an experimental layer is proven, it may be replaced with a “fused” custom kernel that provides optimized performance and handles intermediate precisions internally. However, relatively few layers will merit this effort. Novel model architectures are an important area of innovation, and mixed precision needs to be robust so that researchers and early adopters can benefit from the full performance of modern architectures.
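To make the concern concrete, here is a sketch (not code from the RFC) of such a hand-written normalization that keeps its statistics in float32 even when its inputs are float16:

```python
import tensorflow as tf

def simple_bn(x, gamma, beta, epsilon=1e-3):
    # Compute the mean and variance in float32 to avoid overflow and
    # precision loss when x is float16.
    x32 = tf.cast(x, tf.float32)
    mean = tf.reduce_mean(x32, axis=0)
    var = tf.math.reduce_variance(x32, axis=0)
    y = gamma * (x32 - mean) / tf.sqrt(var + epsilon) + beta
    # Cast back to the layer's compute dtype (e.g. float16).
    return tf.cast(y, x.dtype)
```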
LossScale and its subclasses are deprecated and will be removed from the TF 2 namespace in TensorFlow 2.5. It will still be accessible under the tf.compat.v1 namespace, and this change makes it non-experimental under the tf.compat.v1 namespace, exporting it as `tf.compat.v1.mixed_precision.LossScale`. LossScale cannot be removed from the tf.compat.v1 namespace since it is used by the V1-only class tf.compat.v1.train.experimental.MixedPrecisionLossScaleOptimizer.

LossScaleOptimizer previously used a LossScale, but now it directly performs loss scaling within the class itself. Additionally, a new non-experimental `tf.keras.mixed_precision.LossScaleOptimizer` has been introduced. Unlike the experimental LossScaleOptimizer, the non-experimental LossScaleOptimizer does not accept a LossScale but instead has different constructor arguments to specify the type of loss scaling to be done. The old experimental LossScaleOptimizer will be removed in TensorFlow 2.5, at which point a LossScale cannot be used with any Keras LossScaleOptimizer. Internally, LossScaleOptimizer uses a fork of DynamicLossScale called _DynamicLossScaleState, but this is not exposed to the user. In the future, _DynamicLossScaleState will be merged into LossScaleOptimizer.

LossScaleOptimizer now exposes some attributes that DynamicLossScale previously did. "increment_period" is renamed to "dynamic_growth_steps" for consistency with `ExponentialDecay.decay_steps`. `num_good_steps` is replaced by `dynamic_counter`. LossScaleOptimizer.loss_scale is now a tensor, not a LossScale. This means the previous way of getting the loss scale as a tensor (calling `optimizer.loss_scale()`) will raise an error instead. I don't know of any users who do this, so I do not anticipate any breakages.

Policy previously had an instance of a LossScale, and optionally took a LossScale in the constructor. By default, the "mixed_float16" policy had a DynamicLossScale, while all other policies had no loss scale. Now, Policy no longer has a loss scale or takes an instance of a loss scale. To temporarily preserve backwards compatibility with the old API, the symbol `tf.keras.mixed_precision.experimental.Policy` still takes and holds a LossScale, as it did before. A new non-experimental symbol, `tf.keras.mixed_precision.Policy`, removes the use of the LossScale. The old experimental symbol will be removed in the future.

When deserializing a layer or model with an old experimental policy, it will be restored as the new policy and the loss scale will be silently dropped. This is to preserve SavedModel compatibility with models saved in TensorFlow 2.3 and restored in future versions of TensorFlow once the old experimental Policy is removed. Luckily, dropping the loss scale is unlikely to break anyone, as a bug in the mixed precision API causes models to not save their dtype policies at all when being serialized. Similarly, when deserializing a model with the old experimental LossScaleOptimizer, it will be restored as the new LossScaleOptimizer, but unlike the policy case, nothing is silently dropped.

This change is different from what is described in the mixed precision RFC (tensorflow/community#293), but I think this API is a lot clearer and simpler than the API in the RFC. The RFC forked the LossScale classes into Keras, but I now think it's better to simply not use them and make LossScale exposed under tf.compat.v1 only. This new API was designed based on feedback from @fchollet and @omalleyt12. I will retroactively update the RFC to reflect this API.
PiperOrigin-RevId: 337938270 Change-Id: Id7bb3bb89eb2143e5fadabeb2f57d1f8267379b3
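A sketch of the new constructor arguments and attributes described above (the values shown match the documented defaults and are included only for illustration):

```python
import tensorflow as tf

inner = tf.keras.optimizers.SGD()
# dynamic=True enables dynamic loss scaling; initial_scale and
# dynamic_growth_steps replace the old DynamicLossScale arguments.
opt = tf.keras.mixed_precision.LossScaleOptimizer(
    inner, dynamic=True, initial_scale=2 ** 15, dynamic_growth_steps=2000)

print(opt.loss_scale)            # now a tensor, not a LossScale object
print(opt.dynamic_growth_steps)  # renamed from "increment_period"
print(opt.dynamic_counter)       # replaces "num_good_steps"
```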
The RFC does have this property (tensorflow/community#293), but I don't think it is very useful, and there are no uses of it within Google outside Keras, so it should be removed. PiperOrigin-RevId: 337950640 Change-Id: I64c27589e87e4bf8f3f9c7fe38150703d914e804
Additionally, the following attributes are added to Layer: `dtype_policy`, `compute_dtype`, `variable_dtype`. The `inner_optimizer` attribute is added to LossScaleOptimizer. This change follows the mixed precision RFC: tensorflow/community#293. I'll move the mixed_precision folder out of the experimental folder in a subsequent change. That change will have no functional impact. I also removed the "About the layer's `dtype` attribute" section from the base Layer docstring since it didn't properly describe mixed precision. I added some of the information to the Arguments section, which links to the Policy docstring for a complete description of layer dtypes. In a future change, I'll add a paragraph which better describes how layers use dtypes. PiperOrigin-RevId: 337968442 Change-Id: I2738862faaabec14fe6675ea9f34075a5e56426a
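A brief sketch of the newly exposed attributes, assuming the global policy has been set to "mixed_float16":

```python
import tensorflow as tf

tf.keras.mixed_precision.set_global_policy("mixed_float16")

layer = tf.keras.layers.Dense(8)
print(layer.dtype_policy)    # the "mixed_float16" Policy object
print(layer.compute_dtype)   # "float16"
print(layer.variable_dtype)  # "float32"

opt = tf.keras.mixed_precision.LossScaleOptimizer(tf.keras.optimizers.SGD())
print(opt.inner_optimizer)   # the wrapped SGD optimizer
```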
We did the internal design review for this API today. Since there were no outstanding issues, it was very quick. Notes:
Status: Design approved after the minor changes described during the meeting.
This RFC will be open for comment until October 14, 2020.
# Objective
Make mixed precision easy to use in Keras.