Conversation

@wzzju (Contributor) commented Dec 10, 2020

PR types

New features

PR changes

Others

Describe

1. Background

In the previous AMP implementation, a black & white list is used to control float16 computation. However, this strategy has two shortcomings:

  • Inserting many cast ops introduces overhead, which can be 5% ~ 10%.
  • It is too conservative to get the full speedup from float16 computation; some models could run more kernels in float16.

Therefore, we develop a pure fp16 training strategy, which uses float16 kernels as much as possible.

2. API Function

# 1) The entry of Paddle AMP.
def decorate(optimizer,
             amp_lists=None,
             init_loss_scaling=2**15,
             incr_every_n_steps=1000,
             decr_every_n_nan_or_inf=2,
             incr_ratio=2.0,
             decr_ratio=0.8,
             use_dynamic_loss_scaling=True,
             use_pure_fp16=False, # new parameter
             use_fp16_guard=None) # new parameter

# 2) The context manager of pure fp16 training.
def fp16_guard()

As shown above, to integrate pure fp16 training into the decorate API, we add two new parameters.

2.1 Description of use_pure_fp16 parameter

When use_pure_fp16 is set to True, AMP uses float16 kernels as much as possible. Otherwise, it adopts the black & white list based strategy.
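
For example, a minimal sketch of the switch (assuming an optimizer has already been created, as in the full use case in section 3):

# Pure fp16: use float16 kernels wherever possible.
optimizer = paddle.static.amp.decorate(optimizer, use_pure_fp16=True)
# With use_pure_fp16=False (the default), the black & white list strategy is kept.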

2.2 Description of use_fp16_guard parameter and fp16_guard API

The second new parameter, use_fp16_guard, controls which part of the model is computed in float16. When use_fp16_guard is set to False, all operators in the user-defined model are transformed to float16 except those in unsupported_fp16_list. When use_fp16_guard is set to True, only the ops created inside the fp16_guard context manager are transformed to float16. By default, use_fp16_guard is None, which means its value follows use_pure_fp16.
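
As a rough sketch (the layers here are only illustrative), with use_fp16_guard set to True only the ops created inside fp16_guard are transformed:

# Ops created outside fp16_guard are kept in float32 when use_fp16_guard=True.
conv = paddle.static.nn.conv2d(input=data, num_filters=6, filter_size=3)  # stays float32
with paddle.static.amp.fp16_guard():
    # Ops created here are transformed to float16 by the AMP pass.
    bn = paddle.static.nn.batch_norm(input=conv, act="relu")
    hidden = paddle.static.nn.fc(bn, size=64)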

2.3 Details about custom_black_list

Moreover, if users do not want certain op types to be transformed to float16, they can list them in custom_black_list. Ops in custom_black_list are kept in the float32 computation type regardless of whether use_fp16_guard is used.
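
For instance (a sketch; 'pool2d' is just an example op type, and the same list is used in the use case in section 3):

# Ops of type 'pool2d' are kept in float32, even inside fp16_guard.
amp_list = paddle.static.amp.AutoMixedPrecisionLists(
    custom_black_list=['pool2d'])
optimizer = paddle.static.amp.decorate(
    optimizer, amp_lists=amp_list, use_pure_fp16=True)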

2.4 Description of amp_init API

When users choose pure fp16 training, they should call the amp_init API to initialize the float16 parameters, as shown below.

# 3) `amp_init` is required by pure fp16 training to initialize float16 parameters.
def amp_init(self,
             place,
             scope=None,
             test_program=None,
             use_fp16_test=False)

Parameters defined in API 3) are described below:

  • The place is used to initialize the fp16 parameters with their fp32 values.
  • The scope is used to find the fp32 parameters.
  • The test_program is the program used for testing.
  • The use_fp16_test indicates whether to use fp16 testing.

Previously, the black & white list based strategy only transformed the training program, not the testing program. Pure fp16 training, however, also needs to transform the testing program, because no float32 parameters remain for the training and testing process. Therefore, if users have adopted pure fp16 training and want to perform testing, the testing program should be passed into amp_init.

The use_fp16_test parameter mainly controls whether the testing program is transformed to float16 under the black & white list based AMP strategy; it has no effect on pure fp16 training. In other words, if users choose pure fp16 training and pass test_program into the amp_init API, test_program will be transformed to float16 regardless of the use_fp16_test value.
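
Putting this together, a minimal sketch of the initialization order (assuming the decorated optimizer, startup_program, and test_program from the use case in section 3):

place = paddle.CUDAPlace(0)
exe = paddle.static.Executor(place)
exe.run(startup_program)  # initialize the fp32 parameters first
# Cast the initialized parameters to fp16; `test_program` is also transformed,
# because pure fp16 training leaves no fp32 parameters for the testing process.
optimizer.amp_init(place, test_program=test_program)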

2.5 Low-level APIs

# 4) cast the model to fp16
def cast_model_to_fp16(program, amp_lists=None, use_fp16_guard=True)

# 5) cast model parameters to fp16
def cast_parameters_to_fp16(place, program, scope=None, to_fp16_var_names=None)

cast_model_to_fp16 and cast_parameters_to_fp16 are two low-level APIs. In most cases, users do not need them and can simply use the decorate API.

  • cast_model_to_fp16
    The parameter program is the program to be cast to fp16. The meanings of amp_lists and use_fp16_guard are the same as in decorate. Users may need this API in the following special case: they have completed pure fp16 training with the decorate API, but instead of using save_inference_model and load_inference_model for inference, they define a new inference program and load the pre-trained weights into it. In this case, they should cast the defined inference program to fp16 with the cast_model_to_fp16 API and ensure that the values of amp_lists and use_fp16_guard are the same as in the previous pure fp16 training. If use_fp16_guard is set to True, they should also apply fp16_guard in the same places as in the previous pure fp16 training when building the inference program. A sketch of this workflow is given after this list.

  • cast_parameters_to_fp16
    The parameter program is the model to be processed. The place is used to store the fp16 weight tensors, and the scope is used to get the fp32 weight tensor values. Only the data types of the vars listed in to_fp16_var_names will be set to FP16; usually, to_fp16_var_names is the return value of the cast_model_to_fp16 API.
    For now, cast_parameters_to_fp16 has no concrete use case; we set it aside for future special use (its appearance in the sketch after this list is only illustrative).
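
Below is a hedged sketch of the inference workflow described above. It assumes that cast_model_to_fp16 and cast_parameters_to_fp16 are reachable under paddle.static.amp (the same namespace as the other APIs in this PR), that build_model is the network-building function from the use case in section 3, and that the pre-trained weights are loaded by whatever mechanism was used to save them. The final cast_parameters_to_fp16 call is purely illustrative.

# Build a fresh inference program with the same fp16_guard placement as in training.
infer_program = paddle.static.Program()
infer_startup_program = paddle.static.Program()
build_model(infer_program, infer_startup_program, is_train=False)

# Cast it to fp16 with the same amp_lists/use_fp16_guard values used in training.
amp_list = paddle.static.amp.AutoMixedPrecisionLists(custom_black_list=['pool2d'])
fp16_var_names = paddle.static.amp.cast_model_to_fp16(
    infer_program, amp_lists=amp_list, use_fp16_guard=True)

# Load the pre-trained weights saved from pure fp16 training here, then run
# `infer_program` with an Executor for inference.

# Illustrative only: if the parameters in the current scope were still fp32,
# cast_parameters_to_fp16 could convert the vars returned above to fp16.
place = paddle.CUDAPlace(0)
paddle.static.amp.cast_parameters_to_fp16(
    place, infer_program, to_fp16_var_names=fp16_var_names)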

3. Use case

import paddle
import numpy
import paddle.nn.functional as F
import paddle.utils as utils
paddle.enable_static()

def build_model(main_prog, startup_prog, is_train=True):
    with utils.unique_name.guard():
        with paddle.static.program_guard(main_prog, startup_prog):
            data = paddle.static.data(name='image', shape=[None, 1, 28, 28], dtype='float32')
            label = paddle.static.data(name='label', shape=[None, 1], dtype='int64')
            conv2d = paddle.static.nn.conv2d(input=data, num_filters=6, filter_size=3)
            # 1) Use fp16_guard to control the range of fp16 kernels used.
            with paddle.static.amp.fp16_guard():
                bn = paddle.static.nn.batch_norm(input=conv2d, act="relu")
                pool = F.max_pool2d(bn, kernel_size=2, stride=2)
                hidden = paddle.static.nn.fc(pool, size=64)
                predict = F.softmax(hidden)
                if is_train:
                    loss = F.cross_entropy(input=predict, label=label, reduction='mean')
                else:
                    loss = predict
                return data, label, loss
                
train_program = paddle.static.Program()
test_program = paddle.static.Program()
startup_program = paddle.static.Program()
data, label, loss = build_model(train_program, startup_program, True)
build_model(test_program, startup_program, False)

# 2) Create the optimizer and set `multi_precision` to True.
# Setting `multi_precision` to True can avoid the poor accuracy
# or the slow convergence in a way. 
optimizer = paddle.optimizer.Adam(
    learning_rate=0.001, multi_precision=True)

# 3) These ops in `custom_black_list` will keep in the float32 computation type.
amp_list = paddle.static.amp.AutoMixedPrecisionLists(
    custom_black_list=['pool2d'])
# 4) The entry of Paddle AMP.
# Enable pure fp16 training by setting `use_pure_fp16` to True.
optimizer = paddle.static.amp.decorate(
    optimizer,
    amp_list,
    init_loss_scaling=128.0,
    use_dynamic_loss_scaling=True,
    use_pure_fp16=True)

# If you don't use the default_startup_program(), you should pass
# your defined `startup_program` into `minimize`, because it is required
# by the master weight creation process.
optimizer.minimize(loss, startup_program)

place = paddle.CUDAPlace(0)
exe = paddle.static.Executor(place)
exe.run(startup_program)

# 5) Use `amp_init` after the FP32 parameter initialization (such as `exe.run(startup_program)`).
# If you want to perform the testing process, you should pass `test_program` into `amp_init`.
optimizer.amp_init(place, test_program=test_program)

x = numpy.random.random(size=(4, 1, 28, 28)).astype('float32')
y = numpy.random.random(size=(4, 1)).astype('int64')
loss_data, = exe.run(train_program, feed={data.name: x, label.name: y }, fetch_list=[loss.name])
print(loss_data)

The comparison figure shows the original fp32 computation graph on the left and the computation graph after applying pure fp16 training on the right.

4. Restriction

The pure fp16 training strategy requires the optimizer in use to have a registered float16 kernel. So far, Momentum, Adam, and AdamW support float16 computation. All three have the multi_precision parameter, which helps avoid poor accuracy or slow convergence. In the future, more optimizers will support float16 computation.
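
For example, a sketch with Momentum (the hyperparameter values are arbitrary; Adam is shown in the use case above):

# Momentum also registers a float16 kernel; `multi_precision=True` keeps fp32
# master weights to help avoid poor accuracy or slow convergence.
optimizer = paddle.optimizer.Momentum(
    learning_rate=0.001, momentum=0.9, multi_precision=True)
optimizer = paddle.static.amp.decorate(optimizer, use_pure_fp16=True)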

@paddle-bot-old

Thanks for your contribution!
Please wait for the result of CI firstly. See Paddle CI Manual for details.

chalsliu previously approved these changes Jan 7, 2021
@Xreki (Contributor) left a comment

LGTM. Great work~

@zhiqiu (Contributor) left a comment

LGTM for unused_var_check.cc

@swtkiwi (Contributor) left a comment

LGTM
Please follow up by completing the example code in the documentation and adding the Chinese documentation~~

adam, ops::AdamOpKernel<paddle::platform::CPUDeviceContext, float>,
ops::AdamOpKernel<paddle::platform::CPUDeviceContext, double>);

REGISTER_OP_VERSION(adam)
A Collaborator commented on this diff:

adam is a training op, so there is actually no need to set an op version for it; adding one has no real effect either.

The Contributor Author replied:

OK, thanks for the reminder.

@wzzju wzzju merged commit 7f7dfcc into PaddlePaddle:develop Jan 8, 2021
wzzju added a commit to wzzju/Paddle that referenced this pull request Jan 8, 2021
* add cast ops before and after unsupported fp16 ops.

* Keep partial net in FP32 pattern.

* Support check_finite_and_unscale and update_loss_scaling for FP16 calculation mode.

* Add fp16 support for adam op.

* add multi precision attr for adam.

* Fix the bug of test_multi_precision_fp16_train UT.

* Code format for CI.

* Fix the redefine error about MPTypeTrait on windows.

* fix bugs of the _create_accumulators func in Momentum.

* fix bug when inserting post cast op.

* Add the update_loss_scaling op in allow_set of UnusedVarCheck.

* Update for ci coverage.

* Add some doc for OptimizerWithMixedPrecision.

* Fix the code style.

* Imporve the doc of `amp_init`.

* Change for fp16 testing if users have the infer program defined in separate way.
lanxianghit pushed a commit that referenced this pull request Jan 11, 2021
* Support pure fp16 training for AMP API. (#29544)

* add cast ops before and after unsupported fp16 ops.

* Keep partial net in FP32 pattern.

* Support check_finite_and_unscale and update_loss_scaling for FP16 calculation mode.

* Add fp16 support for adam op.

* add multi precision attr for adam.

* Fix the bug of test_multi_precision_fp16_train UT.

* Code format for CI.

* Fix the redefine error about MPTypeTrait on windows.

* fix bugs of the _create_accumulators func in Momentum.

* fix bug when inserting post cast op.

* Add the update_loss_scaling op in allow_set of UnusedVarCheck.

* Update for ci coverage.

* Add some doc for OptimizerWithMixedPrecision.

* Fix the code style.

* Imporve the doc of `amp_init`.

* Change for fp16 testing if users have the infer program defined in separate way.

* Remove tensor copy in the update_loss_scaling op. (#29426)

* remove tensor copy in the update_loss_scaling op

* not use thrust.

* fix some cuda memory access error.

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

8 participants