Conversation

@wzzju (Contributor) commented Dec 10, 2020

PR types

New features

PR changes

Others

Describe

1. Background

In the previous AMP implementation, a black & white list is used to control float16 computation. However, this strategy has two shortcomings:

  • Inserting many cast ops introduces overhead, which can be 5% ~ 10%.
  • It is too conservative to get the full speedup from float16 computation; some models could run more kernels in float16.

Therefore, we develop a pure fp16 training strategy, which uses float16 kernels as much as possible.

2. API Function

# 1) The entry of Paddle AMP.
def decorate(optimizer,
             amp_lists=None,
             init_loss_scaling=2**15,
             incr_every_n_steps=1000,
             decr_every_n_nan_or_inf=2,
             incr_ratio=2.0,
             decr_ratio=0.8,
             use_dynamic_loss_scaling=True,
             use_pure_fp16=False, # new parameter
             use_fp16_guard=None) # new parameter

# 2) The context manager of pure fp16 training.
def fp16_guard()

As shown above, to integrate pure fp16 training into the decorate API, we add two new parameters.

2.1 Description of use_pure_fp16 parameter

When use_pure_fp16 is set to True, AMP uses float16 kernels as much as possible. Otherwise, it adopts the black & white list based strategy.
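
For example, a minimal sketch of the switch (assuming an optimizer has already been created, as in the full use case in section 3):

# Pure fp16: use float16 kernels wherever possible.
optimizer = paddle.static.amp.decorate(optimizer, use_pure_fp16=True)
# With use_pure_fp16=False (the default), the black & white list strategy is kept.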

2.2 Description of use_fp16_guard parameter and fp16_guard API

The second new parameter, use_fp16_guard, controls which part of the model is computed in float16. When use_fp16_guard is set to False, all operators in the user-defined model are transformed to float16 except those in unsupported_fp16_list. When use_fp16_guard is set to True, only the ops created inside the fp16_guard context manager are transformed to float16. By default, use_fp16_guard is None, which means its value follows use_pure_fp16.
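
As a rough sketch (the layers here are only illustrative), with use_fp16_guard set to True only the ops created inside fp16_guard are transformed:

# Ops created outside fp16_guard are kept in float32 when use_fp16_guard=True.
conv = paddle.static.nn.conv2d(input=data, num_filters=6, filter_size=3)  # stays float32
with paddle.static.amp.fp16_guard():
    # Ops created here are transformed to float16 by the AMP pass.
    bn = paddle.static.nn.batch_norm(input=conv, act="relu")
    hidden = paddle.static.nn.fc(bn, size=64)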

2.3 Details about custom_black_list

Moreover, if users do not want certain op types to be transformed to float16, they can list them in custom_black_list. Ops in custom_black_list are kept in the float32 computation type regardless of whether use_fp16_guard is used.
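
For instance (a sketch; 'pool2d' is just an example op type, and the same list is used in the use case in section 3):

# Ops of type 'pool2d' are kept in float32, even inside fp16_guard.
amp_list = paddle.static.amp.AutoMixedPrecisionLists(
    custom_black_list=['pool2d'])
optimizer = paddle.static.amp.decorate(
    optimizer, amp_lists=amp_list, use_pure_fp16=True)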

2.4 Description of amp_init API

When users choose pure fp16 training, they should call the amp_init API to initialize the float16 parameters, as shown below.

# 3) `amp_init` is required by pure fp16 training to initialize float16 parameters.
def amp_init(self,
             place,
             scope=None,
             test_program=None,
             use_fp16_test=False)

Parameters defined in API 3) are described below:

  • The place is used to initialize the fp16 parameters with their fp32 values.
  • The scope is used to find the fp32 parameters.
  • The test_program is the program used for testing.
  • The use_fp16_test indicates whether to use fp16 testing.

Previously, the black & white list based strategy only transformed the training program, not the testing program. Pure fp16 training, however, also needs to transform the testing program, because no float32 parameters remain for the training and testing process. Therefore, if users have adopted pure fp16 training and want to perform testing, the testing program should be passed into amp_init.

The use_fp16_test parameter mainly controls whether the testing program is transformed to float16 under the black & white list based AMP strategy; it has no effect on pure fp16 training. In other words, if users choose pure fp16 training and pass test_program into the amp_init API, test_program will be transformed to float16 regardless of the use_fp16_test value.
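
Putting this together, a minimal sketch of the initialization order (assuming the decorated optimizer, startup_program, and test_program from the use case in section 3):

place = paddle.CUDAPlace(0)
exe = paddle.static.Executor(place)
exe.run(startup_program)  # initialize the fp32 parameters first
# Cast the initialized parameters to fp16; `test_program` is also transformed,
# because pure fp16 training leaves no fp32 parameters for the testing process.
optimizer.amp_init(place, test_program=test_program)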

2.5 Low-level APIs

# 4) cast the model to fp16
def cast_model_to_fp16(program, amp_lists=None, use_fp16_guard=True)

# 5) cast model parameters to fp16
def cast_parameters_to_fp16(place, program, scope=None, to_fp16_var_names=None)

cast_model_to_fp16 and cast_parameters_to_fp16 are two low-level APIs. In most cases, users do not need them and can simply use the decorate API.

  • cast_model_to_fp16
    The parameter program is the program to be cast to fp16. The meanings of amp_lists and use_fp16_guard are the same as in decorate. Users may need this API in the following special case: they have completed pure fp16 training with the decorate API, but instead of using save_inference_model and load_inference_model for inference, they define a new inference program and load the pre-trained weights into it. In this case, they should cast the defined inference program to fp16 with the cast_model_to_fp16 API and ensure that the values of amp_lists and use_fp16_guard are the same as in the previous pure fp16 training. If use_fp16_guard is set to True, they should also apply fp16_guard in the same places as in the previous pure fp16 training when building the inference program. A sketch of this workflow is given after this list.

  • cast_parameters_to_fp16
    The parameter program is the model to be processed. The place is used to store the fp16 weight tensors, and the scope is used to get the fp32 weight tensor values. Only the data types of the vars listed in to_fp16_var_names will be set to FP16; usually, to_fp16_var_names is the return value of the cast_model_to_fp16 API.
    For now, cast_parameters_to_fp16 has no concrete use case; we set it aside for future special use (its appearance in the sketch after this list is only illustrative).
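
Below is a hedged sketch of the inference workflow described above. It assumes that cast_model_to_fp16 and cast_parameters_to_fp16 are reachable under paddle.static.amp (the same namespace as the other APIs in this PR), that build_model is the network-building function from the use case in section 3, and that the pre-trained weights are loaded by whatever mechanism was used to save them. The final cast_parameters_to_fp16 call is purely illustrative.

# Build a fresh inference program with the same fp16_guard placement as in training.
infer_program = paddle.static.Program()
infer_startup_program = paddle.static.Program()
build_model(infer_program, infer_startup_program, is_train=False)

# Cast it to fp16 with the same amp_lists/use_fp16_guard values used in training.
amp_list = paddle.static.amp.AutoMixedPrecisionLists(custom_black_list=['pool2d'])
fp16_var_names = paddle.static.amp.cast_model_to_fp16(
    infer_program, amp_lists=amp_list, use_fp16_guard=True)

# Load the pre-trained weights saved from pure fp16 training here, then run
# `infer_program` with an Executor for inference.

# Illustrative only: if the parameters in the current scope were still fp32,
# cast_parameters_to_fp16 could convert the vars returned above to fp16.
place = paddle.CUDAPlace(0)
paddle.static.amp.cast_parameters_to_fp16(
    place, infer_program, to_fp16_var_names=fp16_var_names)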

3. Use case

import paddle
import numpy
import paddle.nn.functional as F
import paddle.utils as utils
paddle.enable_static()

def build_model(main_prog, startup_prog, is_train=True):
    with utils.unique_name.guard():
        with paddle.static.program_guard(main_prog, startup_prog):
            data = paddle.static.data(name='image', shape=[None, 1, 28, 28], dtype='float32')
            label = paddle.static.data(name='label', shape=[None, 1], dtype='int64')
            conv2d = paddle.static.nn.conv2d(input=data, num_filters=6, filter_size=3)
            # 1) Use fp16_guard to control the range of fp16 kernels used.
            with paddle.static.amp.fp16_guard():
                bn = paddle.static.nn.batch_norm(input=conv2d, act="relu")
                pool = F.max_pool2d(bn, kernel_size=2, stride=2)
                hidden = paddle.static.nn.fc(pool, size=64)
                predict = F.softmax(hidden)
                if is_train:
                    loss = F.cross_entropy(input=predict, label=label, reduction='mean')
                else:
                    loss = predict
                return data, label, loss
                
train_program = paddle.static.Program()
test_program = paddle.static.Program()
startup_program = paddle.static.Program()
data, label, loss = build_model(train_program, startup_program, True)
build_model(test_program, startup_program, False)

# 2) Create the optimizer and set `multi_precision` to True.
# Setting `multi_precision` to True can avoid the poor accuracy
# or the slow convergence in a way. 
optimizer = paddle.optimizer.Adam(
    learning_rate=0.001, multi_precision=True)

# 3) These ops in `custom_black_list` will keep in the float32 computation type.
amp_list = paddle.static.amp.AutoMixedPrecisionLists(
    custom_black_list=['pool2d'])
# 4) The entry of Paddle AMP.
# Enable pure fp16 training by setting `use_pure_fp16` to True.
optimizer = paddle.static.amp.decorate(
    optimizer,
    amp_list,
    init_loss_scaling=128.0,
    use_dynamic_loss_scaling=True,
    use_pure_fp16=True)

# If you don't use the default_startup_program(), you should pass
# your defined `startup_program` into `minimize`, because it is required
# by the master weight creation process.
optimizer.minimize(loss, startup_program)

place = paddle.CUDAPlace(0)
exe = paddle.static.Executor(place)
exe.run(startup_program)

# 5) Use `amp_init` after the FP32 parameter initialization (such as `exe.run(startup_program)`).
# If you want to perform the testing process, you should pass `test_program` into `amp_init`.
optimizer.amp_init(place, test_program=test_program)

x = numpy.random.random(size=(4, 1, 28, 28)).astype('float32')
y = numpy.random.random(size=(4, 1)).astype('int64')
loss_data, = exe.run(train_program, feed={data.name: x, label.name: y }, fetch_list=[loss.name])
print(loss_data)

The comparison figure shows the original fp32 computation graph on the left and the computation graph after applying pure fp16 training on the right.

4. Restriction

The pure fp16 training strategy requires the optimizer in use to have a registered float16 kernel. So far, Momentum, Adam, and AdamW support float16 computation. All three have the multi_precision parameter, which helps avoid poor accuracy or slow convergence. In the future, more optimizers will support float16 computation.
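
For example, a sketch with Momentum (the hyperparameter values are arbitrary; Adam is shown in the use case above):

# Momentum also registers a float16 kernel; `multi_precision=True` keeps fp32
# master weights to help avoid poor accuracy or slow convergence.
optimizer = paddle.optimizer.Momentum(
    learning_rate=0.001, momentum=0.9, multi_precision=True)
optimizer = paddle.static.amp.decorate(optimizer, use_pure_fp16=True)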

@paddle-bot-old

Thanks for your contribution!
Please wait for the result of CI firstly. See Paddle CI Manual for details.

chalsliu previously approved these changes Jan 7, 2021
@Xreki (Contributor) left a comment

LGTM. Great work~

@zhiqiu (Contributor) left a comment

LGTM for unused_var_check.cc

@swtkiwi (Contributor) left a comment

LGTM
Please follow up by completing the example code in the documentation and adding the Chinese documentation~~

adam, ops::AdamOpKernel<paddle::platform::CPUDeviceContext, float>,
ops::AdamOpKernel<paddle::platform::CPUDeviceContext, double>);

REGISTER_OP_VERSION(adam)
A Collaborator commented on this diff:

adam is a training op, so there is actually no need to set an op version for it; adding one has no real effect either.

The Contributor Author replied:

OK, thanks for the reminder.

@wzzju wzzju merged commit 7f7dfcc into PaddlePaddle:develop Jan 8, 2021
wzzju added a commit to wzzju/Paddle that referenced this pull request Jan 8, 2021
* add cast ops before and after unsupported fp16 ops.

* Keep partial net in FP32 pattern.

* Support check_finite_and_unscale and update_loss_scaling for FP16 calculation mode.

* Add fp16 support for adam op.

* add multi precision attr for adam.

* Fix the bug of test_multi_precision_fp16_train UT.

* Code format for CI.

* Fix the redefine error about MPTypeTrait on windows.

* fix bugs of the _create_accumulators func in Momentum.

* fix bug when inserting post cast op.

* Add the update_loss_scaling op in allow_set of UnusedVarCheck.

* Update for ci coverage.

* Add some doc for OptimizerWithMixedPrecision.

* Fix the code style.

* Imporve the doc of `amp_init`.

* Change for fp16 testing if users have the infer program defined in separate way.
lanxianghit pushed a commit that referenced this pull request Jan 11, 2021
* Support pure fp16 training for AMP API. (#29544)

* add cast ops before and after unsupported fp16 ops.

* Keep partial net in FP32 pattern.

* Support check_finite_and_unscale and update_loss_scaling for FP16 calculation mode.

* Add fp16 support for adam op.

* add multi precision attr for adam.

* Fix the bug of test_multi_precision_fp16_train UT.

* Code format for CI.

* Fix the redefine error about MPTypeTrait on windows.

* fix bugs of the _create_accumulators func in Momentum.

* fix bug when inserting post cast op.

* Add the update_loss_scaling op in allow_set of UnusedVarCheck.

* Update for ci coverage.

* Add some doc for OptimizerWithMixedPrecision.

* Fix the code style.

* Imporve the doc of `amp_init`.

* Change for fp16 testing if users have the infer program defined in separate way.

* Remove tensor copy in the update_loss_scaling op. (#29426)

* remove tensor copy in the update_loss_scaling op

* not use thrust.

* fix some cuda memory access error.

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

8 participants