Conversation

Contributor

@wangxicoding wangxicoding commented Sep 13, 2021

PR types

Bug fixes

PR changes

Others

Describe

Fixes the precision bug introduced by #33565.

Root cause

The PR above inserted two sum ops into GradientClipByGlobalNorm, so under hybrid parallelism two c_allreduce_sum ops are likewise inserted.
Originally global_norm = 9 = c_allreduce_sum(5, 4); with the extra c_allreduce_sum, global_norm = 18 = c_allreduce_sum(9, 9), which is wrong.
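The double-counting can be sketched with a toy allreduce over two ranks whose partial norm values are 5 and 4. The function and values here are illustrative only, not Paddle's actual API:

```python
# Toy model of c_allreduce_sum: after the op, every rank holds the
# sum of all ranks' local values.
def c_allreduce_sum(per_rank_vals):
    total = sum(per_rank_vals)
    return [total] * len(per_rank_vals)

partial_norms = [5.0, 4.0]  # per-rank partial global-norm values

# Correct: a single sum op triggers a single allreduce.
once = c_allreduce_sum(partial_norms)   # every rank holds 9.0

# Bug: a second sum op inserts a second allreduce over the
# already-reduced values, doubling the result.
twice = c_allreduce_sum(once)           # every rank holds 18.0
```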

Fix

Temporary fix: check the length of global_norm_var_list; if it is 1, keep the original logic with only a single sum.
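A minimal sketch of that guard, under the assumption that the partial norms are combined by a helper like the hypothetical merge_global_norm below (names are illustrative; the actual change lives inside GradientClipByGlobalNorm):

```python
def merge_global_norm(global_norm_var_list):
    # Temporary fix: with exactly one entry, keep the original logic
    # and use it directly, so only one sum (and hence only one
    # c_allreduce_sum under hybrid parallelism) is emitted.
    if len(global_norm_var_list) == 1:
        return global_norm_var_list[0]
    # Multiple entries: add the partial norms together explicitly.
    total = global_norm_var_list[0]
    for var in global_norm_var_list[1:]:
        total = total + var
    return total
```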

@paddle-bot-old

Thanks for your contribution!
Please wait for the result of CI firstly. See Paddle CI Manual for details.

@wangxicoding wangxicoding force-pushed the fix_hybrid_clip_grad_global_norm branch from a480513 to 8414446 Compare September 13, 2021 11:36
@wangxicoding wangxicoding force-pushed the fix_hybrid_clip_grad_global_norm branch from 8414446 to 604140c Compare September 13, 2021 12:42
Contributor

@gongweibao gongweibao left a comment

LGTM

@gongweibao gongweibao merged commit 598d32d into PaddlePaddle:develop Sep 14, 2021
@wangxicoding wangxicoding deleted the fix_hybrid_clip_grad_global_norm branch September 14, 2021 06:27
AnnaTrainingG pushed a commit to AnnaTrainingG/Paddle that referenced this pull request Sep 29, 2021