fuse L2Decay and momentum when param.regularizer is set #32845
Conversation
Thanks for your contribution!
Sorry to inform you that e88475d's CIs have passed for more than 7 days. To prevent PR conflicts, you need to re-run all CIs manually.
python/paddle/optimizer/momentum.py (outdated)
if framework.in_dygraph_mode():
    new_grad = core.ops.sum([grad, regularization_term])
else:
    grad.block.append_op(type='sum', inputs=inputs, outputs=outputs)
Could L270 - L297 be written directly as a call to the base class's _create_regularization_of_grad function?
done
LGTM
LGTM
PR types
Performance optimization

PR changes
Others

Describe
fuse L2Decay and momentum when param.regularizer is set

before
Paddle currently supports fusing momentum + L2Decay:
Paddle/python/paddle/optimizer/momentum.py
Lines 108 to 115 in 1ef2327
In _append_optimize_op, the following attributes of the momentum op are set so that both the weight_decay and the momentum computation are performed inside the momentum op, which achieves the fusion:
Paddle/python/paddle/optimizer/momentum.py
Lines 209 to 210 in 1ef2327
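A rough sketch of this mechanism is shown below. The attribute names regularization_method and regularization_coeff come from the description above; the class name MomentumSketch, the _op_attrs helper, and the access to weight_decay._regularization_coeff are illustrative assumptions, not the actual diff (the real code is in the referenced lines of momentum.py).

from paddle.regularizer import L2Decay

# Illustrative sketch only: when the optimizer-level weight_decay is an
# L2Decay regularizer, Momentum records the method and coefficient so the
# momentum op can apply weight decay itself (no separate scale/sum ops).
class MomentumSketch:
    def __init__(self, momentum=0.9, weight_decay=None):
        self._momentum = momentum
        self._regularization_method = ""
        self._regularization_coeff = 0.0
        if isinstance(weight_decay, L2Decay):
            self._regularization_method = "l2_decay"
            # assumption: L2Decay stores its coefficient in _regularization_coeff
            self._regularization_coeff = weight_decay._regularization_coeff

    def _op_attrs(self):
        # Hypothetical helper: these attrs are what _append_optimize_op
        # attaches to the momentum op to fuse the decay into it.
        return {
            "mu": self._momentum,
            "regularization_method": self._regularization_method,
            "regularization_coeff": self._regularization_coeff,
        }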
However, if a global regularizer=L2Decay is set through momentum's weight_decay argument while some layers also set a specific regularizer through paddle.ParamAttr, the following happens (an example of such a configuration is sketched after this list):
1. In _append_optimize_op, the momentum op's attributes are set to enable the fusion.
2. append_regularization_ops(params_grads, self.regularization) and self._create_optimization_pass(params_grads) are called.
3. _create_regularization_of_grad performs the weight_decay; as in the code below, it executes the param's own regularizer:
Paddle/python/paddle/fluid/regularizer.py
Lines 25 to 40 in 1ef2327
4. In _append_optimize_op, because self._regularization_method and self._regularization_coeff were set in (1), the momentum op performs weight_decay a second time.
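For illustration, a configuration of the kind described above might look like the following; the layer sizes and coefficients are made up, and only the combination of a global L2Decay weight_decay plus a per-parameter regularizer via paddle.ParamAttr matters:

import paddle
from paddle.regularizer import L2Decay

# A layer whose weight gets its own regularizer through ParamAttr.
linear = paddle.nn.Linear(
    10, 10,
    weight_attr=paddle.ParamAttr(regularizer=L2Decay(1e-4)))

# A global L2Decay passed as weight_decay to Momentum.
opt = paddle.optimizer.Momentum(
    learning_rate=0.01,
    momentum=0.9,
    weight_decay=L2Decay(1e-4),
    parameters=linear.parameters())

# Before this PR: the per-parameter regularizer is applied by explicit
# scale/sum ops AND the momentum op applies L2 decay again (set up from the
# global weight_decay), i.e. the weight is decayed twice.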
after
Since append_regularization_ops(params_grads, self.regularization) iterates over all parameters and applies each parameter's regularization, when momentum is used we need to check, during that iteration, whether a parameter's regularizer is L2Decay and, if so, skip its regularization; the momentum op's regularization_method attribute is then set in _append_optimize_op instead. This PR therefore makes the following changes:
1. append_regularization_ops and _create_regularization_of_grad are deleted from regularizer.py and moved into optimizer.py as instance methods of the Optimizer class, which ensures other optimizers are not affected.
Paddle/python/paddle/fluid/regularizer.py
Lines 25 to 108 in 5fa44c3
2. A _create_regularization_of_grad method is added for Momentum; its only difference from the parent class's method is that when the param has L2Decay set, that parameter's regularization is skipped directly. For details, see the changes to momentum.py in this PR; a simplified sketch is given after this section.
In summary, whenever the regularizer specified on a parameter is L2Decay, that parameter's regularizer is used in place of the global setting, so regularization is not applied twice while the fusion is still achieved.
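A minimal sketch of the override described in point 2, assuming the method keeps the base class's (param, grad, regularization) signature; the actual implementation is in momentum.py in this PR and may differ in detail:

from paddle.optimizer import Optimizer
from paddle.fluid.regularizer import L2DecayRegularizer

class Momentum(Optimizer):  # sketch; the real class is paddle.optimizer.Momentum
    def _create_regularization_of_grad(self, param, grad, regularization=None):
        # If the parameter's own regularizer is L2Decay, skip the explicit
        # regularization ops: that decay is fused into the momentum op via
        # the regularization_method/regularization_coeff attributes instead.
        if hasattr(param, 'regularizer') and isinstance(param.regularizer,
                                                        L2DecayRegularizer):
            return grad  # leave the gradient untouched; the momentum op decays it
        # Otherwise fall back to the generic Optimizer behaviour.
        return super(Momentum, self)._create_regularization_of_grad(
            param, grad, regularization)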
performance
TSM was used for testing. This model sets its own regularizer=L2Decay for some of its parameters, so before the fix those parameters were regularized twice; the profile report shows the resulting repeated scale and sum op calls.
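For reference, a profile report of that kind can be collected with Paddle's built-in profiler; the sketch below uses the paddle.fluid.profiler API of that era and a placeholder train_one_step() function, both of which are assumptions rather than part of this PR:

import paddle.fluid.profiler as profiler

# Profile a few training iterations and print a per-op summary, in which the
# extra scale/sum ops from the duplicated regularization would show up.
profiler.start_profiler('All')          # profile both CPU and GPU kernels
for step in range(10):
    train_one_step()                    # placeholder for the model's train step
profiler.stop_profiler(sorted_key='total', profile_path='/tmp/profile')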
This bug may also have affected convergence speed and accuracy. Comparing against the training log from before the fix: