fuse L2Decay and momentum when param.regularizer is set #32845
Conversation
Thanks for your contribution!
Sorry to inform you that e88475d's CIs have passed for more than 7 days. To prevent PR conflicts, you need to re-run all CIs manually.
python/paddle/optimizer/momentum.py (outdated)
if framework.in_dygraph_mode():
    new_grad = core.ops.sum([grad, regularization_term])
else:
    grad.block.append_op(type='sum', inputs=inputs, outputs=outputs)
Could L270 - L297 be written directly as a call to the base class's _create_regularization_of_grad function?
done
LGTM
LGTM
PR types
Performance optimization

PR changes
Others

Describe
fuse L2Decay and momentum when param.regularizer is set

before
Paddle currently supports fusing momentum + L2Decay:
Paddle/python/paddle/optimizer/momentum.py
Lines 108 to 115 in 1ef2327
In _append_optimize_op, the following attributes of the momentum op are set so that both the weight_decay and the momentum computation are performed inside the momentum op, which achieves the fusion:
Paddle/python/paddle/optimizer/momentum.py
Lines 209 to 210 in 1ef2327
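A rough sketch of this mechanism is shown below. The attribute names regularization_method and regularization_coeff come from the description above; the class name MomentumSketch, the _op_attrs helper, and the access to weight_decay._regularization_coeff are illustrative assumptions, not the actual diff (the real code is in the referenced lines of momentum.py).

from paddle.regularizer import L2Decay

# Illustrative sketch only: when the optimizer-level weight_decay is an
# L2Decay regularizer, Momentum records the method and coefficient so the
# momentum op can apply weight decay itself (no separate scale/sum ops).
class MomentumSketch:
    def __init__(self, momentum=0.9, weight_decay=None):
        self._momentum = momentum
        self._regularization_method = ""
        self._regularization_coeff = 0.0
        if isinstance(weight_decay, L2Decay):
            self._regularization_method = "l2_decay"
            # assumption: L2Decay stores its coefficient in _regularization_coeff
            self._regularization_coeff = weight_decay._regularization_coeff

    def _op_attrs(self):
        # Hypothetical helper: these attrs are what _append_optimize_op
        # attaches to the momentum op to fuse the decay into it.
        return {
            "mu": self._momentum,
            "regularization_method": self._regularization_method,
            "regularization_coeff": self._regularization_coeff,
        }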
However, if a global regularizer=L2Decay is set through momentum's weight_decay argument while some layers also set a specific regularizer through paddle.ParamAttr, the following happens (an example of such a configuration is sketched after this list):
1. In _append_optimize_op, the momentum op's attributes are set to enable the fusion.
2. append_regularization_ops(params_grads, self.regularization) and self._create_optimization_pass(params_grads) are called.
3. _create_regularization_of_grad performs the weight_decay; as in the code below, it executes the param's own regularizer:
Paddle/python/paddle/fluid/regularizer.py
Lines 25 to 40 in 1ef2327
4. In _append_optimize_op, because self._regularization_method and self._regularization_coeff were set in (1), the momentum op performs weight_decay a second time.
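For illustration, a configuration of the kind described above might look like the following; the layer sizes and coefficients are made up, and only the combination of a global L2Decay weight_decay plus a per-parameter regularizer via paddle.ParamAttr matters:

import paddle
from paddle.regularizer import L2Decay

# A layer whose weight gets its own regularizer through ParamAttr.
linear = paddle.nn.Linear(
    10, 10,
    weight_attr=paddle.ParamAttr(regularizer=L2Decay(1e-4)))

# A global L2Decay passed as weight_decay to Momentum.
opt = paddle.optimizer.Momentum(
    learning_rate=0.01,
    momentum=0.9,
    weight_decay=L2Decay(1e-4),
    parameters=linear.parameters())

# Before this PR: the per-parameter regularizer is applied by explicit
# scale/sum ops AND the momentum op applies L2 decay again (set up from the
# global weight_decay), i.e. the weight is decayed twice.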
after
Since append_regularization_ops(params_grads, self.regularization) iterates over all parameters and applies each parameter's regularization, when momentum is used we need to check, during that iteration, whether a parameter's regularizer is L2Decay and, if so, skip its regularization; the momentum op's regularization_method attribute is then set in _append_optimize_op instead. This PR therefore makes the following changes:
1. append_regularization_ops and _create_regularization_of_grad are deleted from regularizer.py and moved into optimizer.py as instance methods of the Optimizer class, which ensures other optimizers are not affected.
Paddle/python/paddle/fluid/regularizer.py
Lines 25 to 108 in 5fa44c3
2. A _create_regularization_of_grad method is added for Momentum; its only difference from the parent class's method is that when the param has L2Decay set, that parameter's regularization is skipped directly. For details, see the changes to momentum.py in this PR; a simplified sketch is given after this section.
In summary, whenever the regularizer specified on a parameter is L2Decay, that parameter's regularizer is used in place of the global setting, so regularization is not applied twice while the fusion is still achieved.
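A minimal sketch of the override described in point 2, assuming the method keeps the base class's (param, grad, regularization) signature; the actual implementation is in momentum.py in this PR and may differ in detail:

from paddle.optimizer import Optimizer
from paddle.fluid.regularizer import L2DecayRegularizer

class Momentum(Optimizer):  # sketch; the real class is paddle.optimizer.Momentum
    def _create_regularization_of_grad(self, param, grad, regularization=None):
        # If the parameter's own regularizer is L2Decay, skip the explicit
        # regularization ops: that decay is fused into the momentum op via
        # the regularization_method/regularization_coeff attributes instead.
        if hasattr(param, 'regularizer') and isinstance(param.regularizer,
                                                        L2DecayRegularizer):
            return grad  # leave the gradient untouched; the momentum op decays it
        # Otherwise fall back to the generic Optimizer behaviour.
        return super(Momentum, self)._create_regularization_of_grad(
            param, grad, regularization)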
performance
TSM was used for testing. This model sets its own regularizer=L2Decay for some of its parameters, so before the fix those parameters were regularized twice; the profile report shows the resulting repeated scale and sum op calls.
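For reference, a profile report of that kind can be collected with Paddle's built-in profiler; the sketch below uses the paddle.fluid.profiler API of that era and a placeholder train_one_step() function, both of which are assumptions rather than part of this PR:

import paddle.fluid.profiler as profiler

# Profile a few training iterations and print a per-op summary, in which the
# extra scale/sum ops from the duplicated regularization would show up.
profiler.start_profiler('All')          # profile both CPU and GPU kernels
for step in range(10):
    train_one_step()                    # placeholder for the model's train step
profiler.stop_profiler(sorted_key='total', profile_path='/tmp/profile')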
This bug may also have affected convergence speed and accuracy. Comparing against the training log from before the fix: