Commit 802a81d

Authored by xiaoguoguo626807, xysheng-baidu, Aurelius84, thisjiang, and cxxly
【prim】New layer_norm grad (PaddlePaddle#51750)
* Add flatten composite rule
* get the right xshape and pass func test
* add cinn unit test
* Remove cinn test, wait for it to be added after repair
* add comp test to test_flatten_contiguous_range_op.py
* remove func test on composite_ops
* Add comments to maybe_wrap_dim func
* remove commented code
* fix the problem with 0D tensor case
* add flatten split rule comment
* fix syntax issues
* block flatten on resnet_prim_cinn
* init change
* tmp commit
* add layer_norm InferMeta check
* cast type modify
* [CINN] Enhance CacheKey hash logic by considering input dtypes (PaddlePaddle#50557)
* [CINN] Enhance CacheKey hash logic by considering input dtypes
* add unittest
* fix typo
* fix typo
* fix map.at
* fix find
* fix test
* fix cinn cache key structure realize
* using ordered map for attributes
* add test by review advice
---------
Co-authored-by: jiangcheng <thisjiang@qq.com>
* [prim] enable dygraph_to_static to support custom_vjp
* Pr 50885 (#7)
* [CINN] Enhance CacheKey hash logic by considering input dtypes (PaddlePaddle#50557)
* [CINN] Enhance CacheKey hash logic by considering input dtypes
* add unittest
* fix typo
* fix typo
* fix map.at
* fix find
* fix test
* fix cinn cache key structure realize
* using ordered map for attributes
* add test by review advice
---------
Co-authored-by: jiangcheng <thisjiang@qq.com>
* [prim] enable dygraph_to_static to support custom_vjp
* fix code in a dy2static-friendly way.
* [dystatic] add hooker for prim
---------
Co-authored-by: Aurelius84 <zhangliujie@baidu.com>
Co-authored-by: jiangcheng <thisjiang@qq.com>
Co-authored-by: cxxly <chenxx_id@163.com>
* [prim] enable dygraph_to_static to support custom_vjp
* fix cast prim and vjp dtype mapping error bug
* recover
* big tol
* [CINN] Enhance CacheKey hash logic by considering input dtypes (PaddlePaddle#50557)
* [CINN] Enhance CacheKey hash logic by considering input dtypes
* add unittest
* fix typo
* fix typo
* fix map.at
* fix find
* fix test
* fix cinn cache key structure realize
* using ordered map for attributes
* add test by review advice
---------
Co-authored-by: jiangcheng <thisjiang@qq.com>
* [prim] enable dygraph_to_static to support custom_vjp
* Pr 50885 (#7)
* [CINN] Enhance CacheKey hash logic by considering input dtypes (PaddlePaddle#50557)
* [CINN] Enhance CacheKey hash logic by considering input dtypes
* add unittest
* fix typo
* fix typo
* fix map.at
* fix find
* fix test
* fix cinn cache key structure realize
* using ordered map for attributes
* add test by review advice
---------
Co-authored-by: jiangcheng <thisjiang@qq.com>
* [prim] enable dygraph_to_static to support custom_vjp
* fix code in a dy2static-friendly way.
* [dystatic] add hooker for prim
---------
Co-authored-by: Aurelius84 <zhangliujie@baidu.com>
Co-authored-by: jiangcheng <thisjiang@qq.com>
Co-authored-by: cxxly <chenxx_id@163.com>
* [prim] enable dygraph_to_static to support custom_vjp
* fix cast prim and vjp dtype mapping error bug
* Cxx prim custom vjp (#8)
* [CINN] Enhance CacheKey hash logic by considering input dtypes (PaddlePaddle#50557)
---------
Co-authored-by: jiangcheng <thisjiang@qq.com>
* [prim] enable dygraph_to_static to support custom_vjp
* Pr 50885 (#7)
* [CINN] Enhance CacheKey hash logic by considering input dtypes (PaddlePaddle#50557)
* [CINN] Enhance CacheKey hash logic by considering input dtypes
---------
Co-authored-by: jiangcheng <thisjiang@qq.com>
* [prim] enable dygraph_to_static to support custom_vjp
* fix code in a dy2static-friendly way.
* [dystatic] add hooker for prim
---------
Co-authored-by: Aurelius84 <zhangliujie@baidu.com>
Co-authored-by: jiangcheng <thisjiang@qq.com>
Co-authored-by: cxxly <chenxx_id@163.com>
* [prim] enable dygraph_to_static to support custom_vjp
* fix cast prim and vjp dtype mapping error bug
* [dy2static-ci] fix dy2static ci errors.
---------
Co-authored-by: Aurelius84 <zhangliujie@baidu.com>
Co-authored-by: jiangcheng <thisjiang@qq.com>
Co-authored-by: cxxly <chenxx_id@163.com>
* [Prim] enable whitelist and blacklist for custom_vjp
* debug log
* clear log
* fix
* nothing
* less memory
* recover utils
* fix
* modify threshold value
* skip layer_norm for test_bert
* back to bert success state
* add epsion
* delete unnecessary compute
* modify amp dtype
* modify
* order
* delete sqrt check and fp16
---------
Co-authored-by: xuyongsheng <xuyongsheng@baidu.com>
Co-authored-by: xysheng-baidu <121540080+xysheng-baidu@users.noreply.github.com>
Co-authored-by: Aurelius84 <zhangliujie@baidu.com>
Co-authored-by: jiangcheng <thisjiang@qq.com>
Co-authored-by: cxxly <chenxx_id@163.com>
Co-authored-by: xiongkun <807377414@qq.com>
1 parent b81188f commit 802a81d

9 files changed (+330, -49 lines)


paddle/fluid/operators/layer_norm_op.cc

Lines changed: 71 additions & 2 deletions
@@ -15,7 +15,13 @@ limitations under the License. */
 #include <memory>
 #include <string>

+#include "paddle/fluid/framework/infershape_utils.h"
 #include "paddle/fluid/framework/op_registry.h"
+#include "paddle/fluid/prim/api/composite_backward/composite_backward_api.h"
+#include "paddle/fluid/prim/utils/static/composite_grad_desc_maker.h"
+#include "paddle/fluid/prim/utils/static/desc_tensor.h"
+#include "paddle/phi/core/infermeta_utils.h"
+#include "paddle/phi/infermeta/ternary.h"

 namespace paddle {
 namespace operators {
@@ -253,15 +259,78 @@ class LayerNormGradOpMaker : public framework::SingleGradOpMaker<T> {
 DECLARE_NO_NEED_BUFFER_VARS_INFERER(LayerNormGradNoNeedBufferVarInferer,
                                     "Bias");

+class LayerNormCompositeGradOpMaker : public prim::CompositeGradOpMakerBase {
+  using prim::CompositeGradOpMakerBase::CompositeGradOpMakerBase;
+
+ public:
+  void Apply() override {
+    // get inputs
+    paddle::Tensor x = this->GetSingleForwardInput("X");
+    paddle::Tensor mean = this->GetSingleForwardOutput("Mean");
+    paddle::Tensor var = this->GetSingleForwardOutput("Variance");
+    paddle::Tensor y_grad = this->GetSingleOutputGrad("Y");
+    paddle::optional<paddle::Tensor> scale =
+        this->GetOptionalSingleForwardInput("Scale");
+    paddle::optional<paddle::Tensor> bias =
+        this->GetOptionalSingleForwardInput("Bias");
+
+    // get Attrs
+    auto epsilon = this->Attr<float>("epsilon");
+    auto begin_norm_axis = this->Attr<int>("begin_norm_axis");
+
+    // get outputs
+    paddle::Tensor x_grad = this->GetSingleInputGrad("X");
+    paddle::Tensor scale_grad = this->GetSingleInputGrad("Scale");
+    paddle::Tensor bias_grad = this->GetSingleInputGrad("Bias");
+
+    auto dx_ptr = this->GetOutputPtr(&x_grad);
+    std::string dx_name = this->GetOutputName(x_grad);
+    auto dscale_ptr = this->GetOutputPtr(&scale_grad);
+    std::string dscale_name = this->GetOutputName(scale_grad);
+    auto dbias_ptr = this->GetOutputPtr(&bias_grad);
+    std::string dbias_name = this->GetOutputName(bias_grad);
+
+    VLOG(6) << "Runing layer_norm_grad composite func";
+    prim::layer_norm_grad<prim::DescTensor>(x,
+                                            scale,
+                                            bias,
+                                            mean,
+                                            var,
+                                            y_grad,
+                                            epsilon,
+                                            begin_norm_axis,
+                                            dx_ptr,
+                                            dscale_ptr,
+                                            dbias_ptr);
+
+    this->RecoverOutputName(x_grad, dx_name);
+    this->RecoverOutputName(scale_grad, dscale_name);
+    this->RecoverOutputName(bias_grad, dbias_name);
+  }
+};
+
 }  // namespace operators
 }  // namespace paddle

 namespace ops = paddle::operators;
+
+DECLARE_INFER_SHAPE_FUNCTOR(layer_norm,
+                            LayerNormInferShapeFunctor,
+                            PD_INFER_META(phi::LayerNormInferMeta));
+
 REGISTER_OPERATOR(layer_norm,
                   ops::LayerNormOp,
                   ops::LayerNormOpMaker,
                   ops::LayerNormGradOpMaker<paddle::framework::OpDesc>,
-                  ops::LayerNormGradOpMaker<paddle::imperative::OpBase>);
+                  ops::LayerNormGradOpMaker<paddle::imperative::OpBase>,
+                  ops::LayerNormCompositeGradOpMaker,
+                  LayerNormInferShapeFunctor);
+
+DECLARE_INFER_SHAPE_FUNCTOR(layer_norm_grad,
+                            LayerNormGradInferShapeFunctor,
+                            PD_INFER_META(phi::LayerNormGradInferMeta));
+
 REGISTER_OPERATOR(layer_norm_grad,
                   ops::LayerNormGradOp,
-                  ops::LayerNormGradNoNeedBufferVarInferer);
+                  ops::LayerNormGradNoNeedBufferVarInferer,
+                  LayerNormGradInferShapeFunctor);
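For reference, the ops::LayerNormCompositeGradOpMaker registered above delegates to prim::layer_norm_grad, which rebuilds the layer normalization backward pass from primitive ops. Writing $\hat{x}_{ij} = (x_{ij} - \mu_i)/\sqrt{\sigma_i^2 + \epsilon}$ for the normalized input, $g_{ij} = \gamma_j\,\partial L/\partial y_{ij}$ for the scaled output gradient, and $n$ for the number of normalized elements per row (`shape_2` in the composite rule), the gradients being decomposed are the standard ones (summarized here, not quoted from the patch):

$$
\frac{\partial L}{\partial \beta_j} = \sum_i \frac{\partial L}{\partial y_{ij}},\qquad
\frac{\partial L}{\partial \gamma_j} = \sum_i \hat{x}_{ij}\,\frac{\partial L}{\partial y_{ij}},\qquad
\frac{\partial L}{\partial x_{ij}} = \frac{1}{\sqrt{\sigma_i^2+\epsilon}}\Big(g_{ij} - \frac{1}{n}\sum_k g_{ik} - \frac{\hat{x}_{ij}}{n}\sum_k \hat{x}_{ik}\,g_{ik}\Big).
$$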

paddle/fluid/prim/api/api.yaml

Lines changed: 1 addition & 0 deletions
@@ -35,6 +35,7 @@
 - tile
 - transpose
 - pad
+- sqrt
 - cumsum
 - put_along_axis
 - greater_than

paddle/fluid/prim/api/composite_backward/composite_backward_api.h

Lines changed: 95 additions & 0 deletions
@@ -896,6 +896,101 @@ void slice_grad(const Tensor& input,
 }
 }

+template <typename T>
+void layer_norm_grad(const Tensor& x,
+                     const paddle::optional<Tensor>& scale,
+                     const paddle::optional<Tensor>& bias,
+                     const Tensor& mean,
+                     const Tensor& variance,
+                     const Tensor& out_grad,
+                     float epsilon,
+                     int begin_norm_axis,
+                     Tensor* x_grad,
+                     Tensor* scale_grad,
+                     Tensor* bias_grad) {
+  auto x_dims = x.dims();
+  auto shape_1 = 1;  // front part
+  auto shape_2 = 1;  // back part
+  for (int i = 0; i < begin_norm_axis; ++i) {
+    shape_1 *= x_dims[i];
+  }
+  for (int i = begin_norm_axis; i < x.dims().size(); ++i) {
+    shape_2 *= x_dims[i];
+  }
+  auto scale_ptr = scale.get_ptr();
+  auto bias_ptr = bias.get_ptr();
+
+  // cast dtype to float32 if dtype =float16
+  Tensor x_cast = x;
+  Tensor out_grad_cast = out_grad;
+  Tensor scale_cast;
+  if (scale_ptr) {
+    scale_cast = reshape<T>(*scale_ptr, std::vector<int64_t>({1, shape_2}));
+  }
+  if (x.dtype() == phi::DataType::FLOAT16) {
+    x_cast = cast<T>(x, phi::DataType::FLOAT32);
+    out_grad_cast = cast<T>(out_grad, phi::DataType::FLOAT32);
+    if (scale_ptr) {
+      scale_cast = cast<T>(scale_cast, phi::DataType::FLOAT32);
+    }
+  }
+
+  x_cast = reshape<T>(x_cast, std::vector<int64_t>({shape_1, shape_2}));
+  out_grad_cast =
+      reshape<T>(out_grad_cast, std::vector<int64_t>({shape_1, shape_2}));
+  auto mean_ = reshape<T>(mean, std::vector<int64_t>({shape_1, 1}));
+  auto variance_ = reshape<T>(variance, std::vector<int64_t>({shape_1, 1}));
+  if (bias_grad) {
+    if (bias_ptr) {
+      auto bias_grad_tmp =
+          out_grad_cast.sum(std::vector<int64_t>({0}), x_cast.dtype(), true);
+      bias_grad_tmp = reshape<T>(bias_grad_tmp, bias_ptr->shape());
+      set_output<T>(bias_grad_tmp, bias_grad);
+    } else {
+      bias_grad = nullptr;
+    }
+  }
+  auto x_sub_mean = x_cast - mean_;
+  auto tmp = (1.0 / (variance_ + epsilon));
+  auto sqrt_var_1 = sqrt<T>(tmp);
+  if (scale_grad) {
+    if (scale_ptr) {
+      auto scale_grad_tmp =
+          (x_sub_mean * sqrt_var_1 * out_grad_cast)
+              .sum(std::vector<int64_t>({0}), x_cast.dtype(), true);
+      scale_grad_tmp = reshape<T>(scale_grad_tmp, scale_ptr->shape());
+      set_output<T>(scale_grad_tmp, scale_grad);
+    } else {
+      scale_grad = nullptr;
+    }
+  }
+
+  if (x_grad) {
+    if (!scale_ptr) {
+      scale_cast =
+          full<T>(std::vector<int64_t>({1, shape_2}), 1.0, x_cast.dtype());
+    }
+    auto out_grad_scale = out_grad_cast * scale_cast;
+    auto dx_end = (sqrt_var_1 * out_grad_scale);
+    auto d_mean_0 =
+        (-dx_end).sum(std::vector<int64_t>({1}), x_cast.dtype(), true);
+    auto d_mean = (1.0 / shape_2) * d_mean_0;
+    auto d_std_1 = (-tmp * x_sub_mean * out_grad_scale)
+                       .sum(std::vector<int64_t>({1}), x_cast.dtype(), true);
+    auto d_std_2 = (1.0 / shape_2) * sqrt_var_1;
+    d_std_2 = reshape<T>(d_std_2, std::vector<int64_t>({shape_1, 1}));
+    d_std_2 = d_std_2 * x_sub_mean;
+    auto d_std = d_std_1 * d_std_2;
+
+    auto x_grad_tmp = dx_end + d_mean + d_std;
+    x_grad_tmp = reshape<T>(x_grad_tmp, phi::vectorize(x.dims()));
+    if (x.dtype() == phi::DataType::FLOAT16) {
+      x_grad_tmp = cast<T>(x_grad_tmp, x.dtype());
+    }
+    set_output<T>(x_grad_tmp, x_grad);
+  }
+}
+
 template <typename T>
 void cumsum_grad(const Tensor& x,
                  const Tensor& out_grad,
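To make the decomposition above easier to check, here is a NumPy mirror of the same computation (a sketch, not part of the patch; the function name `layer_norm_grad_ref` is illustrative, and it assumes `var` is the saved variance produced by the forward op):

```python
import numpy as np


def layer_norm_grad_ref(x, scale, mean, var, y_grad, epsilon, begin_norm_axis):
    # Flatten to (batch, normalized) exactly like the composite rule above.
    orig_shape = x.shape
    shape_1 = int(np.prod(orig_shape[:begin_norm_axis], dtype=np.int64))
    shape_2 = int(np.prod(orig_shape[begin_norm_axis:], dtype=np.int64))

    x2 = x.reshape(shape_1, shape_2).astype(np.float32)
    dy = y_grad.reshape(shape_1, shape_2).astype(np.float32)
    mean_ = mean.reshape(shape_1, 1).astype(np.float32)
    var_ = var.reshape(shape_1, 1).astype(np.float32)
    scale_ = (np.ones((1, shape_2), np.float32) if scale is None
              else scale.reshape(1, shape_2).astype(np.float32))

    x_sub_mean = x2 - mean_
    inv_var = 1.0 / (var_ + epsilon)   # "tmp" in the C++ rule
    inv_std = np.sqrt(inv_var)         # "sqrt_var_1"

    d_bias = dy.sum(axis=0)                            # Bias@GRAD
    d_scale = (x_sub_mean * inv_std * dy).sum(axis=0)  # Scale@GRAD

    dy_scaled = dy * scale_                            # out_grad_scale
    dx_end = inv_std * dy_scaled
    d_mean = (-dx_end).sum(axis=1, keepdims=True) / shape_2
    d_std = ((-inv_var * x_sub_mean * dy_scaled).sum(axis=1, keepdims=True)
             * (inv_std / shape_2) * x_sub_mean)
    d_x = (dx_end + d_mean + d_std).reshape(orig_shape)
    return d_x, d_scale, d_bias
```

The `.astype(np.float32)` calls play the role of the FP16-to-FP32 casts in the C++ rule, and the three terms `dx_end`, `d_mean`, and `d_std` correspond one-to-one to the variables of the same names above.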

paddle/phi/api/yaml/legacy_backward.yaml

Lines changed: 1 addition & 0 deletions
@@ -629,6 +629,7 @@
   kernel :
     func : layer_norm_grad
     data_type : out_grad
+  composite : layer_norm_grad(x, scale, bias, mean,varience, out_grad, epsilon, begin_norm_axis, x_grad, scale_grad, bias_grad)
   no_need_buffer : bias
   optional : scale, bias

paddle/phi/infermeta/ternary.cc

Lines changed: 10 additions & 1 deletion
@@ -574,14 +574,23 @@ void LayerNormInferMeta(const MetaTensor& x,
             right));
   }

+  phi::DataType x_dtype = x.dtype();
   out->set_dims(x_dim);
+  out->set_dtype(x_dtype);
+  out->share_lod(x);
+
+  phi::DataType param_type =
+      (x_dtype == phi::DataType::BFLOAT16 || x_dtype == phi::DataType::FLOAT16)
+          ? phi::DataType::FLOAT32
+          : x_dtype;
   if (mean) {
     mean->set_dims({left});
+    mean->set_dtype(param_type);
   }
   if (variance) {
     variance->set_dims({left});
+    variance->set_dtype(param_type);
   }
-  out->share_lod(x);
 }

 void LayerNormGradInferMeta(const MetaTensor& x,

python/paddle/fluid/tests/unittests/dygraph_to_static/test_bert.py

Lines changed: 2 additions & 0 deletions
@@ -237,10 +237,12 @@ def test_train(self):

     def test_train_composite(self):
         core._set_prim_backward_enabled(True)
+        # core._add_skip_comp_ops("layer_norm")
         static_loss, static_ppl = self.train_static(
             self.bert_config, self.data_reader
         )
         core._set_prim_backward_enabled(False)
+        # core._add_skip_comp_ops("layer_norm")
         dygraph_loss, dygraph_ppl = self.train_dygraph(
             self.bert_config, self.data_reader
         )
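For context, the switch exercised by this test can be flipped around any dygraph-to-static program. A minimal sketch of how the composite layer_norm backward gets picked up (not from the patch: `TinyNet` and the shapes are illustrative, `core._set_prim_backward_enabled` is the internal flag used in test_bert.py, and exact behavior may differ by Paddle version):

```python
import paddle
from paddle.fluid import core


class TinyNet(paddle.nn.Layer):
    def __init__(self):
        super().__init__()
        self.norm = paddle.nn.LayerNorm(64)
        self.fc = paddle.nn.Linear(64, 10)

    def forward(self, x):
        return self.fc(self.norm(x))


core._set_prim_backward_enabled(True)    # decompose backward ops, incl. layer_norm_grad
net = paddle.jit.to_static(TinyNet())    # composite rules apply to the static graph
loss = net(paddle.randn([8, 64])).mean()
loss.backward()                          # runs the decomposed layer_norm gradient
core._set_prim_backward_enabled(False)   # restore the fused grad kernel
```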
