[pnorm] optimize p_norm for special cases #37685
Conversation
Thanks for your contribution!
…op for flexible call.
@Avin0323 Hi, could you please take a look at the benchmark CI? This PR changes the p_norm code and optimizes performance for special shapes; the CI results show that all p_norm test times have decreased. It also modifies the cmake file, so please review that as well.
HOSTDEVICE explicit inline NonzeroFunctor(int n) {}
template <typename T>
HOSTDEVICE inline T operator()(const T& x) const {
  return static_cast<T>(static_cast<double>(x) != 0);
Why cast x to double first with static_cast<double>(x)?
This keeps the original implementation.
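For context, a minimal host-only sketch of the functor above (the HOSTDEVICE macro and the surrounding reduction are assumed): the functor maps each element to 1 or 0, so a subsequent sum yields the count of nonzero entries, i.e. the p = 0 case of p_norm. Going through double first keeps the != 0 comparison well defined for low-precision element types without extra overloads.

#include <cstdio>

// Sketch of NonzeroFunctor: maps x to 1 if x != 0, else 0.
struct NonzeroFunctor {
  explicit inline NonzeroFunctor(int n) {}  // n unused; mirrors the original signature
  template <typename T>
  inline T operator()(const T& x) const {
    // Cast to double first so the comparison is well defined even for
    // low-precision types (e.g. float16).
    return static_cast<T>(static_cast<double>(x) != 0);
  }
};

int main() {
  NonzeroFunctor f(0);
  const float vals[] = {0.0f, 1.5f, -2.0f, 0.0f};
  float count = 0;
  for (float v : vals) count += f(v);  // summing gives the 0-"norm"
  std::printf("nonzero count = %g\n", count);  // prints 2
  return 0;
}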
paddle/fluid/operators/p_norm_op.cu
Outdated
auto xdim = in_x->dims();
auto ndim = out_norm->dims();
float porder = ctx.Attr<float>("porder");
int axis = ctx.Attr<int>("axis");
bool asvector = ctx.Attr<bool>("asvector");
if (axis < 0) axis = xdim.size() + axis;
int pre, n, post;
GetDims(xdim, axis, &pre, &n, &post, asvector);
std::vector<int> reduce_axis = {axis};

auto& dev_ctx = ctx.cuda_device_context();
dev_ctx has no usage in this function?
Yes, I will submit a follow-up PR to remove it.
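Stepping back to the GetDims call above: here is a sketch of the pre/n/post decomposition it appears to compute (an assumption inferred from how the values are used, not the actual Paddle implementation). The tensor is viewed as [pre, n, post], where n is the length of the reduced axis, and asvector collapses the whole tensor into one vector.

#include <cstdio>
#include <vector>

// Hypothetical stand-in for GetDims: factor the shape around `axis` into
// pre (product of leading dims), n (the reduced dim), post (trailing dims).
void GetDimsSketch(const std::vector<int>& xdim, int axis, int* pre, int* n,
                   int* post, bool asvector) {
  *pre = 1;
  *n = 1;
  *post = 1;
  if (asvector) {
    for (int d : xdim) *n *= d;  // treat the whole tensor as one vector
    return;
  }
  for (int i = 0; i < axis; ++i) *pre *= xdim[i];
  *n = xdim[axis];
  for (int i = axis + 1; i < static_cast<int>(xdim.size()); ++i) *post *= xdim[i];
}

int main() {
  int pre, n, post;
  GetDimsSketch({2, 1000, 1000}, /*axis=*/0, &pre, &n, &post, false);
  // Special case (1) from this PR: pre=1, n=2, post=1000000.
  std::printf("pre=%d n=%d post=%d\n", pre, n, post);
  return 0;
}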
} else {
  Pnorm<T, block><<<grid, block, 0, dev_ctx.stream()>>>(x, pre, n, post,
                                                        porder, norm);
framework::Tensor tmp_x;
Noting a TODO: the tmp_x here should be removed as soon as possible, since it significantly increases GPU memory usage at runtime.
OK, a TODO card has been filed.
auto negs = dx->constant(static_cast<T>(-1.));
auto zeros = dx->constant(static_cast<T>(0.));
auto positives = (*x) > zeros;
dx->device(place) = dy->broadcast(dim) * equals.select(ones, zeros) *
Does the backward pass here go entirely through Eigen?
Yes, the computation uses Eigen tensors.
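To illustrate the idiom in the snippet above, here is a minimal standalone Eigen Tensor example of building sign(x) from comparisons and select (a sketch of the pattern, not the kernel's exact expression; it assumes Eigen's unsupported Tensor module is available):

#include <unsupported/Eigen/CXX11/Tensor>
#include <iostream>

int main() {
  Eigen::Tensor<float, 1> x(4);
  x.setValues({-3.f, 0.f, 2.f, -1.f});

  Eigen::Tensor<float, 1> ones(4), zeros(4), negs(4);
  ones.setConstant(1.f);
  zeros.setConstant(0.f);
  negs.setConstant(-1.f);

  // sign(x): +1 where x > 0, -1 where x < 0, 0 elsewhere. A boolean
  // comparison expression exposes select(then, else), as in the kernel.
  Eigen::Tensor<float, 1> sign =
      (x > zeros).select(ones, (x < zeros).select(negs, zeros));
  std::cout << sign << std::endl;  // -1 0 1 -1
  return 0;
}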
@@ -260,32 +254,38 @@ class PnormGradCUDAKernel : public framework::OpKernel<T> {
float porder = ctx.Attr<float>("porder");
T eps = static_cast<T>(ctx.Attr<float>("epsilon"));
int axis = ctx.Attr<int>("axis");
bool reduce_all = ((axis < 0) || (in_norm->numel() == 1));
Does axis < 0 correspond to reduce_all?
Yes.
paddle/fluid/operators/p_norm_op.cu
Outdated
bool asvector = ctx.Attr<bool>("asvector");
if (axis < 0) axis = xdim.size() + axis;
int pre, n, post;
GetDims(xdim, axis, &pre, &n, &post, asvector);
const std::vector<int> dims = {axis};

auto& dev_ctx = ctx.cuda_device_context();
Is dev_ctx still used anywhere?
Removed.
LGTM for PR-CI-OP-benchmark and the changes to unity_build_rule.cmake.
LGTM
PR types
Performance optimization
PR changes
OPs
Describe
Optimize p_norm for two kinds of special cases:
(1) shape=[2, 1000, 1000], reduce axis=0
(2) shape=[1, 2000000, 1], reduce axis=1
The baseline version is paddlepaddle-gpu == 2.2.1. Time denotes seconds per 1k steps.
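As a rough illustration of why case (1) benefits from a specialized path (a hedged sketch, not the PR's actual kernel): with shape [2, 1000, 1000] reduced over axis 0, the decomposition is pre = 1, n = 2, post = 1000000, so each output element reduces only two inputs. One thread per output element with a short serial loop over n keeps loads coalesced and avoids a full block-wide reduction:

#include <cstdio>

// Hypothetical kernel: one thread per output element; each thread serially
// accumulates |x|^p over the short reduced axis n, then takes the 1/p root.
__global__ void PnormSmallAxisSketch(const float* x, float* out, int pre,
                                     int n, int post, float porder) {
  int idx = blockIdx.x * blockDim.x + threadIdx.x;
  if (idx >= pre * post) return;
  int i = idx / post, k = idx % post;
  float sum = 0.f;
  for (int j = 0; j < n; ++j)  // n is small (2 in special case (1))
    sum += powf(fabsf(x[(i * n + j) * post + k]), porder);
  out[idx] = powf(sum, 1.f / porder);
}

int main() {
  const int pre = 1, n = 2, post = 8;  // toy sizes; the PR case has post = 1e6
  float hx[pre * n * post], hout[pre * post];
  for (int i = 0; i < pre * n * post; ++i) hx[i] = 1.f;
  float *dx, *dout;
  cudaMalloc(&dx, sizeof(hx));
  cudaMalloc(&dout, sizeof(hout));
  cudaMemcpy(dx, hx, sizeof(hx), cudaMemcpyHostToDevice);
  PnormSmallAxisSketch<<<1, 256>>>(dx, dout, pre, n, post, 2.f);
  cudaMemcpy(hout, dout, sizeof(hout), cudaMemcpyDeviceToHost);
  std::printf("out[0] = %f\n", hout[0]);  // sqrt(1 + 1) ~= 1.414214
  cudaFree(dx);
  cudaFree(dout);
  return 0;
}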