Speed up elemwise grad #8402

Merged

tonyyang-svail merged 4 commits into PaddlePaddle:develop from reyoung:feature/fix_elemwise_grad

Feb 22, 2018

Collaborator

reyoung commented Feb 12, 2018


          Speed up elemwise grad

reyoung changed the title from "Speed up elemwise grad" to "[WIP] Speed up elemwise grad"

reyoung requested review from chengduoZH, dzhwinter and tonyyang-svail

February 12, 2018 07:42

chengduoZH reviewed

Contributor

chengduoZH left a comment

I think elementwise_add_grad could be implemented with a matrix multiplication. That might be faster.
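
One possible reading of this suggestion (an assumption; the comment gives no details): when y of width w is broadcast over an h × w input x, the gradient with respect to y is a column sum of dout. That sum equals a product with an all-ones vector, so it could in principle be handed to a GEMV/GEMM routine. The symbols h, w and the ones vector are introduced here for illustration only.

```latex
\begin{aligned}
\mathrm{out}_{ij} &= x_{ij} + y_j, \\
dx_{ij} &= d\mathrm{out}_{ij}, \\
dy_j &= \sum_{i=1}^{h} d\mathrm{out}_{ij}
      = \left(\mathbf{1}_h^{\top}\, d\mathrm{out}\right)_j .
\end{aligned}
```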

paddle/fluid/operators/elementwise_add_op.h

-                  }
-                }
+              struct IdentityGrad {
+                HOSTDEVICE T operator()(T x, T y, T out, T dout) const { return dout; }

Contributor

chengduoZH Feb 12, 2018

Add inline

Collaborator Author

reyoung Feb 22, 2018

Actually, whether to inline is decided by the compiler.

paddle/fluid/operators/elementwise_op_function.h

+                do {
+                  int x_offset = i * w + j;
+                  if (dx) {
+                    dx[x_offset] = dx_op(x[x_offset], y[j], out[x_offset], dout[x_offset]);

Contributor

chengduoZH Feb 22, 2018

I wonder whether this will be faster than before. For elementwise_add_grad, computing dx only uses dout, but line 374 will cause the unused data (x, y, out) to be loaded from graphics memory into registers.

Collaborator Author

reyoung Feb 22, 2018

nvcc can optimize away the memory access if the functor does not use the variable.

I just checked this by reading the generated PTX file.

tonyyang-svail Feb 22, 2018

> I just checked this by reading the generated PTX file.

cool...
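
To make the point in this thread concrete, here is a minimal host-side C++ sketch of the functor pattern (not the PR's CUDA code). `IdentityGrad` mirrors the functor in the diff; `MulGradDX` and `ElemwiseGradDX` are hypothetical names added for illustration. Because `IdentityGrad` never reads `x`, `y`, or `out`, an optimizing compiler that inlines the call is free to drop those loads, which is the behavior reyoung reports seeing in the PTX.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Gradient functor for out = x + y: dx = dout.
// x, y and out are accepted for a uniform signature but never read.
template <typename T>
struct IdentityGrad {
  T operator()(T /*x*/, T /*y*/, T /*out*/, T dout) const { return dout; }
};

// Hypothetical contrast case for out = x * y: dx = dout * y.
template <typename T>
struct MulGradDX {
  T operator()(T /*x*/, T y, T /*out*/, T dout) const { return dout * y; }
};

// Hypothetical host-side helper: applies a gradient functor to each element.
template <typename T, typename DXOp>
std::vector<T> ElemwiseGradDX(const std::vector<T>& x, const std::vector<T>& y,
                              const std::vector<T>& out,
                              const std::vector<T>& dout, DXOp dx_op) {
  assert(x.size() == y.size() && x.size() == out.size() &&
         x.size() == dout.size());
  std::vector<T> dx(x.size());
  for (std::size_t i = 0; i < x.size(); ++i) {
    // After inlining, only the arguments the functor reads need to be loaded.
    dx[i] = dx_op(x[i], y[i], out[i], dout[i]);
  }
  return dx;
}
```

Keeping one four-argument signature for every functor lets a single kernel serve all elementwise gradients; the cost of the unused arguments depends on the compiler eliminating them, so checking the generated code, as done above, is a reasonable precaution.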

paddle/fluid/operators/elementwise_op_function.h Outdated

+                    shm[tid] += dy_op(x[x_offset], y[j], out[x_offset], dout[x_offset]);
+                  }
+                  i += 1024;
+                } while (i < h);

Contributor

chengduoZH Feb 22, 2018

Lines 378~380 are confusing.
It seems 1024 should be blockDim.x.
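
The concern can be illustrated with a host-side C++ simulation of the pattern in the hunk (a sketch, not the PR's kernel). Each simulated thread `tid` walks rows `tid, tid + block_dim, ...` of one column and accumulates into `shm[tid]`, then a tree reduction combines the partial sums. The loop stride has to equal the number of threads actually launched, so a literal 1024 is only correct when the block really has 1024 threads. `kBlockDim` and `ColumnSumBlockStride` are hypothetical names, and the block size of 4 is chosen only to keep the example small.

```cpp
#include <cassert>
#include <vector>

// Hypothetical stand-in for the number of threads per block (a power of two).
constexpr int kBlockDim = 4;

// Computes dy[j] = sum_i dout[i * w + j] for a row-major h x w gradient,
// i.e. the dy of out = x + y with y of width w broadcast over the rows.
std::vector<float> ColumnSumBlockStride(const std::vector<float>& dout, int h,
                                        int w) {
  assert(h > 0 && w > 0);
  assert(static_cast<int>(dout.size()) == h * w);
  std::vector<float> dy(w, 0.0f);
  for (int j = 0; j < w; ++j) {  // one simulated block per column
    float shm[kBlockDim] = {0.0f};  // simulated shared memory
    for (int tid = 0; tid < kBlockDim; ++tid) {  // simulated threads
      // Block-stride loop: the step must match the thread count.
      for (int i = tid; i < h; i += kBlockDim) {
        shm[tid] += dout[i * w + j];
      }
    }
    // Tree reduction over the per-thread partial sums.
    for (int stride = kBlockDim / 2; stride > 0; stride /= 2) {
      for (int tid = 0; tid < stride; ++tid) {
        shm[tid] += shm[tid + stride];
      }
    }
    dy[j] = shm[0];
  }
  return dy;
}
```

If the stride were larger than the thread count, some rows would never be visited; if smaller, some rows would be counted twice. Naming the constant once, as the later "Add macro for MAX_BLOCK_DIM" commit appears to do, keeps the launch size and the stride in sync.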

reyoung added 3 commits

February 22, 2018 14:33


          Merge branch 'develop' of https://github.com/PaddlePaddle/Paddle into…

97d2b25

… feature/fix_elemwise_grad


          Fix bug

ff3c897


          Add macro for MAX_BLOCK_DIM

cad4d76

chengduoZH reviewed

paddle/fluid/operators/elementwise_op_function.h

+                shm[tid] = 0;
+                do {
+                  int x_offset = i * w + j;

Contributor

chengduoZH Feb 22, 2018

Access to the data (x, dx, dout) is not contiguous. This may hurt performance.

Collaborator Author

reyoung Feb 22, 2018

Indeed. However, this access pattern makes the reduction easier. There could be a more efficient implementation.

paddle/fluid/operators/elementwise_op_function.h

+                while (true) {
+                  int i = ttid / post;
+                  int k = ttid % post;

Contributor

chengduoZH Feb 22, 2018

Division is very time-consuming; multiplying by the reciprocal is recommended:

float inv_post = 1.0/post;
while(true){
  int i = ttid * inv_post;
  int k = ttid - i * post;
  ...

Collaborator Author

reyoung Feb 22, 2018

I am not sure which implementation is faster: multiplication of float values or division of integers. However, these lines should not cost much time, since they are not the main logic of the method.
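
Apart from speed, the two variants are not always equivalent. The host-side C++ sketch below compares them; `Index2`, `SplitByDivision`, and `SplitByReciprocal` are hypothetical names, and the float version follows the snippet suggested above. A float has a 24-bit significand, so once `ttid` exceeds 2^24 the reciprocal form can produce a different `i` and an out-of-range `k`. This says nothing about which form is faster on a GPU.

```cpp
#include <cassert>

// Hypothetical pair holding the decomposed index.
struct Index2 {
  int i;
  int k;
};

// Exact decomposition using integer division and modulo.
inline Index2 SplitByDivision(int ttid, int post) {
  assert(post > 0 && ttid >= 0);
  return {ttid / post, ttid % post};
}

// Reciprocal-multiply variant, following the reviewer's snippet.
// The int -> float conversion and the product are rounded to single
// precision, so the result is only reliable for small enough ttid.
inline Index2 SplitByReciprocal(int ttid, int post) {
  assert(post > 0 && ttid >= 0);
  const float inv_post = 1.0f / static_cast<float>(post);
  const int i = static_cast<int>(static_cast<float>(ttid) * inv_post);
  return {i, ttid - i * post};
}
```

For small indices such as `ttid = 17, post = 5` both functions give `i = 3, k = 2`. For `ttid = 16777217` (2^24 + 1) and `post = 1`, the conversion to float rounds the index to 16777216, so the reciprocal form yields `k = 1`, which is outside `[0, post)`.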

paddle/fluid/operators/elementwise_op_function.h

+                  int k = ttid % post;
+                  if (i >= pre) break;
+                  int x_offset = i * n * post + j * post + k;

Contributor

chengduoZH Feb 22, 2018

int x_offset = i * n * post + j * post + k;

==>

int x_offset = (i * n + j) * post + k;

Collaborator Author

reyoung Feb 22, 2018

The compiler should optimize this expression.
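
For readers unfamiliar with the `pre`/`n`/`post` indexing in this hunk, here is a host-side C++ reference sketch. It assumes (this is not stated in the thread) that dout is a row-major `[pre, n, post]` array and that y of length n is broadcast along the middle axis. `AddGradDyReference` is a hypothetical name. The two offset expressions from the comments above are checked against each other inside the loop.

```cpp
#include <cassert>
#include <vector>

// Reference dy for out = x + y, assuming dout is row-major [pre, n, post]
// and y has length n: dy[j] = sum over i and k of dout[i][j][k].
std::vector<float> AddGradDyReference(const std::vector<float>& dout, int pre,
                                      int n, int post) {
  assert(pre > 0 && n > 0 && post > 0);
  assert(static_cast<int>(dout.size()) == pre * n * post);
  std::vector<float> dy(n, 0.0f);
  for (int j = 0; j < n; ++j) {
    for (int i = 0; i < pre; ++i) {
      for (int k = 0; k < post; ++k) {
        const int expanded = i * n * post + j * post + k;  // form in the diff
        const int factored = (i * n + j) * post + k;       // suggested form
        assert(expanded == factored);  // algebraically identical
        dy[j] += dout[factored];
      }
    }
  }
  return dy;
}
```

The factored form saves one multiplication on paper; whether that matters after compiler optimization is exactly the open question in the thread.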

reyoung changed the title from "[WIP] Speed up elemwise grad" to "Speed up elemwise grad"

reyoung force-pushed the feature/fix_elemwise_grad branch from 7e3ae7e to cad4d76

February 22, 2018 09:10

tonyyang-svail commented Feb 22, 2018

Some of my benchmark results.

[Images: benchmark results, before and after]

tonyyang-svail approved these changes

tonyyang-svail left a comment

Great work. In my benchmark, the share of time spent in elementwise grad drops from 85.8% to 2.3%.

The alternative solution suggested by @chengduoZH is very valuable. But I would suggest merging this PR first. We can make another PR if later on this 2.3% becomes the bottleneck. 👍

tonyyang-svail merged commit 88c22e9 into PaddlePaddle:develop

dzhwinter mentioned this pull request

"accelerate elementwise_add_grad, add reduce functor" #7961

Closed

chengduoZH mentioned this pull request

[Speed] Refine elementwise sub,div,min,max gradient functor #8820

Merged
