This repository was archived by the owner on Nov 17, 2023. It is now read-only.

Improve pooling #5519

Merged
merged 16 commits into from
Mar 22, 2017

Conversation

reminisce
Contributor

This pull request re-implements the pooling operator to support 1/2/3-D pooling operations without using expression templates, in order to increase performance.

In the tables below:
- Pooling: this implementation.
- PoolingV1: the currently existing operator in MXNet.
- CuDNNPooling: the cuDNN pooling operator.
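For reference on what the tables below measure, here is a minimal NumPy sketch of 1-D max pooling under the 'valid' convention (max_pool_1d is a hypothetical helper written for illustration, not code from this PR):

```python
import numpy as np

def max_pool_1d(data, kernel, stride=1):
    """Naive 1-D max pooling with the 'valid' convention.

    data: array of shape (batch, channel, width).
    Output width = floor((width - kernel) / stride) + 1.
    """
    n, c, w = data.shape
    out_w = (w - kernel) // stride + 1
    out = np.empty((n, c, out_w), dtype=data.dtype)
    for i in range(out_w):
        start = i * stride
        # max over each window along the width axis
        out[:, :, i] = data[:, :, start:start + kernel].max(axis=2)
    return out

x = np.arange(8, dtype=np.float64).reshape(1, 1, 8)
y = max_pool_1d(x, kernel=2, stride=2)
# y[0, 0] == [1., 3., 5., 7.]
```

The Pooling(2D) rows in Table 1 emulate the same computation by appending a singleton axis to the data and kernel, as described in the caption.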

Table 1: max pooling benchmark results. New implementation’s 1D max pooling is compared with the equivalent 2D case.

1D: data=(10, 3, 100000), kernel=(64,)
2D: data=(10, 3, 100000, 1), kernel=(64, 1)

|                   | CPU (ms) | GPU (ms) |
| ----------------- | -------- | -------- |
| Pooling (1D)      | 251      | 7        |
| Pooling (2D)      | 680      | 14       |
| CuDNNPooling (2D) | N/A      | 12       |
| PoolingV1 (2D)    | 4428     | 36       |

Table 2: max pooling 2D benchmark results.

| 2D: data=(10, 3, 100, 100), kernel=(8, 8) | CPU (ms) | GPU (ms) |
| ----------------------------------------- | -------- | -------- |
| Pooling                                   | 144      | 4        |
| CuDNNPooling                              | N/A      | 5        |
| PoolingV1                                 | 639      | 9        |

Table 3: max pooling 3D benchmark results.

| 3D: data=(10, 3, 100, 100, 100), kernel=(8, 8, 8) | CPU (ms) | GPU (ms) |
| ------------------------------------------------- | -------- | -------- |
| Pooling                                           | 2509     | 53       |
| CuDNNPooling                                      | N/A      | 42       |

Table 4: avg pooling benchmark results. New implementation’s 1D avg pooling is compared with the equivalent 2D case.

1D: data=(10, 3, 100000), kernel=(64,)
2D: data=(10, 3, 100000, 1), kernel=(64, 1)

|                   | CPU (ms) | GPU (ms) |
| ----------------- | -------- | -------- |
| Pooling (1D)      | 227      | 13       |
| Pooling (2D)      | 938      | 17       |
| CuDNNPooling (2D) | N/A      | 10       |
| PoolingV1 (2D)    | 2998     | 26       |

Table 5: avg pooling 2D benchmark results.

| 2D: data=(10, 3, 100, 100), kernel=(8, 8) | CPU (ms) | GPU (ms) |
| ----------------------------------------- | -------- | -------- |
| Pooling                                   | 123      | 6        |
| CuDNNPooling                              | N/A      | 7        |
| PoolingV1                                 | 412      | 8        |

Table 6: avg pooling 3D benchmark results.

| 3D: data=(10, 3, 100, 100, 100), kernel=(8, 8, 8) | CPU (ms) | GPU (ms) |
| ------------------------------------------------- | -------- | -------- |
| Pooling                                           | 2301     | 85       |
| CuDNNPooling                                      | N/A      | 60       |

const TShape& stride, const int pool_type, OpReqType req_type,
DType* out_data) {
using namespace mxnet_op;
if (kernel.ndim() == 1) {
Contributor

This is a little too much copy-pasting.
How about changing unpool_sum_?d_gpu_kernel to unpool_sum_gpu_kernel(..., Shape<?>)
and adding int ndim to pool's template arguments?

Contributor

Actually, ignore the previous comment. This is fine.

sum += out_slice[h * width + w];
}
}
KERNEL_ASSIGN(out_data[index], req_type, sum / pool_size);
Contributor

Forward doesn't need req_type support for now; it's almost always writeto.
Add a check in pool and remove this.

Contributor Author

Done!

@piiswrong
Contributor

Please submit another PR to move conv and pooling to src/operator/nn/ after this

@piiswrong
Contributor

Any idea why 2D is that much slower than 1D?

@reminisce
Contributor Author

reminisce commented Mar 21, 2017

For the GPU 1/2-D kernels, one possible reason might be this line and the following atomicAdd call:
https://github.com/dmlc/mxnet/pull/5519/files#diff-b6c5be2d23abce9c69031e26504e7c4fR387

The if condition would cause thread divergence in the 2D case and, consequently, more runtime than in 1D.

unpool_sum has no such condition check for 2/3-D kernels, and there the relative difference between 1D and 2D is not that big.

But if we could have a mask as an invisible output, unpool_max should be faster and need no atomicAdd calls.
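To make the mask idea above concrete: if the forward pass records the argmax positions, the backward (unpool_max) becomes a pure scatter with exactly one writer per input location, so no atomicAdd is needed. A NumPy sketch (illustrative only; max_pool_with_mask and unpool_max are hypothetical names, and the real kernels are CUDA):

```python
import numpy as np

def max_pool_with_mask(x, k):
    """1-D max pooling (stride == kernel) that also records the flat
    argmax positions of each window."""
    n = x.shape[0] // k
    windows = x[:n * k].reshape(n, k)
    # flat input index of the max element in each window
    mask = windows.argmax(axis=1) + np.arange(n) * k
    return windows.max(axis=1), mask

def unpool_max(grad_out, mask, in_size):
    """Backward pass: each output gradient goes to exactly one input
    location, so this is a plain scatter with no write conflicts."""
    grad_in = np.zeros(in_size)
    grad_in[mask] = grad_out
    return grad_in

x = np.array([1., 5., 2., 3., 9., 0.])
y, mask = max_pool_with_mask(x, k=3)
g = unpool_max(np.array([1., 1.]), mask, x.size)
# y == [5., 9.], g == [0., 1., 0., 0., 1., 0.]
```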

@piiswrong
Contributor

Still failing. Not sure if it's an MKL bug or the new pooling op.

@reminisce
Contributor Author

I'm debugging to find the root cause.

@piiswrong
Contributor

@glingyan @zhenlinluo Looks like there is a bug in the mkldnn pooling op when pad > 0.

@glingyan
Contributor

Will look into it today.

@piiswrong
Contributor

We will disable MKL for pad > 0 and merge this first. Please re-enable it when you submit a fix.

@glingyan
Contributor

@piiswrong How do you run the test? A unit test?

@piiswrong
Contributor

Consistency tests against cuDNN.

@reminisce
Contributor Author

@glingyan You can see the test log here. Just run test_pooling_versions() in test_operator_gpu.py.
The failing case is:
pool_type='avg'
pooling_convention='full'
data=(2, 3, 20, 20)
kernel=(4, 5)
pad=(2, 3)
stride=(2, 3)

http://ec2-52-25-96-65.us-west-2.compute.amazonaws.com/blue/organizations/jenkins/mxnet/detail/PR-5519/5/pipeline#step-246-log-362
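For reference, the two pooling conventions differ only in how the output size is rounded; a small sketch with a hypothetical pool_out_size helper (mirroring Caffe-style shape arithmetic, not MXNet source):

```python
import math

def pool_out_size(in_size, kernel, pad, stride, convention):
    """Pooling output size per axis: 'valid' floors, 'full' ceils."""
    span = in_size + 2 * pad - kernel
    if convention == 'valid':
        return span // stride + 1
    return math.ceil(span / stride) + 1  # 'full'

# The failing configuration above, per axis:
h = pool_out_size(20, 4, 2, 2, 'full')  # height axis
w = pool_out_size(20, 5, 3, 3, 'full')  # width axis
```

Under 'full', the last window may extend into the padding, which is where the include-vs-exclude-padding behavior of the backend starts to matter.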

@sxjscience
Member

@piiswrong @reminisce Do we have any ideas for improving the deconvolution as well?

@glingyan
Contributor

@piiswrong @reminisce @sxjscience Deconvolution could be implemented with the MKL convolution API.

@glingyan
Contributor

@reminisce @piiswrong Will try to fix pooling once your patch is merged.

@sxjscience
Member

I find that the current deconvolution does not support 3D deconv (https://github.com/torch/nn/blob/master/doc/convolution.md#nn.VolumetricFullConvolution), which has been used for video generation.

@reminisce
Contributor Author

reminisce commented Mar 22, 2017

@sxjscience I think we can do something similar to what we did for convolution by adopting Caffe's algorithm. The key functions, im2col and col2im, have already been implemented for convolution. We can use them for deconv as well.
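As a sketch of that idea (conv_transpose_1d is a hypothetical NumPy helper, not the proposed implementation): a transposed convolution scatters a scaled copy of the kernel for each input element, which is exactly the overlapping accumulation that col2im performs on the columns produced from the weight matrix:

```python
import numpy as np

def conv_transpose_1d(x, w, stride=1):
    """1-D transposed convolution, no padding.

    Each input element scatters a scaled copy of the kernel into the
    output; overlapping contributions are summed (col2im-style)."""
    k = w.size
    out = np.zeros((x.size - 1) * stride + k)
    for i, v in enumerate(x):
        out[i * stride:i * stride + k] += v * w  # accumulate overlaps
    return out

y = conv_transpose_1d(np.array([1., 2.]), np.array([1., 1., 1.]), stride=2)
# y == [1., 1., 3., 2., 2.]
```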

@sxjscience
Member

@reminisce Yes. If that's not in the plan, I can ask Leo @leezu to help do the job.

@reminisce
Contributor Author

@piiswrong What do you think about @sxjscience 's proposal? If it has higher priority, I can do it.

@piiswrong
Contributor

piiswrong commented Mar 22, 2017

@reminisce Someone should do it... Shouldn't be too hard.
Make sure you handle the bias correctly.

@reminisce
Contributor Author

@sxjscience If you have resources to work on deconvolution, you can lead it. Otherwise, I will set aside time for this in the next couple of weeks.

BTW: the deconvolution implemented here is actually a transposed convolution. TensorFlow has renamed it to conv2d_transpose. Should we take the chance to correct the name?
Refs:
tensorflow/tensorflow#256 (comment)
https://github.com/vdumoulin/conv_arithmetic
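One quick way to see why "transposed convolution" fits: its output-shape map inverts the shape map of the corresponding convolution. A sketch with hypothetical helpers:

```python
def conv_out(in_size, kernel, stride, pad):
    # standard convolution output size
    return (in_size + 2 * pad - kernel) // stride + 1

def conv_transpose_out(in_size, kernel, stride, pad):
    # transposed convolution inverts the shape map above
    return (in_size - 1) * stride - 2 * pad + kernel

# deconv maps 5 -> 10, and the matching conv maps 10 -> 5:
assert conv_transpose_out(5, 4, 2, 1) == 10
assert conv_out(10, 4, 2, 1) == 5
```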

@piiswrong piiswrong merged commit 181cddd into apache:master Mar 22, 2017
@sxjscience
Member

I think we can add an alias to the operator.

@glingyan
Contributor

@piiswrong @reminisce

This problem occurs because, when param.pooling_convention == pool_enum::kFull, MKL's algorithm is the same as CUDNN_POOLING_AVERAGE_COUNT_EXCLUDE_PADDING.

So the correct way to bypass it is:

#if MXNET_USE_MKL2017 == 1
  if (param.kernel.ndim() == 2
      && (param.pooling_convention == pool_enum::kValid)
      && ((param.pool_type == pool_enum::kMaxPooling)
          || (param.pool_type == pool_enum::kAvgPooling))) {
    switch (dtype) {
      case mshadow::kFloat32:
        return new MKLPoolingOp<cpu, float>(param);
      case mshadow::kFloat64:
        return new MKLPoolingOp<cpu, double>(param);
      default:
        break;
    }
  }

@sxjscience
Member

@reminisce We decided to first add the cuDNN version of 3D deconv.

@reminisce
Contributor Author

@glingyan Thanks for finding the root cause. Could you submit a PR to fix it?

@reminisce reminisce deleted the improve_pooling branch March 22, 2017 17:10
@piiswrong
Contributor

@glingyan To be clear, you mean it's a feature unsupported by mkldnn that needs to be bypassed?

@glingyan
Contributor

@piiswrong The current mkl2017 algorithm for pool_enum::kFull is designed for EXCLUDE_PADDING; changing it to INCLUDE_PADDING causes a segmentation fault.
Already reported to the MKL team.
I will test mkl-dnn in the coming days.

@reminisce
Contributor Author

@sxjscience I have some time in the following weeks. I plan to port Caffe's deconv algorithm to MXNet as a low-priority task, which means no deadline promise yet. I can merge it with your cuDNN 3D deconv work after I finish the implementation. Let me know if there are any conflicts. Thanks.

@piiswrong
Contributor

piiswrong commented Mar 27, 2017

@reminisce @sxjscience Please try to reuse code for deconv instead of copy-pasting from conv.

@@ -85,6 +85,10 @@ DMLC_REGISTER_PARAMETER(PoolingParam);
MXNET_REGISTER_OP_PROPERTY(Pooling, PoolingProp)
.describe(R"code(Perform pooling on the input.

The shapes for 1-D pooling are
- **data**: *(batch_size, channel, width)*,
Contributor

Add a blank line before the -, otherwise sphinx fails to recognize it as a list; see http://mxnet.io/api/python/symbol.html#mxnet.symbol.Pooling

Contributor

Also remove this content: "1-D pooling is special case of 2-D pooling with width=1 and kernel[1]=1."

Contributor Author

I will change it in another PR. Thanks for pointing it out.

@leezu
Contributor

leezu commented Mar 29, 2017

@piiswrong @reminisce An initial cuDNN 3D deconv implementation is in #5615. Currently only the algoreg is reused, but if you have further suggestions we can refactor the code.
