This repository was archived by the owner on Nov 17, 2023. It is now read-only.

Improve pooling #5519

Merged
merged 16 commits into from
Mar 22, 2017

Conversation

reminisce
Contributor

This pull request re-implements the pooling operator to support 1/2/3-D pooling operations without using expression templates, in order to increase performance.

In the tables below:
- Pooling: this implementation.
- PoolingV1: the currently existing operator in MXNet.
- CuDNNPooling: the cuDNN pooling operator.
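For reference on what the tables below measure, here is a minimal NumPy sketch of 1-D max pooling under the 'valid' convention (max_pool_1d is a hypothetical helper written for illustration, not code from this PR):

```python
import numpy as np

def max_pool_1d(data, kernel, stride=1):
    """Naive 1-D max pooling with the 'valid' convention.

    data: array of shape (batch, channel, width).
    Output width = floor((width - kernel) / stride) + 1.
    """
    n, c, w = data.shape
    out_w = (w - kernel) // stride + 1
    out = np.empty((n, c, out_w), dtype=data.dtype)
    for i in range(out_w):
        start = i * stride
        # max over each window along the width axis
        out[:, :, i] = data[:, :, start:start + kernel].max(axis=2)
    return out

x = np.arange(8, dtype=np.float64).reshape(1, 1, 8)
y = max_pool_1d(x, kernel=2, stride=2)
# y[0, 0] == [1., 3., 5., 7.]
```

The Pooling(2D) rows in Table 1 emulate the same computation by appending a singleton axis to the data and kernel, as described in the caption.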

Table 1: max pooling benchmark results. New implementation’s 1D max pooling is compared with the equivalent 2D case.

1D: data=(10, 3, 100000), kernel=(64,)
2D: data=(10, 3, 100000, 1), kernel=(64, 1)

|                   | CPU (ms) | GPU (ms) |
| ----------------- | -------- | -------- |
| Pooling (1D)      | 251      | 7        |
| Pooling (2D)      | 680      | 14       |
| CuDNNPooling (2D) | N/A      | 12       |
| PoolingV1 (2D)    | 4428     | 36       |

Table 2: max pooling 2D benchmark results.

| 2D: data=(10, 3, 100, 100), kernel=(8, 8) | CPU (ms) | GPU (ms) |
| ----------------------------------------- | -------- | -------- |
| Pooling                                   | 144      | 4        |
| CuDNNPooling                              | N/A      | 5        |
| PoolingV1                                 | 639      | 9        |

Table 3: max pooling 3D benchmark results.

| 3D: data=(10, 3, 100, 100, 100), kernel=(8, 8, 8) | CPU (ms) | GPU (ms) |
| ------------------------------------------------- | -------- | -------- |
| Pooling                                           | 2509     | 53       |
| CuDNNPooling                                      | N/A      | 42       |

Table 4: avg pooling benchmark results. New implementation’s 1D avg pooling is compared with the equivalent 2D case.

1D: data=(10, 3, 100000), kernel=(64,)
2D: data=(10, 3, 100000, 1), kernel=(64, 1)

|                   | CPU (ms) | GPU (ms) |
| ----------------- | -------- | -------- |
| Pooling (1D)      | 227      | 13       |
| Pooling (2D)      | 938      | 17       |
| CuDNNPooling (2D) | N/A      | 10       |
| PoolingV1 (2D)    | 2998     | 26       |

Table 5: avg pooling 2D benchmark results.

| 2D: data=(10, 3, 100, 100), kernel=(8, 8) | CPU (ms) | GPU (ms) |
| ----------------------------------------- | -------- | -------- |
| Pooling                                   | 123      | 6        |
| CuDNNPooling                              | N/A      | 7        |
| PoolingV1                                 | 412      | 8        |

Table 6: avg pooling 3D benchmark results.

| 3D: data=(10, 3, 100, 100, 100), kernel=(8, 8, 8) | CPU (ms) | GPU (ms) |
| ------------------------------------------------- | -------- | -------- |
| Pooling                                           | 2301     | 85       |
| CuDNNPooling                                      | N/A      | 60       |

const TShape& stride, const int pool_type, OpReqType req_type,
DType* out_data) {
using namespace mxnet_op;
if (kernel.ndim() == 1) {
Contributor

This is a little too much copy-pasting.
How about changing unpool_sum_?d_gpu_kernel to unpool_sum_gpu_kernel(..., Shape<?>)
and adding int ndim to pool's template arguments?

Contributor

Actually, ignore the previous comment. This is fine.

sum += out_slice[h * width + w];
}
}
KERNEL_ASSIGN(out_data[index], req_type, sum / pool_size);
Contributor

Forward doesn't need req_type support for now; it's almost always writeto.
Add a check in pool and remove this.

Contributor Author

Done!

@piiswrong
Contributor

Please submit another PR to move conv and pooling to src/operator/nn/ after this

@piiswrong
Contributor

Any idea why 2D is that much slower than 1D?

@reminisce
Contributor Author

reminisce commented Mar 21, 2017

For the GPU 1/2-D kernels, one possible reason might be this line and the following atomicAdd call:
https://github.com/dmlc/mxnet/pull/5519/files#diff-b6c5be2d23abce9c69031e26504e7c4fR387

The if condition would cause thread divergence in the 2D case and, consequently, more runtime than in 1D.

unpool_sum has no such condition check for 2/3-D kernels, and there the relative difference between 1D and 2D is not that big.

But if we could have a mask as an invisible output, unpool_max should be faster and need no atomicAdd calls.
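To make the mask idea above concrete: if the forward pass records the argmax positions, the backward (unpool_max) becomes a pure scatter with exactly one writer per input location, so no atomicAdd is needed. A NumPy sketch (illustrative only; max_pool_with_mask and unpool_max are hypothetical names, and the real kernels are CUDA):

```python
import numpy as np

def max_pool_with_mask(x, k):
    """1-D max pooling (stride == kernel) that also records the flat
    argmax positions of each window."""
    n = x.shape[0] // k
    windows = x[:n * k].reshape(n, k)
    # flat input index of the max element in each window
    mask = windows.argmax(axis=1) + np.arange(n) * k
    return windows.max(axis=1), mask

def unpool_max(grad_out, mask, in_size):
    """Backward pass: each output gradient goes to exactly one input
    location, so this is a plain scatter with no write conflicts."""
    grad_in = np.zeros(in_size)
    grad_in[mask] = grad_out
    return grad_in

x = np.array([1., 5., 2., 3., 9., 0.])
y, mask = max_pool_with_mask(x, k=3)
g = unpool_max(np.array([1., 1.]), mask, x.size)
# y == [5., 9.], g == [0., 1., 0., 0., 1., 0.]
```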

@piiswrong
Contributor

Still failing. Not sure if it's an MKL bug or the new pooling op.

@reminisce
Contributor Author

I'm debugging to find the root cause.

@piiswrong
Contributor

@glingyan @zhenlinluo Looks like there is a bug in the mkldnn pooling op when pad > 0.

@glingyan
Contributor

Will look into it today.

@piiswrong
Contributor

We will disable MKL for pad > 0 and merge this first. Please re-enable it when you submit a fix.

@glingyan
Contributor

@piiswrong How do you run the test? A unit test?

@piiswrong
Contributor

Consistency tests against cuDNN.

@reminisce
Contributor Author

@glingyan You can see the test log here. Just run test_pooling_versions() in test_operator_gpu.py.
The failing case is:
pool_type='avg'
pooling_convention='full'
data=(2, 3, 20, 20)
kernel=(4, 5)
pad=(2, 3)
stride=(2, 3)

http://ec2-52-25-96-65.us-west-2.compute.amazonaws.com/blue/organizations/jenkins/mxnet/detail/PR-5519/5/pipeline#step-246-log-362
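For reference, the two pooling conventions differ only in how the output size is rounded; a small sketch with a hypothetical pool_out_size helper (mirroring Caffe-style shape arithmetic, not MXNet source):

```python
import math

def pool_out_size(in_size, kernel, pad, stride, convention):
    """Pooling output size per axis: 'valid' floors, 'full' ceils."""
    span = in_size + 2 * pad - kernel
    if convention == 'valid':
        return span // stride + 1
    return math.ceil(span / stride) + 1  # 'full'

# The failing configuration above, per axis:
h = pool_out_size(20, 4, 2, 2, 'full')  # height axis
w = pool_out_size(20, 5, 3, 3, 'full')  # width axis
```

Under 'full', the last window may extend into the padding, which is where the include-vs-exclude-padding behavior of the backend starts to matter.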

@sxjscience
Member

@piiswrong @reminisce Do we have any ideas for improving the deconvolution as well?

@glingyan
Contributor

@piiswrong @reminisce @sxjscience Deconvolution could be implemented with the MKL convolution API.

@glingyan
Contributor

@reminisce @piiswrong Will try to fix pooling once your patch is merged.

@sxjscience
Member

I find that the current deconvolution does not support 3D deconv (https://github.com/torch/nn/blob/master/doc/convolution.md#nn.VolumetricFullConvolution), which has been used for video generation.

@reminisce
Contributor Author

reminisce commented Mar 22, 2017

@sxjscience I think we can do something similar to what we did for convolution by adopting Caffe's algorithm. The key functions, im2col and col2im, have already been implemented for convolution. We can use them for deconv as well.
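As a sketch of that idea (conv_transpose_1d is a hypothetical NumPy helper, not the proposed implementation): a transposed convolution scatters a scaled copy of the kernel for each input element, which is exactly the overlapping accumulation that col2im performs on the columns produced from the weight matrix:

```python
import numpy as np

def conv_transpose_1d(x, w, stride=1):
    """1-D transposed convolution, no padding.

    Each input element scatters a scaled copy of the kernel into the
    output; overlapping contributions are summed (col2im-style)."""
    k = w.size
    out = np.zeros((x.size - 1) * stride + k)
    for i, v in enumerate(x):
        out[i * stride:i * stride + k] += v * w  # accumulate overlaps
    return out

y = conv_transpose_1d(np.array([1., 2.]), np.array([1., 1., 1.]), stride=2)
# y == [1., 1., 3., 2., 2.]
```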

@sxjscience
Member

@reminisce Yes. If that's not in the plan, I can ask Leo @leezu to help do the job.

@reminisce
Contributor Author

@piiswrong What do you think about @sxjscience 's proposal? If it has higher priority, I can do it.

@piiswrong
Contributor

piiswrong commented Mar 22, 2017

@reminisce Someone should do it... Shouldn't be too hard.
Make sure you handle the bias correctly.

@reminisce
Contributor Author

@sxjscience If you have resources to work on deconvolution, you can lead it. Otherwise, I will set aside time for this in the next couple of weeks.

BTW: the deconvolution implemented here is actually a transposed convolution. TensorFlow has renamed it to conv2d_transpose. Should we take the chance to correct the name?
Refs:
tensorflow/tensorflow#256 (comment)
https://github.com/vdumoulin/conv_arithmetic
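One quick way to see why "transposed convolution" fits: its output-shape map inverts the shape map of the corresponding convolution. A sketch with hypothetical helpers:

```python
def conv_out(in_size, kernel, stride, pad):
    # standard convolution output size
    return (in_size + 2 * pad - kernel) // stride + 1

def conv_transpose_out(in_size, kernel, stride, pad):
    # transposed convolution inverts the shape map above
    return (in_size - 1) * stride - 2 * pad + kernel

# deconv maps 5 -> 10, and the matching conv maps 10 -> 5:
assert conv_transpose_out(5, 4, 2, 1) == 10
assert conv_out(10, 4, 2, 1) == 5
```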

@piiswrong piiswrong merged commit 181cddd into apache:master Mar 22, 2017
@sxjscience
Member

I think we can add an alias to the operator.

@glingyan
Contributor

@piiswrong @reminisce

This problem occurs because, when param.pooling_convention == pool_enum::kFull, MKL's algorithm is the same as CUDNN_POOLING_AVERAGE_COUNT_EXCLUDE_PADDING.

So the correct way to bypass it is:

#if MXNET_USE_MKL2017 == 1
  if (param.kernel.ndim() == 2
      && (param.pooling_convention == pool_enum::kValid)
      && ((param.pool_type == pool_enum::kMaxPooling)
          || (param.pool_type == pool_enum::kAvgPooling))) {
    switch (dtype) {
      case mshadow::kFloat32:
        return new MKLPoolingOp<cpu, float>(param);
      case mshadow::kFloat64:
        return new MKLPoolingOp<cpu, double>(param);
      default:
        break;
    }
  }

@sxjscience
Member

@reminisce We decided to first add the cuDNN version of 3D deconv.

@reminisce
Contributor Author

@glingyan Thanks for finding the root cause. Could you submit a PR to fix it?

@reminisce reminisce deleted the improve_pooling branch March 22, 2017 17:10
@piiswrong
Contributor

@glingyan To be clear, you mean it's a feature unsupported by mkldnn that needs to be bypassed?

@glingyan
Contributor

@piiswrong The current mkl2017 algorithm for pool_enum::kFull is designed for EXCLUDE_PADDING; changing it to INCLUDE_PADDING causes a segmentation fault.
Already reported to the MKL team.
I will test mkl-dnn in the coming days.

@reminisce
Contributor Author

@sxjscience I have some time in the following weeks. I plan to port Caffe's deconv algorithm to MXNet as a low-priority task, which means no deadline promise yet. I can merge it with your cuDNN 3D deconv work after I finish the implementation. Let me know if there are any conflicts. Thanks.

@piiswrong
Contributor

piiswrong commented Mar 27, 2017

@reminisce @sxjscience Please try to reuse code for deconv instead of copy-pasting from conv.

@@ -85,6 +85,10 @@ DMLC_REGISTER_PARAMETER(PoolingParam);
MXNET_REGISTER_OP_PROPERTY(Pooling, PoolingProp)
.describe(R"code(Perform pooling on the input.

The shapes for 1-D pooling are
- **data**: *(batch_size, channel, width)*,
Contributor

Add a blank line before the -, otherwise sphinx fails to recognize it as a list; see http://mxnet.io/api/python/symbol.html#mxnet.symbol.Pooling

Contributor

Also remove this content: "1-D pooling is special case of 2-D pooling with width=1 and kernel[1]=1."

Contributor Author

I will change it in another PR. Thanks for pointing it out.

@leezu
Contributor

leezu commented Mar 29, 2017

@piiswrong @reminisce An initial cuDNN 3D deconv implementation is in #5615. Currently only the algoreg is reused, but if you have further suggestions we can refactor the code.
