
Conversation

@hedaoyuan (Contributor) commented Aug 28, 2017

This depthwise convolution optimization was discussed with @NHZlX. It is based on the ARM NEON instruction set and can also be extended to the x86 SSE and AVX instruction sets.

The optimized logic: if the output size is greater than 4, then each step calculates four elements of the output.
For example, with a 3x3 convolution filter, nine instructions calculate four elements of the output:

Output[0, 1, 2, 3]   = R0[0, 1, 2, 3] * K0[0]
Output[0, 1, 2, 3] += R0[1, 2, 3, 4] * K0[1]
Output[0, 1, 2, 3] += R0[2, 3, 4, 5] * K0[2]
Output[0, 1, 2, 3] += R1[0, 1, 2, 3] * K1[0]
Output[0, 1, 2, 3] += R1[1, 2, 3, 4] * K1[1]
Output[0, 1, 2, 3] += R1[2, 3, 4, 5] * K1[2]
Output[0, 1, 2, 3] += R2[0, 1, 2, 3] * K2[0]
Output[0, 1, 2, 3] += R2[1, 2, 3, 4] * K2[1]
Output[0, 1, 2, 3] += R2[2, 3, 4, 5] * K2[2]

Another implementation requires 4 instructions to calculate one element of the output (three multiply-accumulates plus a horizontal sum). This method is slower than the previous one but can be used to calculate the remainder of the output:

V   = R0[0, 1, 2, x] * K0[0, 1, 2, x]
V += R1[0, 1, 2, x] * K1[0, 1, 2, x]
V += R2[0, 1, 2, x] * K2[0, 1, 2, x]
Output[0] = SUM(V)

@hedaoyuan changed the title from "Convolution" to "Depthwise Convolution Optimization" Aug 28, 2017
@hedaoyuan requested a review from NHZlX Aug 28, 2017 09:51
@NHZlX (Contributor) left a comment


https://github.com/PaddlePaddle/Paddle/blob/develop/paddle/function/DepthwiseConvOp.cpp#L21 Can the function here be removed now? And shouldn't some checks be added as well, for example, that the device must be GPU?

const float*, const float*, int, int, int, int, int, int, float*)>
DepthWiseConv;

if (filterWidth == 3 && strideW() == 1) {
Contributor

I think the most naive implementation should be added here as well, and I think an implementation like https://github.com/NHZlX/Paddle/blob/mobilenet_neon/paddle/function/neon/DepthwiseConvCpu.h#L98 would be better.

@hedaoyuan (Contributor, Author) commented Aug 29, 2017

The naive implementation should not be added here. If it were, then whenever the optimized implementation is not supported we would fall through to the naive one; but in that case it is actually better to fall back to the GemmConv implementation instead. Besides, NaiveConv already has its own Function implementation, and ConvLayer can decide which branch to take.

@NHZlX (Contributor) commented Aug 29, 2017

LGTM

NHZlX previously approved these changes Aug 29, 2017
@hedaoyuan hedaoyuan merged commit b45d020 into PaddlePaddle:develop Aug 30, 2017
heavengate pushed a commit to heavengate/Paddle that referenced this pull request Aug 16, 2021