Improve speed of transformPointCloud/WithNormals() functions #2247
Conversation
```cpp
void so3 (const float* src, float* tgt) const
```
This feels misleading. You're taking single-precision float pointers at the interface level and then treating them as doubles in the implementation. Why not simply define them as double and make equivalent use of `_mm_load_pd1`? According to these guys, it's part of the SSE2 intrinsics.
But these are point coordinates, e.g. the `data[]` member of `PointXYZ`. They are always single-precision.
Makes sense 👍
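To make the point concrete, here is a minimal sketch (assumed names and matrix layout, not the PR's actual code) of the pattern under discussion: the interface takes float pointers because point coordinates are single-precision, and the double-precision path widens each coordinate with `_mm_cvtps_pd`; `_mm_load_pd1` would only apply if the source data were doubles.

```cpp
#include <emmintrin.h>  // SSE2 (pulls in SSE as well)

// c0, c1, c2 hold the first two rows of the rotation matrix, one column each
// (layout assumed for illustration).
static void so3_double_sketch (const float* src, float* tgt,
                               __m128d c0, __m128d c1, __m128d c2)
{
  // Broadcast each single-precision coordinate, then widen to packed doubles.
  const __m128d x = _mm_cvtps_pd (_mm_load_ps1 (&src[0]));  // {x, x}
  const __m128d y = _mm_cvtps_pd (_mm_load_ps1 (&src[1]));  // {y, y}
  const __m128d z = _mm_cvtps_pd (_mm_load_ps1 (&src[2]));  // {z, z}
  // Two rows of R * p computed with packed double arithmetic.
  const __m128d p = _mm_add_pd (_mm_add_pd (_mm_mul_pd (x, c0),
                                            _mm_mul_pd (y, c1)),
                                _mm_mul_pd (z, c2));
  // Narrow back to single precision and store the two results.
  const __m128 r = _mm_cvtpd_ps (p);
  _mm_store_ss (&tgt[0], r);
  _mm_store_ss (&tgt[1], _mm_shuffle_ps (r, r, 1));
}
```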
I have three machines on which I'm interested in giving this benchmark a try. I'll report results later.
CPU: Intel(R) Core(TM) i7-7700 CPU @ 3.60GHz (benchmark results attached for the None, Native, SSE2, and NO-SSE2 flag configurations)
CPU: Intel(R) Core(TM) i7-4960HQ CPU @ 2.60GHz (benchmark results attached for the None, Native, and SSE2 flag configurations)
I don't like your Mac results! They make no sense. How can SSE code be slower? Can you check the disassembly? Note that this file is overwritten by the compilation of every "bench_transform_xxx" target.
Oh, and by the way: unless you changed line 12 in "bench_transforms.cpp", you are benchmarking double-precision transforms! Apparently I forgot to switch back from double.
The weirdest thing for me is how a 4-year-old laptop reports better results than a high-end desktop computer from last year in the D_XYZ_NC test of the original PCL implementation. I'll dig through the assembly tomorrow.
I pushed a few commits to the benchmarking repo.
Files from the desktop: tl;dr, it looks good. I eyeball a ~0.8 fraction of the baseline on average for all flags/precisions. NO-SSE yields no performance improvement/degradation with doubles.
Files from the laptop: tl;dr, it still reports abnormally fast baseline values for the D_XYZ_NC tests. Everything is looking green in single floating-point precision; eyeballing the average fraction of the baseline, I would say ~0.85. For double precision all dense tests are slower; the sparse ones are running green.
I had a look at your benchmarking results for single precision. The improvement is not that great on the Mac, but still there is some. The assembly for the optimized functions looks similar on your machines, though I would say that the …
The Mac laptop is already a few years old, so FMA might not be available.
On a different PR, or still on this one?
Oh no, FMA is kind of ancient, think 200x.
This one; we should not check in code that worsens performance ;)
My comment was derived from the info here: https://en.wikipedia.org/wiki/FMA_instruction_set#CPUs_with_FMA3
Mine from the info here: https://en.wikipedia.org/wiki/FMA_instruction_set#History 😆
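For reference, GCC and Clang expose both a compile-time macro and a runtime builtin for checking FMA support; a quick illustrative check (not part of the PR) could look like this:

```cpp
#include <cstdio>

int main ()
{
#ifdef __FMA__
  std::printf ("compiled with FMA code generation enabled\n");
#else
  std::printf ("compiled without FMA code generation\n");
#endif
  // Runtime check (GCC/Clang builtin): does this CPU implement FMA3?
  std::printf ("CPU %s FMA3\n",
               __builtin_cpu_supports ("fma") ? "supports" : "lacks");
  return 0;
}
```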
I finally found time to examine the results of the double-precision benchmarks. Looking at the disassembly, the worse-than-baseline performance with AppleClang does not make sense, and here is why. Let's look at the D_XYZ_NC test. The inner transform loop of the optimized version is the following:

```asm
LBB5_13:                                ## =>This Inner Loop Header: Depth=1
movq 48(%rbx), %rsi
movss (%rsi,%rcx), %xmm0 ## xmm0 = mem[0],zero,zero,zero
movss 4(%rsi,%rcx), %xmm1 ## xmm1 = mem[0],zero,zero,zero
shufps $0, %xmm0, %xmm0 ## xmm0 = xmm0[0,0,0,0]
cvtps2pd %xmm0, %xmm0
movapd %xmm11, %xmm3
mulpd %xmm0, %xmm3
addpd %xmm6, %xmm3
mulpd %xmm10, %xmm0
addpd %xmm7, %xmm0
shufps $0, %xmm1, %xmm1 ## xmm1 = xmm1[0,0,0,0]
cvtps2pd %xmm1, %xmm1
movapd %xmm13, %xmm2
mulpd %xmm1, %xmm2
addpd %xmm3, %xmm2
mulpd %xmm12, %xmm1
addpd %xmm0, %xmm1
movss 8(%rsi,%rcx), %xmm0 ## xmm0 = mem[0],zero,zero,zero
shufps $0, %xmm0, %xmm0 ## xmm0 = xmm0[0,0,0,0]
cvtps2pd %xmm0, %xmm0
movapd %xmm4, %xmm3
mulpd %xmm0, %xmm3
addpd %xmm2, %xmm3
mulpd %xmm14, %xmm0
addpd %xmm1, %xmm0
cvtpd2ps %xmm3, %xmm1
cvtpd2ps %xmm0, %xmm0
unpcklpd %xmm0, %xmm1 ## xmm1 = xmm1[0],xmm0[0]
movapd %xmm1, (%rax,%rcx)
incq %rdx
movq 56(%r14), %rsi
movq (%r15), %rax
subq %rax, %rsi
sarq $4, %rsi
addq $16, %rcx
cmpq %rsi, %rdx
jb LBB5_13
jmp LBB5_24
```

It follows closely what is written in the intrinsics, making use of packed additions/multiplications (they have the `pd` suffix). The inner transform loop of the baseline version is the following:

```asm
LBB3_15:                                ## %scalar.ph
## =>This Inner Loop Header: Depth=1
movss -8(%rax), %xmm4 ## xmm4 = mem[0],zero,zero,zero
movss -4(%rax), %xmm5 ## xmm5 = mem[0],zero,zero,zero
xorps %xmm7, %xmm7
cvtss2sd %xmm4, %xmm7
xorps %xmm0, %xmm0
cvtss2sd %xmm5, %xmm0
movss (%rax), %xmm4 ## xmm4 = mem[0],zero,zero,zero
cvtss2sd %xmm4, %xmm4
movapd %xmm7, %xmm6
mulsd %xmm13, %xmm6
movapd %xmm0, %xmm5
mulsd %xmm3, %xmm5
addsd %xmm6, %xmm5
movapd %xmm4, %xmm6
mulsd %xmm2, %xmm6
addsd %xmm5, %xmm6
addsd %xmm1, %xmm6
xorps %xmm5, %xmm5
cvtsd2ss %xmm6, %xmm5
movss %xmm5, -8(%rsi)
movapd %xmm7, %xmm5
mulsd %xmm12, %xmm5
movapd %xmm0, %xmm6
mulsd %xmm14, %xmm6
addsd %xmm5, %xmm6
movapd %xmm4, %xmm5
mulsd %xmm10, %xmm5
addsd %xmm6, %xmm5
movapd %xmm11, %xmm6
addsd %xmm6, %xmm5
movapd %xmm8, %xmm6
cvtsd2ss %xmm5, %xmm5
movss %xmm5, -4(%rsi)
movapd %xmm15, %xmm5
mulsd %xmm5, %xmm7
mulsd %xmm6, %xmm0
addsd %xmm7, %xmm0
mulsd -144(%rbp), %xmm4 ## 16-byte Folded Reload
addsd %xmm0, %xmm4
addsd %xmm9, %xmm4
xorps %xmm0, %xmm0
cvtsd2ss %xmm4, %xmm0
movss %xmm0, (%rsi)
incq %rdx
addq $16, %rax
addq $16, %rsi
cmpq %rcx, %rdx
jb LBB3_15
```

This code does not take advantage of packed additions/multiplications (note the `sd` suffix on the arithmetic instructions).
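To make the pd/sd distinction concrete, here is an illustrative C++ fragment (not code from the PR) showing what each loop's core arithmetic corresponds to:

```cpp
#include <emmintrin.h>  // SSE2

// Packed: one mulpd/addpd pair processes two doubles per instruction,
// as in the optimized loop above.
__m128d packed_mul_add (__m128d a, __m128d b, __m128d c)
{
  return _mm_add_pd (_mm_mul_pd (a, b), c);
}

// Scalar: plain double arithmetic compiles to mulsd/addsd,
// one double per instruction, as in the baseline loop.
double scalar_mul_add (double a, double b, double c)
{
  return a * b + c;
}
```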
Force-pushed from dababe1 to 04ae840.
That's encouraging 👍 I'm rushing to meet deadlines until the end of the week, so I won't be able to look at this properly and test it until then.
This pull request has been automatically marked as stale because it hasn't had recent activity.
Come back whenever you have time. We look forward to your contribution.
I'll give this a go on the Mac over the weekend. It's a good PR that's been sitting here for too long, just pending my tests.
Bottom line, everything is running green 🎊
Just had another quick look at the PR, and if it's OK from your side I'll just merge it.
Thanks for giving it a try. From my side it's ready.
In this PR, two tests fail on AppVeyor under PLATFORM=x86. These are about:
I didn't check yet whether it is directly related to the modifications provided by this PR (because recently AppVeyor fails regularly...), but as these failures don't seem to be the usual ones, I thought that I should mention it here. I hope to find time to investigate these failures soon. Cheers, guys.
Thanks for letting us know. Since the dedicated tests for the transform functions are running green, I'd expect that these failures are precision-related (i.e. the epsilons are too tight).
For me, it doesn't seem like an epsilon error, because the message is something like this:
It seems to be a Windows/Visual Studio-specific error, according to this link. I tried to reproduce the error on my Ubuntu laptop, but without success. Also, when I check the different AppVeyor logs, I don't always get the same errors; in some way, there is a bit of randomness there. Without a Windows machine (ARCHITECTURE=x86) with Visual Studio that compiles PCL in Debug mode, these bugs seem intractable to me 😥...
@UnaNancyOwen can you reproduce the results from these tests?
@SergioRAgostinho Yes, these tests seem to fail.
Maybe these tests should be fixed in a dedicated PR.
My comment is not directly related to this PR, but I'm regularly facing issues with the AppVeyor checker. I cannot reproduce the tests locally because I don't have a Windows machine. My question is: do you know a way to run the Windows tests under Ubuntu? I was thinking about a virtual machine or Docker; do you think that's possible (easily)?
I think running Windows in a virtual machine is the easiest way. Microsoft even provides images with a developer environment set up: https://developer.microsoft.com/en-us/windows/downloads/virtual-machines
This is valuable information, thank you. I'll give it a try as soon as possible 😉
Improve speed of transformPointCloud/WithNormals() functions
tl;dr:
This adds an implementation of the point cloud transform functions using SSE2 intrinsics. Depending on the transform scalar precision, compiler flags, compiler version, point type, and point cloud properties, this may lead to up to 35% faster transforms of VGA-sized point clouds.
The old, non-optimized implementation is retained as a fallback for machines without SSE2 instructions.
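As a rough sketch of how such a compile-time fallback is typically selected (illustrative; the function names are hypothetical and the PR's actual dispatch may differ):

```cpp
// Hypothetical forward declarations for the two code paths.
void transform_point_sse2   (const float* src, float* tgt);  // intrinsics path
void transform_point_scalar (const float* src, float* tgt);  // original fallback

// Standard macros: __SSE2__ (GCC/Clang), _M_X64 / _M_IX86_FP >= 2 (MSVC).
#if defined (__SSE2__) || defined (_M_X64) || (defined (_M_IX86_FP) && _M_IX86_FP >= 2)
  #define HAVE_SSE2 1
#else
  #define HAVE_SSE2 0
#endif

void transform_point (const float* src, float* tgt)
{
#if HAVE_SSE2
  transform_point_sse2 (src, tgt);
#else
  transform_point_scalar (src, tgt);
#endif
}
```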
Long version
Recently I was wondering whether I could improve the performance of the transformPointCloud() family of functions. This is how they transform every point internally (pcl/common/include/pcl/common/impl/transforms.hpp, lines 67 to 71 at c78482b):
This manually spelled-out vector-matrix product looks strange, especially since Eigen should properly align points and provide vectorized matrix operations. In fact, this already seemed strange to me nearly four years ago. So I asked why, and the answer was:
I decided to check for myself. My development environment is Ubuntu 16.04 with Eigen 3.3 and GCC 5.4. When I switched to Eigen's vector-matrix multiplication operator, I indeed observed a 2x slowdown. Same with GCC 6.3. However, GCC 7.2 gave nearly a 2x speedup! Clearly, the older compilers were not vectorizing properly.
Looking at the disassembly, I found that indeed they are not using vectorized SSE2 instructions. Funnily enough, when I compiled with GCC 5.4 and the -msse2 flag (instead of -march=native), I got vectorized code. It turns out that in native mode, in addition to the SSE2 extensions, FMA (fused multiply-add) instructions become enabled, and the compiler starts to abuse them. So simply switching to the vector-matrix product is not an option, because depending on the compiler version and flags it introduces either speedups or slowdowns. An alternative was to implement everything directly with SSE2 intrinsics, so that the compiler cannot screw up and is guaranteed to emit optimal SSE2 assembly.
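For illustration (simplified to a single point; names and setup are assumed, this is not the exact PCL code), the two variants under discussion look roughly like this:

```cpp
#include <Eigen/Geometry>
#include <pcl/point_types.h>

// Variant 1: the manually spelled-out vector-matrix product.
inline void transform_point_manual (const Eigen::Affine3f& tf,
                                    const pcl::PointXYZ& in, pcl::PointXYZ& out)
{
  const Eigen::Matrix4f& m = tf.matrix ();
  out.x = m (0,0) * in.x + m (0,1) * in.y + m (0,2) * in.z + m (0,3);
  out.y = m (1,0) * in.x + m (1,1) * in.y + m (1,2) * in.z + m (1,3);
  out.z = m (2,0) * in.x + m (2,1) * in.y + m (2,2) * in.z + m (2,3);
}

// Variant 2: Eigen's product; whether this vectorizes well depends on the
// compiler version and flags, as described above.
inline void transform_point_eigen (const Eigen::Affine3f& tf,
                                   const pcl::PointXYZ& in, pcl::PointXYZ& out)
{
  out.getVector3fMap () = tf * in.getVector3fMap ();
}
```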
Since PCL does not have a built-in benchmarking framework, I have a separate repository with the proposed implementation and benchmarks. Here are my results with VGA-sized point clouds on an i7 from 2015:
For every cell, two tests are run: the baseline PCL transform and the proposed transform. The reported numbers are the runtime (in microseconds) of the proposed transform and the fraction of the baseline time; thus fractions less than 1.0 represent improved performance.
I am too lazy to provide a complete discussion of the results, so I'll let anyone interested study this table or even run the tests themselves. Just a couple of points:

- XYZ point clouds are transformed 24% to 35% faster;
- transforming XYZRGBNormal point clouds is nearly 3 times slower than transforming XYZ clouds, although it involves only double the number of operations. I conclude that memory access is the real bottleneck here.

Closes #1255.
One last note: double-precision transforms can be made faster with AVX instructions, but I leave this as an exercise for future generations. (This PR now contains AVX optimizations as well.)
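For a flavor of the AVX approach (a sketch under an assumed data layout, not the PR's exact code): a __m256d register holds four doubles, so an entire column of the transform matrix participates in a single packed multiply.

```cpp
#include <immintrin.h>  // AVX

// c[0..3] are the columns of a 4x4 double transform matrix (assumed layout);
// p points to the input coordinates (x, y, z).
static inline __m256d transform_avx_sketch (const __m256d c[4], const double* p)
{
  __m256d r = _mm256_mul_pd (c[0], _mm256_set1_pd (p[0]));             // x * column 0
  r = _mm256_add_pd (r, _mm256_mul_pd (c[1], _mm256_set1_pd (p[1])));  // + y * column 1
  r = _mm256_add_pd (r, _mm256_mul_pd (c[2], _mm256_set1_pd (p[2])));  // + z * column 2
  return _mm256_add_pd (r, c[3]);                                      // + translation
}
```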