-
Notifications
You must be signed in to change notification settings - Fork 68
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Optimize the Akaze feature detector #67
Comments
@vadixidav and @xd009642 on my AMD Ryzen 9 7950X I get a factor 2x performance improvement with changes in #68, essentially by using Rayon in obvious places. I believe the code can be improved further, but serious profiling is needed to do so. Modern processors are so dependent on thermal conditions that it is hard to just compare execution timings. For the record, I tried to replace all Note: the PR will be more clear once #65 is merged. |
@stephanemagnenat I previously found that it was more efficient from a throughput perspective to perform akaze extraction on multiple frames in parallel rather than parallelizing the individual frames, which optimizes for latency. If you add rayon parallelization, please make rayon an optional dependency and gate out the rayon code when it isn't enabled, using ndarray or similar instead. For robotics, it may be preferable to use rayon for low latency odometry, whereas for mapping purposes it may be preferable to boost throughput. |
Of course, rayon is an optional dependency. I always have first the normal code without rayon, and then the rayon-specific code (see #68). Also, my feeling at the moment is that rayon should be used smartly, for example to not spawn threads when processing tiny images. For my application, which is related to real-time vision, latency is key. I will probably also do some more detailed profiling, because one target is WASM, where rayon is generally not available, and currently the performance of the normal code is much worst than the equivalent OpenCV CPU code. |
Some profiling shows that the hottest function is |
So I have rewritten the filters by considering the memory constraints. They seem to be 15% faster than the imageproc versions, and are unit tested against these. They are still far slower than the OpenCV equivalents, mostly due to the lack of user-exposed fast math in stable Rust, preventing Rust to vectorize the inner loops. There is an interesting related discussion on the rust user forum. If we would be able to force fast math, we could expect a factor 5 of speed-up! But that is only possible on nightly! Alternatively, we can wait for portable SIMD to stabilize and use that instead. My versions are already a step in that direction. That state of affair is a bit sad though... |
@vadixidav @xd009642 I've implemented "obvious" SIMD versions of the image filters. The first version, for horizontal filter, has 2x improvement for small kernel (7 wide) and 5x for large ones (71 wide). For the vertical filter, the improvement ranges from 20% (7 wide) to 4x (71 wide). I believe that there is still a bit of room for improvement (manually unrolled the inner loop a bit), but that is for later. It might also be worth waiting for the portable SIMD initiative to land for doing that. This brings several questions for which I'll like your opinion:
Please note that I'm happy to split this PR, for example to merge the SIMD filters earlier than the use of rayon, if these are less contreversial. |
I think that this is ready to be merged. There are two kinds of changes:
My most pressing need is 1., as I wish to deploy Akaze in an embedded WASM application. But as 2. is behind a feature flag, I think that it makes sense to merge as well. @vadixidav and @xd009642 anything else you need from me before merging this? |
I don't think there's anything extra, just need to find the time to go over and review the code. One thought would be if the CI doesn't test it builds against any wasm targets with the desired features it may be worth adding that to CI to make sure nothing's added in future that breaks that. |
Oh there is a Cargo.toml conflict after I merged the serde PR, so that is one thing needed. Btw I'm away this weekend so may not get to it until next week. Just trying to get through some stuff before I leave 😄 |
Rebased and made clippy happy. |
Should we also test the "rayon" feature in the CI? |
Ideally we should to avoid fixed only being applied to one and not the other |
Done in 721790e. The PR is ready for review. |
Please note that due to a limitation in github cargo actions, I could not run the test for the rayon |
@xd009642 now that the unit test is added for the |
I just realized that |
I'd rather use a replacement than make it optional, it's a bit of code noise wrapping all the instants and ideally features should be additive and not conflict with each other |
Ok, fine for me, I'll go for the replacement then! Thanks for the input! |
Sure and I'll try to find some time today or tomorrow for the (hopefully) final review before merge 👍 |
So the replacement doesn't have proper support for |
One question @xd009642, this PR adds the But elsewhere in CV there is a That is actually not necessary, as dependencies can have So I assume that it is historical, and so far have been using just |
Yeah the serde thing is historical I'd prefer the simpler serde feature - and updating the crates to do the same thing (in a future PR though) |
Ok, good, we keep it this way then. The PR is updated with a replacement on WASM target if the optional I had to do this because the WASM target is currently |
I'm closing the issue as the related PR has been merged. |
@stephanemagnenat i did in the past a test to export to wasm https://github.com/kalwalt/wasm-akaze-example i will try with these recent changes and i will let you know. |
@kalwalt great, on my side I successfully deployed to WASM (with SIMD extension) for our application (a coloring book extension to Candli). On my phone (Galaxy S9), it takes about 2s to capture an image, find the Akaze points, match them to a template, find the homography with some outlier rejection, extract a sub-part of the image, rectify and display that part. The phone camera is configured upon This is nowhere native performance but it is usable. |
While working on fixing #63, I'm going through all of Akaze's source code quite deeply, and so to not loose the understanding, I'm collecting here the low-hanging fruit optimization opportunities I'm discovering. These are:
create_nonlinear_scale_space
, theLx
andLy
images could be computed in parallel, as these only accessLsmooth
in read. This should be useful for the first evolutions when images are of a significant size.find_scale_space_extrema
, there is a O(n²) loop for testing duplicates. This could be optimized to O(n) using some form of spacing hashing.A note here: I did not consider micro-optimizations, such as parallelizing loops within simple image computations. If so, more could be done, but the benefit is not fully clear due to the overhead cost of spawning threads.
The text was updated successfully, but these errors were encountered: