
[registration] Add OMP based Multi-threading to SampleConsensusPrerejective #4433

Open · wants to merge 12 commits into base: master
Conversation

@koide3 (Contributor) commented Oct 5, 2020

This PR adds OMP-based multi-threading capability to SampleConsensusPrerejective.

Changes are as follows:

  • const qualifiers are added to several methods to clarify that they are thread-safe
  • getFitness() is modified to use a per-thread transformation variable
  • similar_features are precomputed for concurrency
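As an illustration of the third point, here is a minimal sketch of precomputing per-point neighbour lists in a parallel loop. The function name precomputeSimilarFeatures and the index arithmetic are stand-ins, not the actual PCL code; in the real implementation each query would go to the feature k-d tree. The pattern is safe because each iteration writes only its own slot of the outer vector:

```cpp
#include <cstddef>
#include <vector>

// Stand-in for precomputing similar_features: each point i gets its own
// list of k candidate indices, filled independently of all other points.
std::vector<std::vector<int>> precomputeSimilarFeatures(std::size_t n, int k) {
  std::vector<std::vector<int>> similar_features(n);
#pragma omp parallel for
  for (long i = 0; i < static_cast<long>(n); ++i) {
    // Each thread touches only similar_features[i], so no synchronization
    // is needed. A real implementation would run a k-d tree query here.
    std::vector<int>& neighbors = similar_features[i];
    for (int j = 0; j < k; ++j)
      neighbors.push_back(static_cast<int>((i + j) % static_cast<long>(n)));
  }
  return similar_features;
}
```

Because the result is independent of thread count, the precomputed table can then be read concurrently from the main loop without locks.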

@kunaltyagi (Member)

@koide3 Could you please tag the PR appropriately?

PS: Failures on Mac

@koide3 (Contributor, Author) commented Oct 5, 2020

@kunaltyagi It seems I don't have permission to edit tags...
I'll take a look at the errors on Mac later.

@kunaltyagi (Member) left a comment

Most of the changes LGTM. Haven't reviewed the code deeply

@kunaltyagi added labels: changelog: ABI break, changelog: deprecation, module: registration, needs: more work (Oct 12, 2020)
@kunaltyagi added the needs: code review label and removed the needs: more work label (Oct 12, 2020)
@kunaltyagi previously approved these changes Oct 12, 2020
@koide3 (Contributor, Author) commented Nov 3, 2020

I rebased and squashed to resolve the conflict.

@kunaltyagi (Member)

@SunBlack could you please take a look at the OMP pragma sites?

@SunBlack (Contributor) commented Nov 4, 2020

@SunBlack could you please take a look at the OMP pragma sites?

It took a while until both compilers built the branch.

@kunaltyagi After removing Ubuntu 16.04 from Azure: do we still have a GCC 6-8 anywhere, so we can still be sure to have OPENMP_LEGACY_CONST_DATA_SHARING_RULE where necessary?

@koide3 Are you sure about the usage of firstprivate? A copy of the vectors is made for each thread.

@kunaltyagi (Member)

do we still have a GCC 6-8 anywhere?

Yes, 18.04 has GCC 7 on it (though we force install GCC 8 on it for now due to ... reasons)

@koide3 (Contributor, Author) commented Nov 5, 2020

@SunBlack

Are you sure about the usage of firstprivate? A copy of the vectors is made for each thread.

Yes, one copy of them must be created for each thread.
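For reference, a self-contained sketch of what firstprivate does (plain OpenMP, not PCL code): each thread receives its own copy of scratch, initialized from the original, so writes inside the loop cannot race with other threads:

```cpp
#include <vector>

// firstprivate demo: `scratch` is copied once per thread, so each thread
// may freely overwrite its copy while iterating.
int sumWithFirstprivate(const std::vector<int>& base, int iters) {
  std::vector<int> scratch = base;  // the original each thread copies from
  int total = 0;
#pragma omp parallel for firstprivate(scratch) reduction(+ : total)
  for (int i = 0; i < iters; ++i) {
    scratch[0] = i;  // write goes to the thread-private copy only
    total += scratch[0] + scratch[1];
  }
  return total;
}
```

With base = {0, 10} and iters = 4 this returns 46 for any thread count: scratch[1] is never modified, and the reduction combines the per-thread partial sums deterministically.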

@mvieth (Member) commented Nov 17, 2020

Do you have an estimate of which parts of computeTransformation take the most time? I am asking to determine whether it really makes sense to parallelize both for loops as you did, or whether only parallelizing the first loop is worthwhile because the second loop is fast anyway and the overhead of the parallelization is too large - just a thought.
It would also be interesting to see how large the speedup actually is, e.g. if you run the OpenMP version with 2 or 4 threads, how much faster is it than with one thread? That could give hints as to whether the OpenMP implementation still has opportunities for optimization.

@koide3 (Contributor, Author) commented Nov 18, 2020

Here is a brief benchmark result. The second for loop, which is the main RANSAC loop, is the most expensive part of the algorithm, while the first for loop is comparatively light. The overhead of OpenMP appears small in this case, and both loops get much faster with it.

15950 pts vs 15772 pts (FPFH)

with OMP
threads   1st[msec] 2nd[msec]  total[msec]
1         261.082   22999.672  23260.755
2         118.717   12271.585  12390.303
3          79.286    8156.884   8236.170
4          63.650    7475.746   7539.397
5          53.186    5391.872   5445.058
6          44.846    4408.987   4453.833
7          43.878    3903.800   3947.678
8          36.810    4014.925   4051.735
9          33.360    3960.413   3993.773

without OMP
threads   1st[msec] 2nd[msec]  total[msec]
1          263.646  23792.844  24056.490
1          226.914  24037.704  24264.618
1          226.266  22705.109  22931.375
1          221.195  25475.866  25697.061

@mvieth (Member) left a comment

Your benchmark looks really promising!
Do you think it would make sense to add a unit test, e.g. to copy the one in test_sac_ia.cpp and run it again with 4 threads?

@kunaltyagi kunaltyagi self-requested a review November 19, 2020 08:03
@koide3 (Contributor, Author) commented Nov 19, 2020

Can I rebase and force-push to resolve the conflicts?

@JStech (Contributor) left a comment

Looks good--just a couple minor things

@koide3 (Contributor, Author) commented Nov 24, 2020

Conflicts are resolved.

@kunaltyagi (Member)

@JStech @mvieth Please resolve the conversations that have reached a conclusion. It becomes a bit difficult to see whether there's work left.

@kunaltyagi kunaltyagi dismissed their stale review November 24, 2020 11:46

Loads of updates

@kunaltyagi kunaltyagi self-requested a review November 24, 2020 11:47
@JStech (Contributor) commented Nov 25, 2020

Somehow my comment threads don't have "Resolve" buttons--maybe because I made them comments and not change requests in the first place? I do consider them all resolved.

@mvieth previously approved these changes Feb 3, 2021
@mvieth (Member) left a comment

Sorry for the long delay, the changes look good.
Only the Windows CIs are failing, but I seem to be unable to rerun them, probably because the runs are too old? Perhaps closing and reopening this PR would help?

@kunaltyagi kunaltyagi added the needs: author reply Specify why not closed/merged yet label May 26, 2021
Hopefully this fixes compile error on windows: "only a variable or static data member can be used in a data-sharing clause"
@mvieth (Member) commented Aug 23, 2022

I checked again, and the drawing of the random samples (selectSamples) should not be done in parallel (it calls rand()). That either has to be wrapped in omp critical, or, maybe a better approach, the for-loop in getFitness could be made parallel while the main for-loop stays sequential. That would be safer, but I haven't checked yet how the two would compare in terms of speed-up.
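One way to keep the main loop parallel while serializing only the sampling is a named omp critical section around the draw, as suggested. The sketch below is self-contained and uses a trivial loop body in place of the PCL code; drawSamples and the modulo mapping are stand-ins:

```cpp
#include <cstdlib>
#include <vector>

// Sketch: parallel loop where only the call to the non-thread-safe
// std::rand() is serialized via a named critical section.
std::vector<int> drawSamples(int iters) {
  std::vector<int> samples(iters);
#pragma omp parallel for
  for (int i = 0; i < iters; ++i) {
    int s;
    // std::rand() uses hidden shared state and is not guaranteed to be
    // thread-safe, so only one thread at a time may call it.
#pragma omp critical(sample_rng)
    s = std::rand();
    samples[i] = s % 100;  // rest of the iteration runs fully in parallel
  }
  return samples;
}
```

The critical section keeps the calls well-defined, at the cost of contention if the serialized part is a large fraction of each iteration.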

@larshg (Contributor) commented Aug 24, 2022

I checked again, and the drawing of the random samples (selectSamples) should not be done in parallel (it calls rand()). That either has to be wrapped in omp critical, or, maybe a better approach, the for-loop in getFitness could be made parallel while the main for-loop stays sequential. That would be safer, but I haven't checked yet how the two would compare in terms of speed-up.

Ahh, yes, that seems to cause problems. At least it's not as random as it should be, and according to some reading, calling rand() in a multithreaded manner can lead to UB.
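An alternative to guarding rand() is to avoid the shared generator entirely, e.g. with a deterministically seeded &lt;random&gt; engine per iteration. This is a sketch, not the PR's code; the function name and seeding scheme are assumptions, and per-thread engines seeded once would amortize the construction cost better:

```cpp
#include <random>
#include <vector>

// Sketch: no shared RNG state at all. Each iteration constructs its own
// engine from a deterministic seed, so the output is identical for any
// thread count and there is nothing to serialize.
std::vector<int> drawSamplesNoSharedState(int iters, unsigned seed) {
  std::vector<int> samples(iters);
#pragma omp parallel for
  for (int i = 0; i < iters; ++i) {
    std::mt19937 rng(seed + static_cast<unsigned>(i));
    std::uniform_int_distribution<int> dist(0, 99);
    samples[i] = dist(rng);
  }
  return samples;
}
```

Reproducible output per seed is also handy for unit tests, since a parallel run can be compared against a sequential reference.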

@larshg (Contributor) left a comment

Look into calling rand() from multiple threads.

@larshg (Contributor) commented Aug 24, 2022

I guess the estimateRigidTransformation is also somewhat compute-intensive, so the better option is to guard the random sampling?

PointCloudSource input_transformed;
input_transformed.resize(input_->size());
transformPointCloud(*input_, input_transformed, final_transformation_);
transformPointCloud(*input_, input_transformed, transformation);

// For each point in the source dataset

nn_indices and nn_dists could be moved outside the for loop to reduce (de)allocations.
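A minimal sketch of the suggested hoisting, with a stand-in nearestSearch in place of the PCL radius search (nearestSearch, sumNearestDists, and the 1-D distance metric are all assumptions for illustration). In the parallel version each thread would need its own pair of buffers, e.g. declared inside the parallel region or marked private:

```cpp
#include <cmath>
#include <cstddef>
#include <vector>

// Hypothetical stand-in for the nearest-neighbour query in getFitness():
// writes the best match for `query` into the caller-provided buffers.
static void nearestSearch(const std::vector<float>& targets, float query,
                          std::vector<int>& nn_indices,
                          std::vector<float>& nn_dists) {
  int best = 0;
  float best_d = std::fabs(targets[0] - query);
  for (std::size_t j = 1; j < targets.size(); ++j) {
    float d = std::fabs(targets[j] - query);
    if (d < best_d) { best_d = d; best = static_cast<int>(j); }
  }
  nn_indices[0] = best;
  nn_dists[0] = best_d;
}

float sumNearestDists(const std::vector<float>& queries,
                      const std::vector<float>& targets) {
  // Buffers hoisted out of the loop: every iteration reuses the same
  // storage instead of allocating and freeing two vectors per point.
  std::vector<int> nn_indices(1);
  std::vector<float> nn_dists(1);
  float sum = 0.f;
  for (float q : queries) {
    nearestSearch(targets, q, nn_indices, nn_dists);
    sum += nn_dists[0];
  }
  return sum;
}
```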

@mvieth (Member) commented Aug 24, 2022

I guess the estimateRigidTransformation is also somewhat compute intensive, so the better option is to guard the random sampling?

I can't confirm that: In my tests, less than 1% of the time of computeTransformation was spent in estimateRigidTransformation. Almost all of the time was spent in getFitness and finding similar features. Another reason to not parallelize the main loop: currently, the loop runs exactly max_iterations_ times, but it could make sense to stop earlier if a good solution has been found. With a parallel main loop, that would be more difficult to implement.
I will check how well getFitness can be parallelized.
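A sketch of what a parallel getFitness inner loop could look like: counting inliers and accumulating error with an OpenMP reduction over precomputed nearest-neighbour distances. fitnessSketch and its inputs are stand-ins, not the PCL implementation, which searches the target cloud per point rather than taking distances as an argument:

```cpp
#include <vector>

// Sketch: parallelising the per-point loop inside getFitness() with a
// reduction, instead of parallelising the outer RANSAC loop. The main
// loop can then stay sequential, which also makes early termination easy.
double fitnessSketch(const std::vector<float>& nn_dists, float thresh,
                     int& inliers) {
  double error = 0.0;
  int count = 0;
#pragma omp parallel for reduction(+ : error, count)
  for (long i = 0; i < static_cast<long>(nn_dists.size()); ++i) {
    if (nn_dists[i] < thresh) {
      ++count;             // per-thread partial, combined by the reduction
      error += nn_dists[i];
    }
  }
  inliers = count;
  return error;
}
```

Since each iteration is independent and the reduction handles the accumulators, no sampling or shared RNG state is touched inside the parallel region.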

@larshg (Contributor) commented Aug 24, 2022

Yeah okay, that doesn't sound like much. Let's see what your tests show 👍
