
Avoid huge index jumps in RandomSample. Remove io dependency from 2d. #2141

Merged — 3 commits merged into PointCloudLibrary:master on Dec 19, 2017

Conversation

@ThorstenHarter (Contributor)

I have noticed that using pcl::RandomSample to reduce a cloud of 125,000 points to one of 50,000 will sometimes produce a gap of several centimeters in the x-direction in my Lidar data.

The problem is located in random_sample.hpp:

  // Algorithm A: draw V ~ uniform (0, 1), then compute the number of
  // records S to skip before selecting the next sample.
  float V = unifRand ();
  unsigned S = 0;
  float quot = static_cast<float> (top) / static_cast<float> (N);
  while (quot > V)
  {
    S++;
    top--;
    N--;
    quot = quot * static_cast<float> (top) / static_cast<float> (N);
  }
  index += S;

When V is close to 0, the index makes a huge jump (e.g. > 20,000), resulting in the gap seen in the attached screenshot.
To fix this, I have set a lower limit of 0.01 for the random number V, and this has worked well in my tests.

[screenshot: resampled torus]
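
To illustrate, here is a minimal standalone sketch of the skip computation above (my own, not PCL code; the sizes match the test further down):

// Standalone sketch of the skip loop (not PCL code). With n = 1000 samples
// remaining out of N = 25344 points, top/N is ~0.96, so quot decays slowly:
// a tiny V forces a skip S of roughly 285 records, versus a mean skip of
// about N/n = 25. Clamping V at 0.01, as proposed, caps S at roughly 115
// for these starting sizes.
#include <iostream>

int main ()
{
  unsigned N = 25344;    // remaining input points
  unsigned n = 1000;     // remaining samples to draw
  unsigned top = N - n;  // records that may be skipped
  float V = 0.00001f;    // pathologically small uniform variate

  unsigned S = 0;
  float quot = static_cast<float> (top) / static_cast<float> (N);
  while (quot > V)
  {
    S++;
    top--;
    N--;
    quot = quot * static_cast<float> (top) / static_cast<float> (N);
  }
  std::cout << "skip S = " << S << std::endl;
  return 0;
}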

@SergioRAgostinho added the needs: code review label on Dec 11, 2017
@SergioRAgostinho (Member)

Thanks 👍

Although it seems like a minimal fix, I'll need to check the original paper first to make sure we're not introducing some statistical bias.

@SergioRAgostinho (Member) commented Dec 12, 2017

Just had a look at things and I think I figured out the problem. The issue is here:

quot = quot * static_cast<float> (top) / static_cast<float> (N);

Try modifying it to simply

quot = static_cast<float> (top) / static_cast<float> (N);

and please report back the results.

You'll get the added bonus of it becoming on average three times faster ;)

To fully confirm that this was the issue, I would still like to verify the mean and standard deviation of the differences between consecutive filtered indices (not the removed ones), before and after the correction. If you could output that from your test case and report the results, it would be awesome.

In the best scenario the mean will be close to N / sample_size and the std as close to 0 as possible.
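
Something like this minimal sketch would do for the statistics (it assumes the filtered indices come back sorted ascending, with at least two entries):

// Mean and (population) standard deviation of the differences between
// consecutive filtered indices. Assumes indices.size () >= 2, sorted ascending.
#include <cmath>
#include <cstddef>
#include <vector>

void diffStats (const std::vector<int> &indices, double &mean, double &stddev)
{
  const std::size_t m = indices.size () - 1;  // number of differences
  double sum = 0.0;
  for (std::size_t k = 1; k < indices.size (); ++k)
    sum += indices[k] - indices[k - 1];
  mean = sum / m;

  double sq = 0.0;
  for (std::size_t k = 1; k < indices.size (); ++k)
  {
    const double d = indices[k] - indices[k - 1] - mean;
    sq += d * d;
  }
  stddev = std::sqrt (sq / m);
}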

@SergioRAgostinho added the needs: author reply label and removed the needs: code review label on Dec 12, 2017
@ThorstenHarter (Contributor, Author)

Hi Sergio,
the original line matches the pseudo code in the paper:

quot := quot x top/N

I'm afraid the proposed change makes the problem even worse:

[screenshot: resampled torus 2]

With your change, quot no longer shrinks multiplicatively, so if V is very small (e.g. 0.00001) the while loop only stops when "top" (the number of remaining points) is also close to zero.

The first result I posted might even be valid from a statistical viewpoint, but common sense tells me it's not a valid sample.
So what I did was a visual inspection of a range of different sample sizes in the 3D viewer.
I'll see if I can get some numbers into a spreadsheet.

@ThorstenHarter (Contributor, Author)

I made a smaller example, picking 1000 from 25344 points and logging the index increment (S+1).

Here are the numbers:

value     master    branch
min       1         1
max       258       130
median    18        18
mean      25.258    25.253
std       25.853    24.619

My proposed fix prevents extreme runaway values of "max" and therefore also lowers "mean" and "std" somewhat.
In my opinion this is desirable.

@SergioRAgostinho added the needs: code review label and removed the needs: author reply label on Dec 12, 2017
@SergioRAgostinho (Member)

First of all thanks for all the trouble.

the original line matches the pseudo code in the paper

You're right. I completely misinterpreted the paper's "pseudo factorial/combinatorial symbol" as an index. Your visual results clearly show it.

Here are the numbers:

value     master    branch
min       1         1
max       258       130
median    18        18
mean      25.258    25.253
std       25.853    24.619

Your results are definitely encouraging, but the change you propose breaks an assumption from the paper which is plastered all around it: the uniform variate needs to be generated between 0 and 1. This is the main reason I haven't accepted your proposal.

I started to think about what exactly makes things fail specifically in your case, and I think it is the sequential nature of your point cloud, both in terms of indexing and spatial distribution. This means that when there are bigger index jumps you get visible holes.

This "algorithm A" was implemented due to it's speed advantage, since it only needs to instantiate n times the random number generator vs the ~N. Statistically it ensures an approximation to the uniform distribution and in most point clouds it will work just fine, but if there's an organization such as yours, it can produce very bad results.

My proposal is that we actually implement "Algorithm S" as an alternative. It is potentially slower (probably not by much) but less prone to statistical problems, and it would need to be explicitly selected by the user; for special requirements like yours it can still save the day.

The first thing now is to actually verify that the "Algorithm S" solves your problems.

@SergioRAgostinho added the needs: testing and needs: author reply labels and removed the needs: code review and needs: testing labels on Dec 12, 2017
@ThorstenHarter (Contributor, Author) commented Dec 15, 2017

Hi Sergio,
this sequential indexing is a feature shared by all sensor-generated point clouds I work with, because they are created either by a 2D line scan plus relative movement or by a 2D time-of-flight sensor.

I have tried Algorithm S and it provides a satisfying result in the same run time.

[screenshot: resampled torus, Algorithm S]

Here's the code I have written:

// Algorithm S
unsigned int i = 0;
unsigned int index = 0;
std::vector<bool> added;
if (extract_removed_indices_)
  added.resize (indices_->size (), false);
size_t n = sample_size;
while (n > 0)
{
  // Step 1: [Generate U.] Generate a random variate U that is uniformly distributed between 0 and 1.
  const float U = unifRand ();
  // Step 2: [Test.] If N * U > n, go to Step 4. 
  if ((N * U) <= n)
  {
    // Step 3: [Select.] Select the next record in the file for the sample, and set n : = n - 1 and N : = N - 1.
    // If n > 0, then return to Step 1; otherwise, the sample is complete and the algorithm terminates.
    if (extract_removed_indices_)
      added[index] = true;
    indices[i++] = (*indices_)[index++];
    --n;
    --N;
  }
  else
  {
    // Step 4: [Don't select.] Skip over the next record (do not include it in the sample), set N : = N - 1, and return to Step 1.
    ++index;
    --N;
  }
}

And here are the numbers for comparison:

value     Algo A    Algo A*   Algo S
min       1         1         1
max       258       130       181
median    18        18        18
mean      25.258    25.253    25.322
std       25.853    24.619    24.128

My proposal would be to replace Algorithm A with Algorithm S, because I consider the current implementation of Algorithm A "dangerous".

@SergioRAgostinho added the needs: code review label and removed the needs: author reply label on Dec 18, 2017
@SergioRAgostinho (Member)

Looks like Algorithm S is the way to go then. I only started to clock noticeable differences between the methods when dealing with tens of millions of points, and even then still on the order of 10 ms.

Let's go for it 👍

@SergioRAgostinho added the needs: more work label and removed the needs: code review label on Dec 19, 2017
@ThorstenHarter (Contributor, Author)

I have now committed the code snippet described above to random_sample.hpp and all tests have passed.

Please note that there is also a commit in this PR which has nothing to do with the issue (1c9259f); I forgot to create a new branch for that.
Should I revert the edit?

{
  // Step 4: [Don't select.] Skip over the next record (do not include it in the sample), set N : = N - 1, and return to Step 1.
  ++index;
  --N;
@SergioRAgostinho (Member)

Regardless of U, you always decrement N and you always increment index. I suggest you move these operations outside of the if/else block. Basically the else is unnecessary.
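
I.e., a sketch of the hoisted form, reusing the variables from your snippet:

// Sketch: N and index are updated unconditionally, so only the selection
// itself remains inside the if; the else disappears entirely.
while (n > 0)
{
  const float U = unifRand ();  // Step 1: U ~ uniform (0, 1)
  if ((N * U) <= n)             // Step 2/3: select this record
  {
    if (extract_removed_indices_)
      added[index] = true;
    indices[i++] = (*indices_)[index];
    --n;
  }
  // Step 4 happens either way: advance the read position, shrink the pool.
  ++index;
  --N;
}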

@ThorstenHarter (Contributor, Author)

How about an empty "else", which the compiler removes, for better readability?

while (n > 0)
{
  // Step 1: [Generate U.] Generate a random variate U that is uniformly distributed between 0 and 1.
  const float U = unifRand ();
  // Step 2: [Test.] If N * U > n, go to Step 4. 
  if ((N * U) <= n)
  {
    // Step 3: [Select.] Select the next record in the file for the sample, and set n : = n - 1.
    if (extract_removed_indices_)
      added[index] = true;
    indices[i++] = (*indices_)[index];
    --n;
  }
  else
  {
    // Step 4: [Don't select.] Skip over the next record (do not include it in the sample).
  }
  // Set N : = N - 1.
  --N;
  ++index;
  // If n > 0, then return to Step 1; otherwise, the sample is complete and the algorithm terminates.
}

@SergioRAgostinho (Member)

I wouldn't include it as actual code, but I'm ok with a commented version of the else in order to visualize the "step 4" in context.

@SergioRAgostinho added the needs: author reply label and removed the needs: more work label on Dec 19, 2017
@SergioRAgostinho (Member)

Please note that there is also a commit in this PR which has nothing to do with the issue (1c9259f); I forgot to create a new branch for that.
Should I revert the edit?

It's already a separate commit, so it's totally fine for me.

@@ -92,7 +92,7 @@ template<typename PointT>
 void
 pcl::RandomSample<PointT>::applyFilter (std::vector<int> &indices)
 {
-  unsigned N = static_cast<unsigned> (indices_->size ());
+  unsigned int N = static_cast<unsigned> (indices_->size ());
@SergioRAgostinho (Member) — Dec 19, 2017

One additional comment. Everything which is going to be used as an index needs to be of type size_t. Please apply that change all over this method.

Edit: As an index or as a maximum value for an index; in other words, just replace all your unsigned ints with size_t. We recently started having some "surprises" with indices overflowing.

@ThorstenHarter (Contributor, Author)

Good point, this also makes some of the static_casts obsolete.
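
For illustration, a quick sketch of what the change amounts to (hypothetical function, not the exact commit):

#include <cstddef>
#include <vector>

// With std::size_t for every index-like variable, the static_cast on
// size () disappears and, on typical 64-bit platforms, the index
// overflow risk goes away.
void sketch (const std::vector<int> &source_indices)
{
  std::size_t N = source_indices.size ();  // was: static_cast<unsigned> (...)
  std::size_t index = 0;                   // running read position
  std::size_t i = 0;                       // running write position
  (void) N; (void) index; (void) i;        // selection logic as above
}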

@SergioRAgostinho (Member) left a comment

Final step: can you squash the three commits which affect random_sample.hpp only? Basically, leave the "dead line removal" and "visualization dependency removal" commits untouched.

Thanks for your help and time. It was really cool to visualize the results and the corresponding "quality numbers".

@ThorstenHarter (Contributor, Author)

I'm afraid I don't know how to do this; I have no experience with git and used the web interface for editing.
I also have GitHub Desktop installed, but I assume you still need the console for advanced operations...
Could you please give me some hints?

@SergioRAgostinho (Member)

  • Have a read through this.
  • Invoke git rebase in interactive mode (git rebase -i <last-commit-before-your-changes>).
  • During the interactive rebase, reorder things so that your random_sample.hpp related commits are all sequential, one after the other.
  • Select squash on the latest 2, and all 3 will be condensed into a single one.
  • When you push to your repo, specify the option -f/--force (git push -f).

Try all this first in a copy of your fork to prevent accidents, until you're comfortable with the outcome.

ThorstenHarter and others added 2 commits December 19, 2017 19:40
If you try to do an algorithms-only build of PCL (without the "io" and "visualization" modules), you will stumble across this dependency, which shouldn't exist in my opinion.
The "2d" module contains only a handful of files which implement morphological filters and does not include any files from the "io" module.
@SergioRAgostinho added the needs: code review label and removed the needs: author reply and needs: code review labels on Dec 19, 2017
@SergioRAgostinho SergioRAgostinho merged commit 0a302d3 into PointCloudLibrary:master Dec 19, 2017
@SergioRAgostinho changed the title to "Avoid huge index jumps in RandomSample. Remove io dependency from 2d." on Aug 26, 2018