Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix a subtle bug in DBSCAN noise marking #525

Merged
merged 3 commits into from
May 28, 2021

Conversation

aprokop
Copy link
Contributor

@aprokop aprokop commented May 27, 2021

Situation: a core point is connected to only boundary points. However,
those boundary are connected to other core points. So it may happen that
the boundary points are assigned to other clusters, resulting in a
cluster with a single point (core).

The bug was that those single core clusters were marked as noise.

The fix is ugly as heck, but I don't know how to do better.

@aprokop aprokop added bug Something isn't working clustering Anything to do with clustering algorithms labels May 27, 2021
@aprokop aprokop requested a review from dalg24 May 27, 2021 20:32
aprokop added 2 commits May 27, 2021 16:43
Situation: a core point is connected to only boundary points. However,
those boundary are connected to other core points. So it may happen that
the boundary points are assigned to other clusters, resulting in a
cluster with a single point (core).

This test makes sure that we don't accidentally label it as noise, as
its label is likely to be itself.
examples/dbscan/ArborX_DBSCANVerification.hpp Show resolved Hide resolved
Comment on lines 208 to 227
if (is_special_case)
{
// Cannot use CCSCore as it always returns true
Kokkos::parallel_for("ArborX::dbscan::mark_noise",
Kokkos::RangePolicy<ExecutionSpace>(exec_space, 0, n),
KOKKOS_LAMBDA(int const i) {
if (cluster_sizes(labels(i)) == 1)
labels(i) = -1;
});
}
else
{
DBSCAN::DBSCANCorePoints<MemorySpace> is_core{num_neigh, core_min_size};
Kokkos::parallel_for("ArborX::dbscan::mark_noise",
Kokkos::RangePolicy<ExecutionSpace>(exec_space, 0, n),
KOKKOS_LAMBDA(int const i) {
if (cluster_sizes(labels(i)) == 1 && !is_core(i))
labels(i) = -1;
});
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I reverted that patch out of curiosity and it looks like the unit tests are still passing.

Copy link
Contributor Author

@aprokop aprokop May 27, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, I did the same. I think some permutation of the points in the test may trigger the failure in Serial, but it's really hard to get to it. It did fix the failure when verifying 500K subset of HACC points, run as

$ ./ArborX_DBSCAN.exe --binary --filename hacc.arborx --print-dbscan-timers --core-min-size 5 --cluster-min-size 1  --eps 0.042 --impl fdbscan-densebox --max-num-points 500000 --verify

and it also fixed divergence in the number of clusters and number of points in clusters when running HACC problem using different algorithms (as long as --cluster_min_size=1 was also specified).

@aprokop
Copy link
Contributor Author

aprokop commented May 28, 2021

CI failure is not a failure, but a warning from a container build.

src/ArborX_DBSCAN.hpp Show resolved Hide resolved
src/ArborX_DBSCAN.hpp Outdated Show resolved Hide resolved
@aprokop aprokop merged commit fc78f08 into arborx:master May 28, 2021
@aprokop aprokop deleted the dbscan_bugfix branch May 28, 2021 14:53
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working clustering Anything to do with clustering algorithms
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants