-
Notifications
You must be signed in to change notification settings - Fork 41
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Introduce OrderedIntersection TreeTraversal #683
Conversation
I think this is in a good enough state to discuss now. The example compares the new traversal that processes a callback in order of the distance between ray and box with an approach that first collects all intersections then sorts them and finally processes the callback. Typical speedups I observe are between 10x and 20x (also depending on sorting predicates which is sometimes beneficial). The new tag is called |
src/details/ArborX_Ray.hpp
Outdated
template <typename Geometry> | ||
KOKKOS_INLINE_FUNCTION void | ||
overlapDistance(Ray const &ray, Geometry const &geometry, float &length, | ||
float &distance_to_origin) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I was already not sure about exposing overlapDistance(Ray, Sphere)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I only kept adding it since overlapDistance(Ray, Sphere)
was already there.
IMHO, it's a useful helper function for Ray
s. I don't have a problem moving it to the example, though.
Only pushing to the stack when necessary gives
vs.
when running
That's to say it's like 30% faster (4.64e-01 vs 6.02e-01). |
c333180
to
8983d7d
Compare
Avoiding the batched sort, the profile looks more like
so we are in total like 3.6x faster with the new approach (4.61e-01s vs 1.66s). Even just considering the traversals, the new approach is slightly after (4.61e-01s vs. 4.92e-01s). Without sorting the predicates, we get
which is marginally better. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This PR will likely have to be updated for #675.
if (invoke_callback_and_check_early_exit( | ||
_callback, predicate, | ||
HappyTreeFriends::getLeafPermutationIndex(_bvh, node))) | ||
return; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Guess it would still allow one to find just k
nearest ones, while processing them in the distance closest order.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, the question is whether we want it to rather behave like a spatial predicate or like a nearest predicate. I tend to nearest which is why I choose OrderedNearest
for now.
Also, this would be one of the traversals raytracing algorithms commonly use (multi-hit).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I do agree that this has nearest flavor. We don't check predicate for traversing, only on the distance. Nearest
though implies a specific order, so does not describe the situation well.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
we do check predicates, every time we compare to infinity
return distance(geometry, box); | ||
}; | ||
|
||
using PairIndexDistance = Kokkos::pair<int, float>; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If we are trying to generalize this, the second argument should the type returned by the distance
function. As it can be provided by a user, this will have to be determined and not hardcoded.
Using Nearest traversals for the benchmark using k=100 gives
In particular, this takes 1.38e+00s compared to the 4.61e-01s for the new traversal. |
|
Of course, we also don't need to allocate global memory on the device for the traversal with the approach here as opposed ti nearest traversals. |
Do we want to change the name of the file 'example_raytracing.cpp' to be more specific? This example is a raytracing problem in absorbing media. It is more challenging in certain way, but might not be a general/typical problem for raytracing (e.g. between surfaces, scattering media, etc.). |
examples/raytracing/CMakeLists.txt
Outdated
@@ -1,3 +1,3 @@ | |||
add_executable(ArborX_RayTracing.exe example_raytracing.cpp) | |||
target_link_libraries(ArborX_RayTracing.exe ArborX::ArborX Boost::program_options) | |||
add_test(NAME ArborX_RayTracing_Example COMMAND ./ArborX_RayTracing.exe --spheres=1000 --rays=50000 --L=200.0) | |||
add_test(NAME ArborX_RayTracing_Example COMMAND ./ArborX_RayTracing.exe --Nx=10 --Ny=10 --Nz=10 --Lx=1 --Ly=100000 --Lz=100000 --rays=1000) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
A more common GCD would be like 36x36x36, 48x48x48 for a 3-D problem.
Do you set this aspect ratio of the domain size for any purpose?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
No, it's pretty random. Note that I use
./ArborX_RayTracing.exe --Nx=100 --Ny=100 --Nz=100 --rays=10
for all timings posted above.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
100x100x100 might be too large.
Maybe you can keep the number of rays per cell, but reduce the size to 48x48x48 or 60x60x60.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
For 60x60x60, the profiling looks like
TOTAL TIME: 0.253154 seconds
TOP-DOWN TIME TREE:
<average time> <percent of total time> <percent time in Kokkos> <percent MPI imbalance> <remainder> <kernels per second> <number of calls> <name> [type]
===================
|-> 2.02e-01 sec 79.7% 37.0% 0.0% 60.3% 7.43e+01 1 second_approach [region]
| |-> 6.93e-02 sec 27.4% 92.2% 0.0% 0.0% 1.73e+02 1 ArborX::query [region]
| | |-> 6.93e-02 sec 27.4% 92.2% 0.0% 0.7% 1.73e+02 1 ArborX::CrsGraphWrapper::query::spatial [region]
| | |-> 6.64e-02 sec 26.2% 96.0% 0.0% 1.0% 1.36e+02 1 ArborX::CrsGraphWrapper::two_pass [region]
| | | |-> 4.11e-02 sec 16.2% 95.7% 0.0% 4.2% 7.30e+01 1 ArborX::CrsGraphWrapper::two_pass:second_pass [region]
| | | | |-> 3.93e-02 sec 15.5% 99.9% 0.0% 0.1% 2.54e+01 1 ArborX::BVH::query::spatial [region]
| | | | | |-> 3.93e-02 sec 15.5% 100.0% 0.0% ------ 1 ArborX::TreeTraversal::spatial [for]
| | | |-> 1.85e-02 sec 7.3% 99.8% 0.0% 0.1% 5.40e+01 1 ArborX::CrsGraphWrapper::two_pass::first_pass [region]
| | | | |-> 1.85e-02 sec 7.3% 99.9% 0.0% 0.1% 5.40e+01 1 ArborX::BVH::query::spatial [region]
| | | | |-> 1.85e-02 sec 7.3% 100.0% 0.0% ------ 1 ArborX::TreeTraversal::spatial [for]
| | | |-> 6.01e-03 sec 2.4% 97.3% 0.0% 2.7% 6.65e+02 1 ArborX::CrsGraphWrapper::first_pass_postprocess [region]
| | | | |-> 5.75e-03 sec 2.3% 100.0% 0.0% ------ 1 ArborX::Algorithms::exclusive_scan [scan]
| | |-> 2.20e-03 sec 0.9% 7.2% 0.0% 92.8% 9.07e+02 1 ArborX::CrsGraphWrapper::query::spatial::compute_permutation [region]
| | |-> 2.70e-04 sec 0.1% 10.3% 0.0% 89.7% 3.70e+03 1 ArborX::CrsGraphWrapper::query::spatial::init_and_alloc [region]
| |-> 9.95e-03 sec 3.9% 100.0% 0.0% ------ 1 deposit_energy [for]
| |-> 8.45e-04 sec 0.3% 100.0% 0.0% ------ 1 ArborX::Algorithms::iota [for]
|-> 3.98e-02 sec 15.7% 94.0% 0.0% 0.0% 7.54e+01 1 first_approach [region]
| |-> 3.98e-02 sec 15.7% 94.0% 0.0% 0.6% 7.55e+01 1 ArborX::BVH::query::nearest [region]
| |-> 3.72e-02 sec 14.7% 100.0% 0.0% ------ 1 ArborX::Experimental::TreeTraversal::OrderedNearestPredicate [for]
| |-> 2.31e-03 sec 0.9% 6.8% 0.0% 93.2% 8.66e+02 1 ArborX::BVH::query::nearest::compute_permutation [region]
|-> 5.69e-03 sec 2.2% 31.4% 0.0% 0.3% 1.06e+03 1 problem_setup [region]
| |-> 5.33e-03 sec 2.1% 31.6% 0.0% 68.4% 9.39e+02 1 make_rays [region]
| | |-> 7.40e-04 sec 0.3% 100.0% 0.0% ------ 1 initialize_rays [for]
| | |-> 5.57e-04 sec 0.2% 100.0% 0.0% ------ 1 "Kokkos::Random_XorShift64::state"="Kokkos::Random_XorShift64::state_mirror" [copy]
| | |-> 3.37e-04 sec 0.1% 100.0% 0.0% ------ 1 "Kokkos::Random_XorShift64::locks"="Kokkos::Random_XorShift64::locks_mirror" [copy]
| |-> 3.44e-04 sec 0.1% 29.2% 0.0% 70.8% 2.90e+03 1 make_grid [region]
|-> 2.08e-03 sec 0.8% 25.1% 0.0% 14.0% 3.36e+03 1 ArborX::BVH::BVH [region]
| |-> 1.02e-03 sec 0.4% 2.4% 0.0% 97.6% 9.83e+02 1 ArborX::BVH::BVH::sort_linearized_order [region]
| |-> 4.30e-04 sec 0.2% 89.3% 0.0% 17.7% 9.31e+03 1 ArborX::BVH::BVH::generate_hierarchy [region]
| | |-> 2.93e-04 sec 0.1% 100.0% 0.0% ------ 1 ArborX::TreeConstruction::generate_hierarchy [for]
which is pretty similar (3.98e-02s vs. 2.02e-01s).
For 48x48x48, we have
TOTAL TIME: 0.133111 seconds
TOP-DOWN TIME TREE:
<average time> <percent of total time> <percent time in Kokkos> <percent MPI imbalance> <remainder> <kernels per second> <number of calls> <name> [type]
===================
|-> 1.01e-01 sec 76.1% 35.3% 0.0% 60.4% 1.48e+02 1 second_approach [region]
| |-> 3.60e-02 sec 27.1% 88.0% 0.0% 0.0% 3.33e+02 1 ArborX::query [region]
| | |-> 3.60e-02 sec 27.1% 88.0% 0.0% 1.2% 3.33e+02 1 ArborX::CrsGraphWrapper::query::spatial [region]
| | |-> 3.30e-02 sec 24.8% 95.9% 0.0% 1.2% 2.73e+02 1 ArborX::CrsGraphWrapper::two_pass [region]
| | | |-> 1.70e-02 sec 12.8% 95.3% 0.0% 4.5% 1.77e+02 1 ArborX::CrsGraphWrapper::two_pass:second_pass [region]
| | | | |-> 1.62e-02 sec 12.1% 99.8% 0.0% 0.2% 6.19e+01 1 ArborX::BVH::query::spatial [region]
| | | | | |-> 1.61e-02 sec 12.1% 100.0% 0.0% ------ 1 ArborX::TreeTraversal::spatial [for]
| | | |-> 9.60e-03 sec 7.2% 99.6% 0.0% 0.2% 1.04e+02 1 ArborX::CrsGraphWrapper::two_pass::first_pass [region]
| | | | |-> 9.59e-03 sec 7.2% 99.7% 0.0% 0.3% 1.04e+02 1 ArborX::BVH::query::spatial [region]
| | | | |-> 9.56e-03 sec 7.2% 100.0% 0.0% ------ 1 ArborX::TreeTraversal::spatial [for]
| | | |-> 5.94e-03 sec 4.5% 97.8% 0.0% 2.2% 6.73e+02 1 ArborX::CrsGraphWrapper::first_pass_postprocess [region]
| | | | |-> 5.72e-03 sec 4.3% 100.0% 0.0% ------ 1 ArborX::Algorithms::exclusive_scan [scan]
| | |-> 2.41e-03 sec 1.8% 4.5% 0.0% 95.5% 8.31e+02 1 ArborX::CrsGraphWrapper::query::spatial::compute_permutation [region]
| | |-> 2.43e-04 sec 0.2% 6.2% 0.0% 93.8% 4.11e+03 1 ArborX::CrsGraphWrapper::query::spatial::init_and_alloc [region]
| |-> 3.70e-03 sec 2.8% 100.0% 0.0% ------ 1 deposit_energy [for]
| |-> 3.96e-04 sec 0.3% 100.0% 0.0% ------ 1 ArborX::Algorithms::iota [for]
|-> 2.13e-02 sec 16.0% 91.1% 0.0% 0.0% 1.41e+02 1 first_approach [region]
| |-> 2.13e-02 sec 16.0% 91.1% 0.0% 0.9% 1.41e+02 1 ArborX::BVH::query::nearest [region]
| |-> 1.93e-02 sec 14.5% 100.0% 0.0% ------ 1 ArborX::Experimental::TreeTraversal::OrderedNearestPredicate [for]
| |-> 1.80e-03 sec 1.4% 6.1% 0.0% 93.9% 1.11e+03 1 ArborX::BVH::query::nearest::compute_permutation [region]
|-> 5.38e-03 sec 4.0% 28.9% 0.0% 0.3% 1.12e+03 1 problem_setup [region]
| |-> 4.81e-03 sec 3.6% 30.7% 0.0% 69.3% 1.04e+03 1 make_rays [region]
| | |-> 6.76e-04 sec 0.5% 100.0% 0.0% ------ 1 "Kokkos::Random_XorShift64::state"="Kokkos::Random_XorShift64::state_mirror" [copy]
| | |-> 4.09e-04 sec 0.3% 100.0% 0.0% ------ 1 initialize_rays [for]
| | |-> 3.39e-04 sec 0.3% 100.0% 0.0% ------ 1 "Kokkos::Random_XorShift64::locks"="Kokkos::Random_XorShift64::locks_mirror" [copy]
| |-> 5.57e-04 sec 0.4% 13.9% 0.0% 86.1% 1.79e+03 1 make_grid [region]
|-> 1.77e-03 sec 1.3% 24.1% 0.0% 10.9% 3.96e+03 1 ArborX::BVH::BVH [region]
| |-> 8.78e-04 sec 0.7% 2.4% 0.0% 97.6% 1.14e+03 1 ArborX::BVH::BVH::sort_linearized_order [region]
| |-> 3.74e-04 sec 0.3% 82.9% 0.0% 24.9% 1.07e+04 1 ArborX::BVH::BVH::generate_hierarchy [region]
| | |-> 2.21e-04 sec 0.2% 100.0% 0.0% ------ 1 ArborX::TreeConstruction::generate_hierarchy [for]
| |-> 2.44e-04 sec 0.2% 13.2% 0.0% 86.8% 4.11e+03 1 ArborX::BVH::BVH::compute_linear_ordering [region]
meaning 2.13e-02s vs. 1.01e-01s.
test/tstQueryTreeRay.cpp
Outdated
template <typename DeviceType> | ||
struct AccessTraits<BoxesIntersectedByRayOrdered<DeviceType>, | ||
ArborX::PredicatesTag> |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Avoid defining these away from their being used
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
test/tstQueryTreeRay.cpp
Outdated
KOKKOS_FUNCTION void operator()(Predicate const &, | ||
int const primitive_index) const | ||
{ | ||
_ordered_intersections(_i++) = primitive_index; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The incrementing does not need atomicity?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There is only one predicate and every predicate is processed by a single thread so it's not really necessary to use atomics here.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this is a bug begging to happen
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I mean it's still a variable local to the predicate even if there were more but sure I'll make it atomic, see 03e716f
test/tstQueryTreeRay.cpp
Outdated
Kokkos::View<ArborX::Box *, DeviceType> device_boxes("boxes", 10); | ||
Kokkos::deep_copy(exec_space, device_boxes, | ||
Kokkos::View<ArborX::Box *, Kokkos::HostSpace>( | ||
boxes.data(), boxes.size())); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ArborX/test/ArborXTest_StdVectorToKokkosView.hpp
Lines 23 to 24 in f60fc2c
template <typename DeviceType, typename T> | |
auto toView(std::vector<T> const &v, std::string const &lbl = "Test::Unnamed") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
test/tstQueryTreeRay.cpp
Outdated
auto const host_ordered_intersections = Kokkos::create_mirror_view_and_copy( | ||
Kokkos::HostSpace{}, device_ordered_intersections); | ||
for (unsigned int i = 0; i < n; ++i) | ||
BOOST_TEST(host_ordered_intersections(i) == i); | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Take advantage of our Kokkos::View
adapters and the Boost native support to compare collection of values (grep for per_element
)
We try to keep boiler plate code minimal to focus on the test content.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I didn't because it's more concise this way IMHO but sure I'll change, see 03e716f
Kokkos::View<int *> offsets("offsets", 0); | ||
bvh.query(exec_space, RaysSecond<MemorySpace>{rays}, | ||
AccumRaySphereOptDist<MemorySpace>{boxes}, values, offsets); | ||
ArborX::Details::sortObjects(exec_space, values); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
For a better comparison, one should create a vector of long long
or double
with each entry encoding a combination of key
and rid
. Then use it to produce a permutation. This would run about 2.5x faster than the sort on these structs with 4 values.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Using a struct with only the required variables for sorting, shaves off like 0.5s. For a total of 1.13s for the original approach (vs 0.42s for the new approach).
old | new | |
---|---|---|
sorting | 1.08e+00s | 4.45e-01s |
deposit_energy | 8.62e-02s | 1.55e-01s |
total | 1.66se+00s | 1.07e+00s |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Partial review
src/details/ArborX_Predicates.hpp
Outdated
{ | ||
template <typename Geometry> | ||
KOKKOS_INLINE_FUNCTION OrderedIntersection<Geometry> | ||
ordered_nearest(Geometry const &geometry) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It needs to say "intersect"
I am tempted to define a tag Experimental::Ordered
and define
intersects(Experimental::Ordered, Geometry const&)
src/details/ArborX_Predicates.hpp
Outdated
@@ -22,6 +22,11 @@ struct NearestPredicateTag | |||
struct SpatialPredicateTag | |||
{}; | |||
} // namespace Details | |||
namespace Experimental | |||
{ | |||
struct OrderedIntersectionPredicateTag |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
struct OrderedIntersectionPredicateTag | |
struct OrderedSpatialPredicateTag |
src/details/ArborX_Callbacks.hpp
Outdated
std::is_same<PredicateTag, NearestPredicateTag>{}, | ||
std::is_same<PredicateTag, NearestPredicateTag>{} || | ||
std::is_same<PredicateTag, | ||
Experimental::OrderedIntersectionPredicateTag>{}, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It does not make sense you are using invoke_callback_and_check_early_exit
in the traversal.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
As I understand it, the changes to <ArborX_Ray.hpp>
can be dropped from this PR
test/tstQueryTreeRay.cpp
Outdated
ArborX::Experimental::Ray ray{ | ||
ArborX::Point{0, 0, 0}, | ||
ArborX::Experimental::Vector{1. / n, 1. / n, 1. / n}}; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can we please expand this test by shooting a second ray from (n, n, n)
with direction (-1/n, -1/n, -1/n)
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
!(std::is_same<PredicateTag, SpatialPredicateTag>{} || | ||
std::is_same<PredicateTag, | ||
Experimental::OrderedSpatialPredicateTag>{}) || |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Shouldn't this be
!(std::is_same<PredicateTag, SpatialPredicateTag>{} || | |
std::is_same<PredicateTag, | |
Experimental::OrderedSpatialPredicateTag>{}) || | |
(std::is_same<PredicateTag, SpatialPredicateTag>{} || | |
std::is_same<PredicateTag, | |
Experimental::OrderedSpatialPredicateTag>{}) && |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This says as much as if the predicate is spatial
or ordered_spatial
, the return type must be void
or CallbackTreeTraversalControl
.
The next one says: if the predicate is nearest
, the return type must be void
.
I think that I didn't change what we check (except for adding the new predicate, of course) but made both static_asserts
more self-contained (in that we don't mention other predicates than the ones we want to impose a condition on).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ah, you are right. It was a bit hard to read for me. This is an optional suggestion then.
@dalg24 Do you have any outstanding concerns about this PR? |
No |
@masterleinad Could you please clean up the history before merge? |
Here you go! |
Retest this please. |
It makes sense to replace the current example since the new one covers more ground, but I could copy some more comments if necessary.
I decided to keep both variants since I find it instructional to see the performance difference.
ThePriorityBasedTreeTraversal
still needs to be moved.