-
-
Notifications
You must be signed in to change notification settings - Fork 4.6k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[gpu] Replace volatile shared memory with shfl_sync in KNNSearch #4306
Conversation
These lines here: pcl/gpu/octree/src/cuda/knn_search.cu Lines 181 to 210 in b0ff2df
Can also be shared using the shfl instructions. |
gpu/octree/src/cuda/knn_search.cu
Outdated
@@ -222,87 +222,50 @@ namespace pcl { namespace device { namespace knn_search | |||
} | |||
|
|||
template<int CTA_SIZE> | |||
__device__ __forceinline__ int NearestWarpKernel(const float* points, int points_step, int length, const float3& active_query, float& dist) | |||
__device__ __forceinline__ int NearestWarpKernel(const int beg, const int points_step, const int length, const float3& active_query, float& dist) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
no out parameters if possible. use return variables. also remove the __forceinline__
from here. this method/function is huge.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can we tidy up indentation in this function. it's looking slightly chaotic. run this function through clang-format.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
should point_step
be renamed to field_step
it seems to measuring byte spacing between x, y, and z fields.
gpu/octree/src/cuda/knn_search.cu
Outdated
float mind2 = dist2[tid]; | ||
//find minimum distance among warp threads | ||
constexpr unsigned FULL_MASK = 0xFFFFFFFF; | ||
static_assert(sizeof(KernelPolicy::WARP_SIZE) <= sizeof(unsigned int)); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this check is wrong. WARP_SIZE
represents the number of bits in the mask. sizeof
gives you the number of bytes. you also need an error message here explaining the user what is wrong if the static_assert is violated.
static_assert(sizeof(KernelPolicy::WARP_SIZE) <= sizeof(unsigned int)); | |
static_assert(KernelPolicy::WARP_SIZE <= 8*sizeof(unsigned int)); |
gpu/octree/src/cuda/knn_search.cu
Outdated
float next = __shfl_down_sync(FULL_MASK, dist2, bit_offset); | ||
int next_index = __shfl_down_sync(FULL_MASK, index, bit_offset); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
const
gpu/octree/src/cuda/knn_search.cu
Outdated
} | ||
//retrieve index and dist2 | ||
index = __shfl_sync(FULL_MASK, index, 0); | ||
dist2 = __shfl_sync(FULL_MASK, dist2, 0); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
no need for this temp.
gpu/octree/src/cuda/knn_search.cu
Outdated
//retrieve index and dist2 | ||
index = __shfl_sync(FULL_MASK, index, 0); | ||
dist2 = __shfl_sync(FULL_MASK, dist2, 0); | ||
dist = sqrt(dist2); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
const (in case it stops being an out parameter)
gpu/octree/src/cuda/knn_search.cu
Outdated
dist = sqrt(dist2[tid - lane]); | ||
return index[tid - lane]; | ||
} | ||
return index; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
return index; | |
return std::make_pair(index, dist); |
I'm recieving the const auto nearestPoint = NearestWarpKernel<KernelPolicy::CTA_SIZE>(beg, batch.points_step, end - beg, active_query);
if (active_lane == laneId)
if (min_distance > nearestPoint.second)
{
min_distance = nearestPoint.second;
min_idx = beg + nearestPoint.first;
} |
If the return type doesn't matter to the API/ABI (or explicitly needs to be given no API/ABI guarantee), you can use the following super-legal code auto myFunction(int a, double b) -> decltype(auto) {
struct GuessTheReturnType {
bool correct = false;
float magic = 4.2e1;
};
return GuessTheReturnType{a, b};
}
// usage
auto& val = myFunction(9, 33.);
if (val.correct) { return val.magic; } |
@larshg please merge in case you're ok with the changes. |
https://github.com/PointCloudLibrary/pcl/blob/master/cmake/pcl_find_cuda.cmake This one should probably also be revised soon. Not sure what cuda version is required for __shfl_sync instructions etc. Right now it adds many cuda arch versions and with a new configure, it added 3.0 etch which failed to build. |
Good catch. They were introduced in CUDA 9. CUDA C+14 support was also only introduced in CUDA 9. @PointCloudLibrary/maintainers any opposition to bump CUDA's minimum version to 9? |
Please go ahead and merge @SergioRAgostinho. I don't have those powers 😄 |
Fix for distance calculation issue caused by the use of volatile shared memory in GPU K nearest neighbour search.