
[gpu] Replace volatile shared memory with shfl_sync in KNNSearch #4306

Merged
merged 5 commits into PointCloudLibrary:master on Aug 11, 2020

Conversation

haritha-j
Contributor

Fix for a distance calculation issue caused by the use of volatile shared memory in the GPU K nearest neighbour search.

@haritha-j haritha-j added the priority: gsoc, module: gpu, and changelog: fix labels Aug 5, 2020
@haritha-j haritha-j marked this pull request as draft August 6, 2020 07:06
@haritha-j haritha-j marked this pull request as ready for review August 6, 2020 07:46
@larshg
Contributor

larshg commented Aug 6, 2020

These lines here:

//broadcast beg
if (active_lane == laneId)
*warp_buffer = batch.octree.begs[node_idx];
int beg = *warp_buffer;
//broadcast end
if (active_lane == laneId)
*warp_buffer = batch.octree.ends[node_idx];
int end = *warp_buffer;
float3 active_query;
volatile float* warp_buffer_float = (float*)&per_warp_buffer[warpId];
//broadcast warp_query
if (active_lane == laneId)
*warp_buffer_float = query.x;
active_query.x = *warp_buffer_float;
if (active_lane == laneId)
*warp_buffer_float = query.y;
active_query.y = *warp_buffer_float;
if (active_lane == laneId)
*warp_buffer_float = query.z;
active_query.z = *warp_buffer_float;
//broadcast query_index
if (active_lane == laneId)
*warp_buffer = query_index;
float active_query_index = *warp_buffer;

These can also be shared using the __shfl_sync instructions.
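For illustration, a minimal sketch of what that broadcast could look like with __shfl_sync instead of the volatile shared-memory round trip. This is not the code that was merged; it assumes the same surrounding variables as the snippet above (active_lane, laneId, node_idx, batch, query, query_index):

constexpr unsigned FULL_MASK = 0xFFFFFFFF;

// only the active lane reads the octree node; the values are then
// broadcast from that lane to every lane in the warp
int beg = 0;
int end = 0;
if (active_lane == laneId)
{
    beg = batch.octree.begs[node_idx];
    end = batch.octree.ends[node_idx];
}
beg = __shfl_sync(FULL_MASK, beg, active_lane);
end = __shfl_sync(FULL_MASK, end, active_lane);

// every lane holds its own query, so the active lane's query and index
// can be shuffled out directly
float3 active_query;
active_query.x = __shfl_sync(FULL_MASK, query.x, active_lane);
active_query.y = __shfl_sync(FULL_MASK, query.y, active_lane);
active_query.z = __shfl_sync(FULL_MASK, query.z, active_lane);
const int active_query_index = __shfl_sync(FULL_MASK, query_index, active_lane);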

@@ -222,87 +222,50 @@ namespace pcl { namespace device { namespace knn_search
}

template<int CTA_SIZE>
-__device__ __forceinline__ int NearestWarpKernel(const float* points, int points_step, int length, const float3& active_query, float& dist)
+__device__ __forceinline__ int NearestWarpKernel(const int beg, const int points_step, const int length, const float3& active_query, float& dist)
Member

No out parameters if possible; use return values. Also remove the __forceinline__ from here, this method/function is huge.

Member

Can we tidy up the indentation in this function? It's looking slightly chaotic. Run this function through clang-format.

@SergioRAgostinho
Member
SergioRAgostinho commented Aug 6, 2020

Should points_step be renamed to field_step? It seems to be measuring the byte spacing between the x, y, and z fields.

float mind2 = dist2[tid];
//find minimum distance among warp threads
constexpr unsigned FULL_MASK = 0xFFFFFFFF;
static_assert(sizeof(KernelPolicy::WARP_SIZE) <= sizeof(unsigned int));
Member

This check is wrong. WARP_SIZE represents the number of bits in the mask; sizeof gives you the number of bytes. You also need an error message here explaining to the user what is wrong if the static_assert is violated.

Suggested change
-static_assert(sizeof(KernelPolicy::WARP_SIZE) <= sizeof(unsigned int));
+static_assert(KernelPolicy::WARP_SIZE <= 8*sizeof(unsigned int));
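For reference, a hedged sketch of that suggestion with the requested diagnostic message added (the message text is mine, not from the PR):

static_assert(KernelPolicy::WARP_SIZE <= 8 * sizeof(unsigned int),
              "WARP_SIZE must not exceed the number of bits in the unsigned int shuffle mask");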

Comment on lines 252 to 253
float next = __shfl_down_sync(FULL_MASK, dist2, bit_offset);
int next_index = __shfl_down_sync(FULL_MASK, index, bit_offset);
Member

const

}
//retrieve index and dist2
index = __shfl_sync(FULL_MASK, index, 0);
dist2 = __shfl_sync(FULL_MASK, dist2, 0);
Member

no need for this temp.

//retrieve index and dist2
index = __shfl_sync(FULL_MASK, index, 0);
dist2 = __shfl_sync(FULL_MASK, dist2, 0);
dist = sqrt(dist2);
Member

const (in case it stops being an out parameter)

dist = sqrt(dist2[tid - lane]);
return index[tid - lane];
}
return index;
Member

Suggested change
-return index;
+return std::make_pair(index, dist);
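Pulling the reviewed fragments together, here is a minimal self-contained sketch of the warp-wide min-reduction these comments describe. It is a hypothetical helper, not the merged NearestWarpKernel, and assumes full 32-lane warps and that std::pair is usable in device code, as in the suggestion above:

#include <utility> // std::pair, std::make_pair

__device__ std::pair<int, float> warpNearest(float dist2, int index)
{
  constexpr unsigned FULL_MASK = 0xFFFFFFFF;

  // fold the minimum squared distance (and its index) towards lane 0
  for (int bit_offset = 16; bit_offset > 0; bit_offset /= 2)
  {
    const float next = __shfl_down_sync(FULL_MASK, dist2, bit_offset);
    const int next_index = __shfl_down_sync(FULL_MASK, index, bit_offset);
    if (next < dist2)
    {
      dist2 = next;
      index = next_index;
    }
  }

  // lane 0 now holds the minimum; broadcast it so every lane returns the same pair
  index = __shfl_sync(FULL_MASK, index, 0);
  dist2 = __shfl_sync(FULL_MASK, dist2, 0);
  return std::make_pair(index, sqrt(dist2));
}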

@haritha-j
Contributor Author

haritha-j commented Aug 8, 2020

I'm receiving the pair returned from NearestWarpKernel using auto here, which makes the next bit of code a bit less readable, but std::tie doesn't seem to work properly with CUDA __device__ functions, and auto [offset, index] is only supported from C++17 from what I read.

const auto nearestPoint = NearestWarpKernel<KernelPolicy::CTA_SIZE>(beg, batch.points_step, end - beg, active_query);

if (active_lane == laneId)
    if (min_distance > nearestPoint.second)
    {
        min_distance = nearestPoint.second;
        min_idx = beg + nearestPoint.first;
    }
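For comparison, the C++17 structured-binding form mentioned above would read roughly as follows (not usable here since the code targeted C++14 at the time; the variable names are illustrative):

const auto [offset, distance] = NearestWarpKernel<KernelPolicy::CTA_SIZE>(beg, batch.points_step, end - beg, active_query);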

@kunaltyagi
Member

kunaltyagi commented Aug 9, 2020

If the return type doesn't matter to the API/ABI (or explicitly needs to be given no API/ABI guarantee), you can use the following super-legal code

auto myFunction(int a, double b) -> decltype(auto) {
  struct GuessTheReturnType {
    bool correct = false;
    float magic = 4.2e1;
  };
  // explicit conversions; braced init would reject the narrowing int->bool and double->float
  return GuessTheReturnType{a != 0, static_cast<float>(b)};
}

// usage
const auto& val = myFunction(9, 33.);
if (val.correct) { return val.magic; }

@SergioRAgostinho
Member

@larshg please merge in case you're ok with the changes.

@larshg
Contributor

larshg commented Aug 10, 2020

https://github.com/PointCloudLibrary/pcl/blob/master/cmake/pcl_find_cuda.cmake

This one should probably also be revised soon. Not sure what CUDA version is required for the __shfl_sync instructions etc.

Right now it adds many CUDA arch versions, and with a new configure it added 3.0 etc., which failed to build.

@SergioRAgostinho
Member

https://github.com/PointCloudLibrary/pcl/blob/master/cmake/pcl_find_cuda.cmake

This one should probably also be revised soon. Not sure what CUDA version is required for the __shfl_sync instructions etc.

Right now it adds many CUDA arch versions, and with a new configure it added 3.0 etc., which failed to build.

Good catch. They were introduced in CUDA 9. CUDA's C++14 support was also only introduced in CUDA 9.

@PointCloudLibrary/maintainers any opposition to bumping CUDA's minimum version to 9?
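As an illustration of what such a requirement could look like at compile time (a hypothetical guard, not part of this PR), a translation unit using the *_sync shuffle intrinsics could fail early with a clear message when built with an older toolkit:

#if defined(__CUDACC__) && defined(__CUDACC_VER_MAJOR__) && (__CUDACC_VER_MAJOR__ < 9)
#error "__shfl_sync / __shfl_down_sync require CUDA 9.0 or newer"
#endif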

@larshg
Contributor

larshg commented Aug 11, 2020

Please go ahead and merge @SergioRAgostinho. I don't have those powers 😄

@SergioRAgostinho SergioRAgostinho merged commit c0c0cb2 into PointCloudLibrary:master Aug 11, 2020