Description
In /cpp/src_prims/common/grid_sync.cuh, only the thread for which masterThread() is true executes a threadfence (line 191).
According to the CUDA documentation, __threadfence() only orders the memory accesses made by the calling thread.
For threads that do not execute the fence, there is no guarantee that the global writes they made before the barrier are visible to all other threads after it.
The threadfence at line 192 should be moved out of the if condition to line 189, so that every thread participating in the barrier executes it.
An example can be seen in CUB's equivalent implementation of a grid barrier.
https://github.com/NVIDIA/cub/blob/main/cub/grid/grid_barrier.cuh
You'll notice a threadfence at line 78 executed by all threads participating in the barrier.
The comment at line 76 also confirms that this fence is to ensure the visibility of global writes.
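
To illustrate the proposed change, here is a minimal sketch of a one-shot grid barrier in which every thread executes the fence before its block signals arrival. This is not the code from grid_sync.cuh or from CUB: the names gridBarrierSketch, exchangeKernel, countMismatches and the arrived counter are hypothetical, and the sketch assumes the grid is small enough for all blocks to be resident on the device at the same time (otherwise the spin loop can deadlock).

```cuda
#include <cuda_runtime.h>
#include <vector>

// Hypothetical one-shot grid barrier (illustration only).
// Assumes all blocks of the grid are co-resident on the device.
__device__ void gridBarrierSketch(unsigned int* arrived, unsigned int numBlocks)
{
  // Executed by EVERY thread: orders this thread's earlier global writes
  // before the arrival signal issued below by thread 0 of the block.
  __threadfence();
  // Wait until all threads of the block have fenced.
  __syncthreads();
  if (threadIdx.x == 0) {
    // Signal that this block has arrived.
    atomicAdd(arrived, 1u);
    // Spin until every block has signalled (atomic read of the counter).
    while (atomicAdd(arrived, 0u) < numBlocks) {}
  }
  // Release the remaining threads of the block.
  __syncthreads();
}

// Each thread writes one element, crosses the barrier, then reads an
// element that was written by a thread in a different block.
__global__ void exchangeKernel(int* data, int* out, unsigned int* arrived)
{
  int tid = blockIdx.x * blockDim.x + threadIdx.x;
  int n   = gridDim.x * blockDim.x;
  data[tid] = tid;
  gridBarrierSketch(arrived, gridDim.x);
  out[tid] = data[(tid + blockDim.x) % n];
}

// Host helper: returns the number of wrong elements, or -1 on a CUDA error.
int countMismatches()
{
  const int blocks = 4, threads = 64, n = blocks * threads;
  int* data = nullptr;
  int* out = nullptr;
  unsigned int* arrived = nullptr;
  if (cudaMalloc(&data, n * sizeof(int)) != cudaSuccess) return -1;
  if (cudaMalloc(&out, n * sizeof(int)) != cudaSuccess) return -1;
  if (cudaMalloc(&arrived, sizeof(unsigned int)) != cudaSuccess) return -1;
  cudaMemset(arrived, 0, sizeof(unsigned int));

  exchangeKernel<<<blocks, threads>>>(data, out, arrived);

  // cudaMemcpy waits for the kernel to finish and reports launch errors.
  std::vector<int> host(n);
  cudaError_t err =
    cudaMemcpy(host.data(), out, n * sizeof(int), cudaMemcpyDeviceToHost);
  cudaFree(data);
  cudaFree(out);
  cudaFree(arrived);
  if (err != cudaSuccess) return -1;

  int mismatches = 0;
  for (int i = 0; i < n; ++i) {
    if (host[i] != (i + threads) % n) ++mismatches;
  }
  return mismatches;
}
```

Calling countMismatches() from a host main() should return 0 when every thread observes the value written by its peer. Note that with the fence kept inside the if (threadIdx.x == 0) branch, the same test may still pass on a given GPU; the point of the issue is that the programming model no longer guarantees it.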