Add option to disable automatic barriers #484

Closed
stgeke opened this issue Feb 24, 2021 · 9 comments · Fixed by #544
Labels
feature Use this label to request a new feature!

Comments

@stgeke
Contributor

stgeke commented Feb 24, 2021

No description provided.

@kris-rowe
Member

To clarify the intended usage: are you looking for a way to disable all automatic barriers for a kernel (e.g., by passing kernel properties), or to disable specific automatic barriers within a kernel (e.g., using an attribute like @nobarrier, similar to the nowait clause in OpenMP)?

@stgeke
Contributor Author

stgeke commented Sep 14, 2021

Currently a @barrier is added after an inner block. It would be nice to have an option to disable this.

@kris-rowe
Member

Correct, I was curious which type of mechanism would be most useful in your case for stopping this.

Decorating the outermost @inner loop with an attribute like @nobarrier (either at the loop definition or immediately after the loop) would provide the most fine-grained control, since other inner blocks within the same kernel would still have barriers inserted. It would also satisfy the principle of least surprise for anyone else reading the kernel.

Passing a flag through the kernel properties would be convenient if the programmer wanted to disable barriers for a kernel with several inner blocks. However, it wouldn't be obvious to anyone reading the untranslated kernel source that this was happening.

@noelchalmers
Contributor

A small clarification: a barrier is added automatically after an inner block only if that block uses shared memory (shmem) and is not the last inner block.

We used to have the option to disable automatic barrier insertion, but I agree with Kris that this probably isn't ideal, since it's a pretty heavy toggle to set for entire okl file(s). These days I just fuse inner blocks when I don't want the barrier.
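To make the rule and the fusing workaround concrete, here is a minimal CUDA sketch of what the two cases amount to after translation. It is not actual OCCA output: the kernel names, the tile size `TILE`, and the host driver are all hypothetical, and each "phase" stands in for one @inner block in the OKL source.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

constexpr int TILE = 64;  // assumed inner size for this sketch

// Corresponds to two separate @inner blocks. Phase 1 writes shared memory
// and is not the last block, so this is the case where a barrier is needed
// (and where the automatic barrier would be placed).
__global__ void reverseTwoPhase(const float *in, float *out) {
  __shared__ float tile[TILE];
  const int t = threadIdx.x;
  const int base = blockIdx.x * TILE;

  tile[t] = in[base + t];              // phase 1: write shared memory
  __syncthreads();                     // barrier between the two phases
  out[base + t] = tile[TILE - 1 - t];  // phase 2: read another thread's slot
  // No barrier follows the last phase.
}

// Corresponds to a single fused @inner block. Each thread reads back only
// the slot it wrote itself, so no barrier is required between the steps.
__global__ void scaleFused(const float *in, float *out) {
  __shared__ float scratch[TILE];
  const int t = threadIdx.x;
  const int base = blockIdx.x * TILE;

  scratch[t] = 2.0f * in[base + t];
  out[base + t] = scratch[t] + 1.0f;
}

int main() {
  const int n = 4 * TILE;
  float *in = nullptr, *out = nullptr;
  cudaMallocManaged(&in, n * sizeof(float));
  cudaMallocManaged(&out, n * sizeof(float));
  for (int i = 0; i < n; ++i) in[i] = static_cast<float>(i);

  reverseTwoPhase<<<n / TILE, TILE>>>(in, out);
  cudaDeviceSynchronize();
  printf("reverseTwoPhase: out[0] = %g\n", out[0]);

  scaleFused<<<n / TILE, TILE>>>(in, out);
  cudaDeviceSynchronize();
  printf("scaleFused:      out[0] = %g\n", out[0]);

  cudaFree(in);
  cudaFree(out);
  return 0;
}
```

Fusing is only safe in the second kernel because no thread depends on a shared-memory value written by a different thread; removing the barrier from the first kernel would introduce a race.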

@stgeke
Contributor Author

stgeke commented Sep 14, 2021

A related issue: if the inner size <= warpSize, a warp-wide barrier should be added. Currently no @barrier is added at all. That's tricky at least for Nvidia's Volta and later architectures (you can no longer assume that the threads in a warp run in lock-step).

@noelchalmers
Contributor

Can you give an example of this? Do you have individual lanes of the warp trying to communicate through global memory? I haven't seen any use for __syncwarp aside from that scenario.

The normal __syncthreads is identical to __syncwarp when inner size <= warp size.

@stgeke
Contributor Author

stgeke commented Sep 14, 2021

My bigger concern is that at the moment no barriers are added at all. Maybe I recall incorrectly?
Isn't __syncwarp faster than __syncthreads?

@noelchalmers
Contributor

There are no syncwarps added currently, that's correct. But a barrier like that should only be added when it is actually needed. I'm curious where specifically you think the barrier is needed. Right now you could just rely on splitting inner blocks and getting coherency through the usual syncthreads.

Is syncwarp faster than syncthreads? It depends on the usage. They're likely comparable in time if you have to wait on global memory fences. If the thread block is truly made of warps that don't share data with one another (so syncthreads isn't needed) but do share data between the lanes of a warp, then yes, there's probably an opportunity to let some warps progress while others wait at a barrier. I don't think that's common, however. Is that what you need to happen?
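For reference, the scenario described above (warps that are independent of each other, with lanes sharing data inside each warp) could look like the following CUDA sketch. The kernel name `warpSums`, the block size, and the host driver are assumptions made for illustration; nothing here is generated by OCCA. Each warp reduces its own 32-element slice of shared memory and only ever synchronizes with `__syncwarp()`.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

constexpr int WARP = 32;
constexpr int WARPS_PER_BLOCK = 4;
constexpr int BLOCK = WARP * WARPS_PER_BLOCK;  // 128 threads per block

// Each warp sums its own 32 values. Warps never touch each other's slice of
// shared memory, so a block-wide __syncthreads() is not required.
__global__ void warpSums(const float *in, float *out) {
  __shared__ float buf[BLOCK];
  const int t = threadIdx.x;
  const int lane = t % WARP;

  buf[t] = in[blockIdx.x * BLOCK + t];
  __syncwarp();  // make every lane's write visible within the warp

  for (int stride = WARP / 2; stride > 0; stride /= 2) {
    // Only the lower half writes; the slots it reads belong to lanes that
    // do not write in this step, so there is no race inside a step.
    if (lane < stride) buf[t] += buf[t + stride];
    __syncwarp();  // order this step's writes before the next step's reads
  }

  if (lane == 0) out[blockIdx.x * WARPS_PER_BLOCK + t / WARP] = buf[t];
}

int main() {
  const int blocks = 2;
  const int n = blocks * BLOCK;
  float *in = nullptr, *out = nullptr;
  cudaMallocManaged(&in, n * sizeof(float));
  cudaMallocManaged(&out, blocks * WARPS_PER_BLOCK * sizeof(float));
  for (int i = 0; i < n; ++i) in[i] = 1.0f;

  warpSums<<<blocks, BLOCK>>>(in, out);
  cudaDeviceSynchronize();
  printf("sum computed by warp 0: %g\n", out[0]);

  cudaFree(in);
  cudaFree(out);
  return 0;
}
```

On Volta and later, the `__syncwarp()` calls are what make the shared-memory exchange between lanes well defined, since lanes of a warp are no longer guaranteed to run in lock-step.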

@kris-rowe
Member

> A related issue: if the inner size <= warpSize, a warp-wide barrier should be added. Currently no @barrier is added at all. That's tricky at least for Nvidia's Volta and later architectures (you can no longer assume that the threads in a warp run in lock-step).

This is also relevant for OpenCL and SYCL/DPC++, since the innermost @inner loop will be mapped to a sub-group. The newer versions of these standards support sub-group barriers. I have opened a separate issue (#516) for this.

3 participants