-
Notifications
You must be signed in to change notification settings - Fork 241
Prefix scan using warp primitives and fallback for non primitive types #287
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: master
Are you sure you want to change the base?
Conversation
@maleadt |
Yeah, the same argument can be made for |
No longer requires a neutral element. It now computes exclusive scan which it didn't earlier. Requires more tests for the new features but I don't know what kind of tests should I have for this. Should I write a CPU version and compare against that or can I make special cases and check for certain values. |
Fixed, no races detected anymore. The error was in my usage of # new version
julia> A = CUDA.ones(Int128, 100_000_000);
julia> B = similar(A);
julia> CUDA.@time CUDA.scan!(+, B, A, dims=1);
0.152922 seconds (978 CPU allocations: 25.609 KiB) (2 GPU allocations: 2.654 MiB, 0.14% gc time of which 94.00% spent allocating)
julia> A = CUDA.ones(Float32, 100_000_000);
julia> B = similar(A);
julia> CUDA.@time CUDA.scan!(+, B, A);
0.036084 seconds (616 CPU allocations: 19.922 KiB) (2 GPU allocations: 381.848 KiB, 0.03% gc time) # current master branch
julia> A = CUDA.ones(Int128, 100_000_000);
julia> B = similar(A);
julia> CUDA.@time CUDA.scan!(+, B, A, dims=1);
0.077795 seconds (743 CPU allocations: 24.125 KiB) (2 GPU allocations: 1.492 MiB, 0.35% gc time of which 95.43% spent allocating)
julia> A = CUDA.ones(Float32, 100_000_000);
julia> B = similar(A);
julia> CUDA.@time CUDA.scan!(+, B, A, dims=1);
0.044769 seconds (630 CPU allocations: 20.281 KiB) (2 GPU allocations: 381.848 KiB, 0.33% gc time of which 91.34% spent allocating) The cost of |
The reason for the poor performance turned out to be missing
Is is worth it to keep Belloch for this minor improvement ? How do I write tests for the exclusive version ? Since |
Did an oopsie and removed the inbounds from
|
I'm seeing lower, but nonetheless nice improvements :-) I'm a bit wary not all corner cases are adequately tested (e.g. exclusive=true), but we can improve that as we go. I've pushed some NFC improvements and a rebase, let's run a final test and merge this: |
Actually, let's merge this manually when performance tests on |
Canceled. |
bors try |
tryBuild failed: |
Well, that's interesting:
@Ellipse0934 did you run into any of these? |
No, I tested on this branch again to be sure and all tests have passed successfully. My tests: https://gist.github.com/Ellipse0934/2db29bc0602641c4cccd58aadb64fd93 |
Well, there's definitely something up, although it looks like it only manifests on recent hardware (sm_70 or higher). Looks like kokkos/kokkos#1958, with cuda-memcheck pointing to the warp sync:
Running with synccheck:
|
value = if shuffle | ||
scan_warp(op, value, lane) | ||
else | ||
cache[thread] = value |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Try adding a sync_warp()
here. I didn't get any errors wrt sync_check/race_check so I purposely left it out even though it's wrong. After L116
91db6b0
to
06fe10b
Compare
i'd love to try this PR out as |
a1af3c1
to
46f6109
Compare
5d585c4
to
c850163
Compare
No description provided.