add custom sum tutorial #666
base: master
Conversation
Codecov Report

```diff
@@            Coverage Diff             @@
##           master     #666      +/-   ##
==========================================
+ Coverage   79.64%   79.69%   +0.04%
==========================================
  Files         122      122
  Lines        7356     7356
==========================================
+ Hits         5859     5862       +3
+ Misses       1497     1494       -3
```
Continue to review full report at Codecov.
Thanks for this! Amazing to see it make it into a PR right after the discourse post.
@ozmaden can you review this PR a bit? I would be interested in your feedback.
Could take a look at it over the weekend. I should warn though that I've got no deep understanding of CUDA, specifically referring to the original CUDA C tutorials etc.
@jw3126 Sorry for getting back later than anticipated. I read up on the theory of global vs shared memory, played with a bunch of tutorials, and tested your algorithms. On my machine I get the following results for the code as you provided it:
I believe there is a certain penalty for the atomic updates. I thought maybe tweaking the blocks and threads could do the trick. On this quest I stumbled upon the launch configuration API:

```julia
# Using the same function but with the configuration API
function sum_atomic_config(arr)
    out = CUDA.zeros(eltype(arr))
    # compile the kernel without launching it, then query a suggested launch shape
    kernel = @cuda launch=false sum_atomic_kernel(out, arr)
    config = launch_configuration(kernel.fun)
    threads = Base.min(length(arr), config.threads)
    blocks = cld(length(arr), threads)
    println("threads: $threads, blocks: $blocks")
    CUDA.@sync kernel(out, arr; threads=threads, blocks=blocks)
    out[]
end
```

Which failed remarkably.
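For context, here is a sketch of what the `sum_atomic_kernel` being launched above might look like. The PR's actual kernel is not quoted in this thread, so the grid-stride loop and the single atomic accumulator below are assumptions.

```julia
using CUDA

# Hypothetical reconstruction of `sum_atomic_kernel`: `out` is assumed to be a
# single-element (or zero-dimensional) CuArray holding the running total.
function sum_atomic_kernel(out, arr)
    index  = (blockIdx().x - 1) * blockDim().x + threadIdx().x
    stride = gridDim().x * blockDim().x
    for i in index:stride:length(arr)
        # every thread adds each of its elements straight into `out`,
        # so all threads contend on the same memory location
        CUDA.@atomic out[1] += arr[i]
    end
    return nothing
end
```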
As for the abysmal performance of …

As for configuring the …

There also might be some hope in the so-called threadFenceReduction CUDA sample. Apparently it solves the issue of needing two kernel launches. I stumbled upon it after googling and seeing this Stack Overflow question. I then searched for it locally amongst the .c, .cuh and .cu files. That's where I stopped on my journey down the rabbit hole.

BTW, is there a specific reason why you don't interpolate …?
Thanks a lot for looking into this!
On my machine, depending on how you choose blocks and threads, …
Yes, this is far too high. Ideally, in the atomic sum each thread does lots of accumulations "privately" and there are only very few atomic operations.
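A sketch of that idea (the kernel name `sum_atomic_private_kernel` is hypothetical, not code from the PR): each thread sums its slice into a local variable and issues only one atomic add at the end.

```julia
using CUDA

function sum_atomic_private_kernel(out, arr)
    index  = (blockIdx().x - 1) * blockDim().x + threadIdx().x
    stride = gridDim().x * blockDim().x
    acc = zero(eltype(out))
    for i in index:stride:length(arr)
        acc += arr[i]              # private accumulation, no contention
    end
    CUDA.@atomic out[1] += acc     # a single atomic per thread
    return nothing
end
```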
No particular reason. I usually avoid the …
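Assuming the interpolation question refers to BenchmarkTools' `$` syntax (the inline reference above is truncated, so this is a guess), the difference looks like this; `sum_atomic` stands in for the PR's function.

```julia
using BenchmarkTools, CUDA

arr = CUDA.rand(2^20)

# without `$`, `arr` is treated as a non-constant global inside the
# benchmark expression, which can add overhead and noise
@benchmark sum_atomic(arr)

# with `$`, the value is spliced in and benchmarked directly
@benchmark sum_atomic($arr)
```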
I think the current docs do a great job at explaining how to write map-like algorithms. But I have some trouble with reductions. So I am trying to write this tutorial, to help myself and others.
I would also like to add some algorithms that use shared memory, like in this tutorial. But I could not get it to work using CUDA.jl. I posted my attempt on discourse.
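For reference, here is a rough sketch of the kind of block-level shared-memory reduction that tutorial describes. This is not the PR's tutorial code: the names `sum_shared_kernel` and `sum_shared`, the fixed block size of 256, the use of `CuStaticSharedArray` (recent CUDA.jl), and the final atomic combine of the per-block results are all assumptions.

```julia
using CUDA

# Each block reduces up to 256 elements in shared memory; thread 1 then
# combines the block's partial sum into `out` with an atomic add.
function sum_shared_kernel(out, arr)
    tid    = threadIdx().x
    index  = (blockIdx().x - 1) * blockDim().x + threadIdx().x
    shared = CuStaticSharedArray(eltype(out), 256)

    # load one element per thread, zero-padding past the end of `arr`
    shared[tid] = index <= length(arr) ? arr[index] : zero(eltype(out))
    sync_threads()

    # tree reduction within the block
    s = 128
    while s >= 1
        if tid <= s
            shared[tid] += shared[tid + s]
        end
        sync_threads()
        s ÷= 2
    end

    if tid == 1
        CUDA.@atomic out[1] += shared[1]
    end
    return nothing
end

function sum_shared(arr)
    out = CUDA.zeros(eltype(arr), 1)
    threads = 256                      # must match the static shared array size
    blocks  = cld(length(arr), threads)
    CUDA.@sync @cuda threads=threads blocks=blocks sum_shared_kernel(out, arr)
    CUDA.@allowscalar out[1]
end
```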