add custom sum tutorial #666

Open

jw3126 wants to merge 4 commits into master
Conversation

jw3126 (Contributor) commented Jan 19, 2021

I think the current docs do a great job of explaining how to write map-like algorithms, but I have some trouble with reductions. So I am trying to write this tutorial, to help myself and others.
I would also like to add some algorithms that use shared memory, like in this tutorial, but I could not get that to work using CUDA.jl. I posted my attempt on Discourse.
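To make it concrete, here is a rough sketch of the kind of atomic reduction the tutorial builds up to. This is illustrative only, not the exact code in the PR; it uses a one-element accumulator array and an arbitrary block count:

```julia
using CUDA

# Each thread accumulates a private partial sum over a grid-stride loop,
# then contributes it with a single atomic add to a one-element accumulator.
function sum_atomic_kernel(out, arr)
    i = threadIdx().x + (blockIdx().x - 1) * blockDim().x
    stride = blockDim().x * gridDim().x
    acc = zero(eltype(out))
    while i <= length(arr)
        @inbounds acc += arr[i]
        i += stride
    end
    CUDA.@atomic out[1] += acc
    return nothing
end

function sum_atomic(arr)
    out = CUDA.zeros(eltype(arr), 1)
    # the block count here is an arbitrary placeholder, not a tuned value
    CUDA.@sync @cuda threads=256 blocks=56 sum_atomic_kernel(out, arr)
    Array(out)[1]
end
```

The point is that only one atomic add happens per thread, no matter how large the array is.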

codecov bot commented Jan 19, 2021

Codecov Report

Merging #666 (b72df2d) into master (6107a08) will increase coverage by 0.04%.
The diff coverage is n/a.

Impacted file tree graph

@@            Coverage Diff             @@
##           master     #666      +/-   ##
==========================================
+ Coverage   79.64%   79.69%   +0.04%     
==========================================
  Files         122      122              
  Lines        7356     7356              
==========================================
+ Hits         5859     5862       +3     
+ Misses       1497     1494       -3     
| Impacted Files | Coverage | Δ |
|---|---|---|
| lib/cudadrv/memory.jl | 82.51% <0.00%> | +0.44% ⬆️ |
| lib/curand/random.jl | 92.85% <0.00%> | +2.85% ⬆️ |

Continue to review full report at Codecov.

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 6107a08...b72df2d. Read the comment docs.

coezmaden left a comment


Thanks for this! Amazing to see it make it into a tutorial right after the Discourse post.

jw3126 (Contributor, Author) commented Feb 8, 2021

@ozmaden can you review this PR a bit? I would be interested in your feedback.

coezmaden commented

> @ozmaden can you review this PR a bit? I would be interested in your feedback.

I could take a look at it over the weekend. I should warn, though, that I have no deep understanding of CUDA; my background is mostly the original CUDA C tutorials etc.

coezmaden commented

@jw3126 Sorry for getting back later than anticipated. I read up on the theory of global vs. shared memory, played with a bunch of tutorials, and tested your algorithms. On my machine I get the following results for the code as you provided it:

| CUDA.sum | sum_baseline | sum_atomic | sum_shmem |
|---|---|---|---|
| 112.000 μs | 1.918 ms | 156.700 μs | 212.200 μs |

I believe there is a certain penalty for sum_shmem coming from the fact that it has to launch two kernels consecutively; at least on my machine that is the case. So, quite surprisingly, sum_atomic wins...?

I thought maybe tweaking the blocks and threads could do the trick. On this quest I stumbled upon the launch_configuration function, as discussed on Discourse here. So I wrote the following sum_atomic_config:

# Using the same function but with the configuration API
function sum_atomic_config(arr)
    out = CUDA.zeros(eltype(arr))
    kernel = @cuda launch=false sum_atomic_kernel(out, arr)
    config = launch_configuration(kernel.fun)
    threads = Base.min(length(arr), config.threads)
    blocks = cld(length(arr), threads)
    println("threads: $threads, blocks: $blocks")
    CUDA.@sync kernel(out, arr; threads=threads, blocks=blocks)
    out[]
end

Which failed remarkably.

| CUDA.sum | sum_atomic | sum_atomic_config |
|---|---|---|
| 112.000 μs | 156.700 μs | 1.925 ms |

As for the abysmal performance of sum_atomic_config: in the original Discourse post it is mentioned that launch_configuration provides the safest upper bound on the threads and blocks. On my machine I get "threads: 256, blocks: 3907", which could be too high.
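One tweak I haven't benchmarked (just a sketch, and it assumes the kernel covers the whole array with a grid-stride loop): keep the occupancy-derived thread count, but take the block count from launch_configuration as well instead of cld(length(arr), threads). config.blocks is roughly "enough blocks to keep the GPU busy", which is far smaller than one block per chunk of elements:

```julia
# hypothetical variant of sum_atomic_config: cap the grid at the occupancy-suggested size
function sum_atomic_capped(arr)
    out = CUDA.zeros(eltype(arr))
    kernel = @cuda launch=false sum_atomic_kernel(out, arr)
    config = launch_configuration(kernel.fun)
    threads = Base.min(length(arr), config.threads)
    # config.blocks saturates the GPU; a grid-stride loop covers the remaining elements
    blocks = Base.min(config.blocks, cld(length(arr), threads))
    println("threads: $threads, blocks: $blocks")
    CUDA.@sync kernel(out, arr; threads=threads, blocks=blocks)
    out[]
end
```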

As for configuring sum_shmem: I couldn't quite wrap my head around it. Because of the shared memory, the blocks and threads have to be known beforehand, so the configuration step doesn't really work: it has to come first, but the shared memory has to be sized first. It becomes a chicken-or-egg problem.
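If I read the occupancy API correctly, launch_configuration also accepts shmem as a function of the thread count, which might be a way out of the chicken-or-egg problem: let the occupancy calculator pick the threads, then size the shared memory from that. Below is a minimal sketch; partial_sum_kernel is a stand-in I wrote for illustration, not the PR's sum_shmem kernel, and rounding the block size down to a power of two is just to keep the tree reduction simple:

```julia
using CUDA

# stand-in kernel: each block reduces its grid-stride slice in dynamic shared memory
# and writes one partial sum per block
function partial_sum_kernel(partial, arr)
    T = eltype(arr)
    tid = threadIdx().x
    shared = CuDynamicSharedArray(T, blockDim().x)   # older CUDA.jl versions call this @cuDynamicSharedMem

    acc = zero(T)
    i = tid + (blockIdx().x - 1) * blockDim().x
    while i <= length(arr)
        @inbounds acc += arr[i]
        i += blockDim().x * gridDim().x
    end
    shared[tid] = acc
    sync_threads()

    s = blockDim().x ÷ 2                 # tree reduction, assumes a power-of-two block size
    while s > 0
        if tid <= s
            shared[tid] += shared[tid + s]
        end
        sync_threads()
        s ÷= 2
    end
    if tid == 1
        partial[blockIdx().x] = shared[1]
    end
    return nothing
end

function sum_shmem_config(arr)
    T = eltype(arr)
    kernel = @cuda launch=false partial_sum_kernel(similar(arr, 0), arr)
    # the shared-memory requirement is expressed as a function of the thread count
    config = launch_configuration(kernel.fun; shmem = threads -> threads * sizeof(T))
    threads = prevpow(2, Base.min(length(arr), config.threads))
    blocks = Base.min(config.blocks, cld(length(arr), threads))
    partial = CUDA.zeros(T, blocks)
    CUDA.@sync kernel(partial, arr; threads=threads, blocks=blocks, shmem=threads * sizeof(T))
    sum(partial)   # the second reduction, over one value per block
end
```

No idea whether this beats the hand-written two-kernel version, but at least it removes the need to fix the thread count before sizing the shared memory.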

There might also be some hope in the so-called threadFenceReduction CUDA sample; apparently it solves the issue of needing two kernels. I stumbled upon it after googling and finding this Stack Overflow question, and then searched for it locally amongst the .c, .cuh, and .cu files. That's where I stopped on my journey down the rabbit hole.

BTW, is there a specific reason why you don't interpolate @btime inputs with a $? I noticed a slight difference with the $, though it might be insignificant since it is on the order of ~10 µs. Also, FYI, I use Windows 10 and have a GeForce GTX 1050 Ti.

jw3126 (Contributor, Author) commented Feb 20, 2021

> @jw3126 Sorry for getting back later than anticipated. I read up on the theory of global vs. shared memory, played with a bunch of tutorials, and tested your algorithms.

Thanks a lot for looking into this!

> On my machine I get the following results for the code as you provided it:
>
> | CUDA.sum | sum_baseline | sum_atomic | sum_shmem |
> |---|---|---|---|
> | 112.000 μs | 1.918 ms | 156.700 μs | 212.200 μs |

> I believe there is a certain penalty for sum_shmem coming from the fact that it has to launch two kernels consecutively; at least on my machine that is the case. So, quite surprisingly, sum_atomic wins...?

> I thought maybe tweaking the blocks and threads could do the trick.

On my machine, depending on how you choose blocks and threads, sum_atomic can also be faster. But yeah, I don't really understand it either. Also, my impression was that a kernel launch is much cheaper than 50 µs.
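A quick way to sanity-check that impression (hypothetical snippet, not part of the tutorial) is to time an empty kernel; as far as I know the launch itself is typically on the order of ten microseconds, though synchronization, and the WDDM driver on Windows, can add to that:

```julia
using CUDA, BenchmarkTools

noop_kernel() = nothing

# measures a launch plus synchronization of a kernel that does no work
@btime CUDA.@sync @cuda noop_kernel()
```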

> Which failed remarkably.
>
> | CUDA.sum | sum_atomic | sum_atomic_config |
> |---|---|---|
> | 112.000 μs | 156.700 μs | 1.925 ms |

> As for the abysmal performance of sum_atomic_config: in the original Discourse post it is mentioned that launch_configuration provides the safest upper bound on the threads and blocks. On my machine I get "threads: 256, blocks: 3907", which could be too high.

Yes, this is far too high. Ideally, in the atomic sum each thread does lots of accumulations "privately" and there are only very few atomic_add calls. With ~4k blocks there is a lot of contention over the atomic_add.
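Back-of-the-envelope, using your reported configuration and a hypothetical small grid for comparison:

```julia
# one atomic_add per thread, so contention scales with the total number of threads launched
atomics_occupancy_grid = 256 * 3907   # = 1_000_192 atomic adds with the launch_configuration grid
atomics_small_grid     = 256 * 56     # = 14_336 with a grid capped near the SM count (hypothetical)
```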

> BTW, is there a specific reason why you don't interpolate @btime inputs with a $? I noticed a slight difference with the $, though it might be insignificant since it is on the order of ~10 µs. Also, FYI, I use Windows 10 and have a GeForce GTX 1050 Ti.

No particular reason. I usually avoid the $ if I think call overhead should be negligible.
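For reference, the difference the $ makes in BenchmarkTools is just whether the input is treated as a non-constant global or as a local variable (a sketch, using the tutorial's sum_atomic and a made-up test array):

```julia
using CUDA, BenchmarkTools

arr = CUDA.rand(Float32, 1_000_000)

@btime sum_atomic(arr)    # `arr` is a non-constant global, so the call itself adds a little overhead
@btime sum_atomic($arr)   # `$` interpolates the value, timing the call as if `arr` were local
```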

kshyatt added the documentation (Improvements or additions to documentation) label on Mar 19, 2021
maleadt force-pushed the master branch 19 times, most recently from e293d87 to 885af2c on July 29, 2021