Prefix scan using warp primitives and fallback for non primitive types #287


Open · wants to merge 13 commits into master

Conversation

Ellipse0934 (Contributor)

No description provided.

@kshyatt added the enhancement (New feature or request) label on Jul 11, 2020
Ellipse0934 (Contributor, Author)

@maleadt
I've been having some thoughts on this PR, mainly that it does not reflect the true definition of a prefix scan. Scan is supposed to work with any binary associative operator and does not require an identity element, which our version does require. In our case we cannot get operators like xor, min, and max to work because no neutral element is available for them.
Outside the scope of Base.accumulate, our function should be just as capable as the Thrust version, which is flexible in this regard.
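
For reference, a minimal CPU sketch of the semantics being discussed (an inclusive scan that only assumes the operator is associative, seeding the accumulator from the first element so no identity is ever needed); the function name is purely illustrative:

# Inclusive scan over any associative `op`: the accumulator is seeded from
# A[1], so no neutral/identity element is required.
function inclusive_scan_reference(op, A::AbstractVector)
    B = similar(A)
    acc = A[1]
    B[1] = acc
    for i in 2:length(A)
        acc = op(acc, A[i])
        B[i] = acc
    end
    return B
end

inclusive_scan_reference(min, [3, 1, 4, 1, 5])  # == [3, 1, 1, 1, 1]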

maleadt (Member) commented Jul 15, 2020

Yeah, the same argument can be made for mapreduce. Requiring an identity does make these much easier to implement, though, as many textbook GPU implementations of these reductions assume + and 0.
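
As an illustration of why: a minimal warp-scan sketch (not this PR's code) in which a neutral element lets every lane combine unconditionally, assuming CUDA.jl's shfl_up_sync and laneid intrinsics:

# Hillis–Steele style warp scan. Lanes whose source lane (lane - d) does not
# exist simply combine with `neutral`, so no special-casing is needed —
# which is exactly what + with 0 buys you.
@inline function warp_inclusive_scan(op, val, neutral)
    lane = laneid()                                # 1-based lane index
    d = 1
    while d < 32
        other = shfl_up_sync(0xffffffff, val, d)   # value held by lane - d
        val = op(lane > d ? other : neutral, val)  # out-of-range lanes use neutral
        d <<= 1
    end
    return val
end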

Ellipse0934 (Contributor, Author)

It no longer requires a neutral element, and it now computes an exclusive scan, which it didn't earlier.

The new features need more tests, but I'm not sure what kind of tests to add. Should I write a CPU version and compare against that, or can I construct special cases and check for specific values?
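
One possible angle (just a sketch, assuming an init value of 0 for the exclusive case): the exclusive result is the inclusive result shifted right by one with the init prepended, so either form can be checked against Base.accumulate or a trivial CPU loop:

A = [2, 3, 4]
inclusive = accumulate(+, A)          # [2, 5, 9]
exclusive = [0; inclusive[1:end-1]]   # [0, 2, 5] — inclusive shifted right, init prepended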

Ellipse0934 (Contributor, Author)

Fixed, no races detected anymore. The error was in my usage of @cuDynamicSharedMem.
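
For context, a generic sketch (not this PR's kernel) of how @cuDynamicSharedMem is typically used — a classic pitfall is forgetting the byte offset when carving several arrays out of one dynamic shared-memory allocation:

using CUDA

# Two views into one dynamic shared-memory allocation; without the explicit
# byte offset for the second array, the two would alias and race.
function demo_kernel(out::CuDeviceArray{T}) where {T}
    n = blockDim().x
    vals  = @cuDynamicSharedMem(T, n)                     # bytes [0, n*sizeof(T))
    flags = @cuDynamicSharedMem(Int32, n, n * sizeof(T))  # starts right after `vals`
    i = threadIdx().x
    @inbounds vals[i]  = out[i]
    @inbounds flags[i] = Int32(0)
    sync_threads()
    @inbounds out[i] = vals[i]
    return
end

# the launch must reserve dynamic shared memory for both arrays, e.g.
# @cuda threads=n shmem=n*(sizeof(Float32)+sizeof(Int32)) demo_kernel(out)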

# new version
julia> A = CUDA.ones(Int128, 100_000_000);

julia> B = similar(A);

julia> CUDA.@time CUDA.scan!(+, B, A, dims=1);
  0.152922 seconds (978 CPU allocations: 25.609 KiB) (2 GPU allocations: 2.654 MiB, 0.14% gc time of which 94.00% spent allocating)


julia> A = CUDA.ones(Float32, 100_000_000);

julia> B = similar(A);

julia> CUDA.@time CUDA.scan!(+, B, A);
  0.036084 seconds (616 CPU allocations: 19.922 KiB) (2 GPU allocations: 381.848 KiB, 0.03% gc time)
# current master branch
julia> A = CUDA.ones(Int128, 100_000_000);

julia> B = similar(A);

julia> CUDA.@time CUDA.scan!(+, B, A, dims=1);
  0.077795 seconds (743 CPU allocations: 24.125 KiB) (2 GPU allocations: 1.492 MiB, 0.35% gc time of which 95.43% spent allocating)


julia> A = CUDA.ones(Float32, 100_000_000);

julia> B = similar(A);

julia> CUDA.@time CUDA.scan!(+, B, A, dims=1);
  0.044769 seconds (630 CPU allocations: 20.281 KiB) (2 GPU allocations: 381.848 KiB, 0.33% gc time of which 91.34% spent allocating)

The cost of sync_warp() is too high, yet it is essential here to establish a memory fence. Maybe the current Blelloch scan should remain the default fallback. This PR also needs additional tests for the exclusive scan.

@Ellipse0934 marked this pull request as draft on July 23, 2020
Ellipse0934 (Contributor, Author)

The reason for the poor performance turned out to be a missing @inbounds. Performance is now close to the Blelloch version, albeit about 7% slower.
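
Generic illustration, not this PR's code: when the index is already guarded, the bounds checks Julia would otherwise emit inside a kernel are pure overhead (extra branches plus an exception path), which is what @inbounds removes:

using CUDA

# The guard already guarantees `i` is in range, so the checks on the
# load/store below would only add branches; @inbounds elides them.
function copy_kernel(output, input)
    i = (blockIdx().x - 1) * blockDim().x + threadIdx().x
    if i <= length(input)
        @inbounds output[i] = input[i]
    end
    return
end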

#current master

julia> @benchmark CUDA.@sync(CUDA.scan!(+, B, A; dims = 1)) setup=(A = CUDA.ones(Int128, 100_000_000); B = similar(A))
BenchmarkTools.Trial: 
  memory estimate:  17.81 KiB
  allocs estimate:  555
  --------------
  minimum time:     70.224 ms (0.00% GC)
  median time:      72.690 ms (0.00% GC)
  mean time:        73.293 ms (0.00% GC)
  maximum time:     77.850 ms (0.00% GC)
  --------------
  samples:          29
  evals/sample:     1


# new implementation

julia> @benchmark CUDA.@sync(CUDA.scan!(+, B, A; dims = 1)) setup=(A = CUDA.ones(Int128, 100_000_000); B = similar(A))
BenchmarkTools.Trial: 
  memory estimate:  20.39 KiB
  allocs estimate:  646
  --------------
  minimum time:     77.801 ms (0.00% GC)
  median time:      78.209 ms (0.00% GC)
  mean time:        79.301 ms (0.00% GC)
  maximum time:     85.532 ms (0.00% GC)
  --------------
  samples:          28
  evals/sample:     1

Is it worth keeping Blelloch for this minor improvement?

How do I write tests for the exclusive version? Since accumulate does not have an exclusive parameter and the current tests compare against a Base algorithm, should I write a CPU version in test/array.jl and test against that?
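
One shape such a reference could take (a sketch only; the helper name and the init keyword are hypothetical, and how the exclusive GPU path is invoked is left out):

# CPU reference: exclusive scan seeded with `init`, to compare GPU results against.
function exclusive_scan_reference(op, A::AbstractVector; init)
    B = similar(A)
    acc = init
    for i in eachindex(A)
        B[i] = acc
        acc = op(acc, A[i])
    end
    return B
end

# e.g. compare Array(<exclusive GPU scan of A>) against
# exclusive_scan_reference(+, A; init = 0)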

@Ellipse0934 marked this pull request as ready for review on August 2, 2020
Ellipse0934 (Contributor, Author) commented Aug 2, 2020

Turns out I had accidentally removed the @inbounds from the aggregate_partial_scan function. With it restored, the new implementation now performs much better than the current version in both cases.

julia> @benchmark CUDA.@sync(CUDA.scan!(+, B, A; dims = 1)) setup=(A = CUDA.ones(Int128, 100_000_000); B = similar(A))
BenchmarkTools.Trial: 
  memory estimate:  20.27 KiB
  allocs estimate:  638
  --------------
  minimum time:     47.482 ms (0.00% GC)
  median time:      54.718 ms (0.00% GC)
  mean time:        55.158 ms (0.00% GC) # 1.4x faster
  maximum time:     59.364 ms (0.00% GC)
  --------------
  samples:          35
  evals/sample:     1

julia> @benchmark CUDA.@sync(CUDA.scan!(+, B, A; dims = 1)) setup=(A = CUDA.ones(Float32, 100_000_000); B = similar(A))
BenchmarkTools.Trial: 
  memory estimate:  19.80 KiB
  allocs estimate:  608
  --------------
  minimum time:     17.683 ms (0.00% GC)
  median time:      19.859 ms (0.00% GC)
  mean time:        19.917 ms (0.00% GC)  # 2x faster
  maximum time:     22.436 ms (0.00% GC)
  --------------
  samples:          152
  evals/sample:     1

maleadt (Member) commented Aug 13, 2020

I'm seeing smaller, but nonetheless nice improvements :-) I'm a bit wary that not all corner cases are adequately tested (e.g. exclusive=true), but we can improve that as we go. I've pushed some NFC improvements and a rebase; let's run a final test and merge this:
bors r+

bors bot added a commit that referenced this pull request Aug 13, 2020
287: Prefix scan using warp primitives and fallback for non primitive types r=maleadt a=Ellipse0934



Co-authored-by: Ellipse0934 <Ellipse0934@github.com>
Co-authored-by: Tim Besard <tim.besard@gmail.com>
@maleadt added the cuda array (Stuff about CuArray.) and performance (How fast can we go?) labels on Aug 13, 2020
maleadt (Member) commented Aug 13, 2020

Actually, let's merge this manually when performance tests on master have finished.
bors r-

bors bot (Contributor) commented Aug 13, 2020

Canceled.

maleadt (Member) commented Aug 13, 2020

bors try

bors bot added a commit that referenced this pull request Aug 13, 2020
bors bot (Contributor) commented Aug 13, 2020

try

Build failed:

maleadt (Member) commented Aug 13, 2020

Well, that's interesting:

  Test threw exception
  Expression: testf((x->begin
            accumulate(+, x)
        end), rand(Int128, n))
  CUDA error: an illegal instruction was encountered (code 715, ERROR_ILLEGAL_INSTRUCTION)

@Ellipse0934 did you run into any of these?

Ellipse0934 (Contributor, Author) commented Aug 13, 2020

No, I tested on this branch again to be sure and all tests have passed successfully. My tests: https://gist.github.com/Ellipse0934/2db29bc0602641c4cccd58aadb64fd93

maleadt (Member) commented Aug 14, 2020

Well, there's definitely something up, although it looks like it only manifests on recent hardware (sm_70 or higher). Looks like kokkos/kokkos#1958, with cuda-memcheck pointing to the warp sync:

array      (2) |        started at 2020-08-14T11:04:40.734
      From worker 2:    ========= CUDA-MEMCHECK
      From worker 2:    ========= Illegal Instruction
      From worker 2:    =========     at 0x00000030 in __cuda_sm70_warpsync
      From worker 2:    =========     by thread (31,0,0) in block (0,0,0)
      From worker 2:    =========     Device Frame:/home/tbesard/Julia/pkg/LLVM/src/interop/base.jl:53:_Z25julia_partial_scan__205832__13CuDeviceArrayI6Int128Li1E6GlobalES0_IS1_Li1ES2_Ev16CartesianIndicesILi1E5TupleI5OneToI5Int64EEES3_ILi0ES4_ES3_ILi0ES4_ES3_ILi2ES4_IS5_IS6_ES5_IS6_EEEv3ValILitrueEES7_ILifalseEE (_Z25julia_partial_scan__205832__13CuDeviceArrayI6Int128Li1E6GlobalES0_IS1_Li1ES2_Ev16CartesianIndicesILi1E5TupleI5OneToI5Int64EEES3_ILi0ES4_ES3_ILi0ES4_ES3_ILi2ES4_IS5_IS6_ES5_IS6_EEEv3ValILitrueEES7_ILifalseEE : 0x3f60)

Running with synccheck:

      From worker 2:    ========= Barrier error detected. Invalid arguments
      From worker 2:    =========     at 0x00000030 in __cuda_sm70_warpsync
      From worker 2:    =========     by thread (31,0,0) in block (0,0,0)
      From worker 2:    =========     Device Frame:__cuda_sm70_warpsync (__cuda_sm70_warpsync : 0x30)
      From worker 2:    =========     Device Frame:/home/tbesard/Julia/pkg/LLVM/src/interop/base.jl:53:_Z25julia_partial_scan__205822__13CuDeviceArrayI6Int128Li1E6GlobalES0_IS1_Li1ES2_Ev16CartesianIndicesILi1E5TupleI5OneToI5Int64EEES3_ILi0ES4_ES3_ILi0ES4_ES3_ILi2ES4_IS5_IS6_ES5_IS6_EEEv3ValILitrueEES7_ILifalseEE (_Z25julia_partial_scan__205822__13CuDeviceArrayI6Int128Li1E6GlobalES0_IS1_Li1ES2_Ev16CartesianIndicesILi1E5TupleI5OneToI5Int64EEES3_ILi0ES4_ES3_ILi0ES4_ES3_ILi2ES4_IS5_IS6_ES5_IS6_EEEv3ValILitrueEES7_ILifalseEE : 0x3f60)

Review comment on:

    value = if shuffle
        scan_warp(op, value, lane)
    else
        cache[thread] = value
Ellipse0934 (Contributor, Author) commented Aug 14, 2020

Try adding a sync_warp() here. I didn't get any errors from synccheck/racecheck, so I purposely left it out even though it's wrong. After L116
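
Roughly the shape being suggested (a sketch, not the branch's exact code): the store into shared memory on the non-shuffle path needs a warp-level sync before any lane reads another lane's slot:

# inside partial_scan's shared-memory fallback path (sketch)
if !shuffle
    @inbounds cache[thread] = value
    sync_warp()     # fence: make the store visible to the rest of the warp
    # ... reads of other lanes' cache slots can now follow safely
end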

@maleadt force-pushed the master branch 2 times, most recently from 91db6b0 to 06fe10b on January 8, 2021
bjarthur (Contributor) commented Feb 8, 2021

I'd love to try this PR out, as findall is my current bottleneck. Are there any plans to resolve the conflicts? I have a variety of GPUs I could help test with (V100, 2080 Ti, RTX 8000, M6000, 980 Ti, 1080 Ti).
