Prefix scan using warp primitives and fallback for non primitive types #287


Open · wants to merge 13 commits into master

Conversation

Ellipse0934 (Contributor)

No description provided.

@kshyatt added the enhancement (New feature or request) label on Jul 11, 2020
Ellipse0934 (Contributor, Author)

@maleadt
I've been having some thoughts on this PR, mainly that it does not reflect the true definition of a prefix scan. Scan is supposed to work with any binary associative operator and does not require an identity element, which our version does require. In our case we cannot get operators like xor, min, and max to work because no neutral element is available for them.
Outside the scope of Base.accumulate, our function should be just as capable as the Thrust version, which is flexible in this regard.
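
For reference, a minimal CPU sketch of the semantics being discussed (an inclusive scan that only assumes the operator is associative, seeding the accumulator from the first element so no identity is ever needed); the function name is purely illustrative:

# Inclusive scan over any associative `op`: the accumulator is seeded from
# A[1], so no neutral/identity element is required.
function inclusive_scan_reference(op, A::AbstractVector)
    B = similar(A)
    acc = A[1]
    B[1] = acc
    for i in 2:length(A)
        acc = op(acc, A[i])
        B[i] = acc
    end
    return B
end

inclusive_scan_reference(min, [3, 1, 4, 1, 5])  # == [3, 1, 1, 1, 1]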

maleadt (Member) commented Jul 15, 2020

Yeah, the same argument can be made for mapreduce. Requiring an identity does make these much easier to implement, though, as many textbook GPU implementations of these reductions assume + and 0.
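
As an illustration of why: a minimal warp-scan sketch (not this PR's code) in which a neutral element lets every lane combine unconditionally, assuming CUDA.jl's shfl_up_sync and laneid intrinsics:

# Hillis–Steele style warp scan. Lanes whose source lane (lane - d) does not
# exist simply combine with `neutral`, so no special-casing is needed —
# which is exactly what + with 0 buys you.
@inline function warp_inclusive_scan(op, val, neutral)
    lane = laneid()                                # 1-based lane index
    d = 1
    while d < 32
        other = shfl_up_sync(0xffffffff, val, d)   # value held by lane - d
        val = op(lane > d ? other : neutral, val)  # out-of-range lanes use neutral
        d <<= 1
    end
    return val
end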

Ellipse0934 (Contributor, Author)

It no longer requires a neutral element, and it now computes an exclusive scan, which it didn't earlier.

The new features need more tests, but I'm not sure what kind of tests to add. Should I write a CPU version and compare against that, or can I construct special cases and check for specific values?
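
One possible angle (just a sketch, assuming an init value of 0 for the exclusive case): the exclusive result is the inclusive result shifted right by one with the init prepended, so either form can be checked against Base.accumulate or a trivial CPU loop:

A = [2, 3, 4]
inclusive = accumulate(+, A)          # [2, 5, 9]
exclusive = [0; inclusive[1:end-1]]   # [0, 2, 5] — inclusive shifted right, init prepended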

Ellipse0934 (Contributor, Author)

Fixed, no races detected anymore. The error was in my usage of @cuDynamicSharedMem.
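
For context, a generic sketch (not this PR's kernel) of how @cuDynamicSharedMem is typically used — a classic pitfall is forgetting the byte offset when carving several arrays out of one dynamic shared-memory allocation:

using CUDA

# Two views into one dynamic shared-memory allocation; without the explicit
# byte offset for the second array, the two would alias and race.
function demo_kernel(out::CuDeviceArray{T}) where {T}
    n = blockDim().x
    vals  = @cuDynamicSharedMem(T, n)                     # bytes [0, n*sizeof(T))
    flags = @cuDynamicSharedMem(Int32, n, n * sizeof(T))  # starts right after `vals`
    i = threadIdx().x
    @inbounds vals[i]  = out[i]
    @inbounds flags[i] = Int32(0)
    sync_threads()
    @inbounds out[i] = vals[i]
    return
end

# the launch must reserve dynamic shared memory for both arrays, e.g.
# @cuda threads=n shmem=n*(sizeof(Float32)+sizeof(Int32)) demo_kernel(out)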

# new version
julia> A = CUDA.ones(Int128, 100_000_000);

julia> B = similar(A);

julia> CUDA.@time CUDA.scan!(+, B, A, dims=1);
  0.152922 seconds (978 CPU allocations: 25.609 KiB) (2 GPU allocations: 2.654 MiB, 0.14% gc time of which 94.00% spent allocating)


julia> A = CUDA.ones(Float32, 100_000_000);

julia> B = similar(A);

julia> CUDA.@time CUDA.scan!(+, B, A);
  0.036084 seconds (616 CPU allocations: 19.922 KiB) (2 GPU allocations: 381.848 KiB, 0.03% gc time)
# current master branch
julia> A = CUDA.ones(Int128, 100_000_000);

julia> B = similar(A);

julia> CUDA.@time CUDA.scan!(+, B, A, dims=1);
  0.077795 seconds (743 CPU allocations: 24.125 KiB) (2 GPU allocations: 1.492 MiB, 0.35% gc time of which 95.43% spent allocating)


julia> A = CUDA.ones(Float32, 100_000_000);

julia> B = similar(A);

julia> CUDA.@time CUDA.scan!(+, B, A, dims=1);
  0.044769 seconds (630 CPU allocations: 20.281 KiB) (2 GPU allocations: 381.848 KiB, 0.33% gc time of which 91.34% spent allocating)

The cost of sync_warp() is too high, yet it is essential here to establish a memory fence. Maybe the current Blelloch scan should remain the default fallback. This PR also needs additional tests for the exclusive scan.

@Ellipse0934 marked this pull request as draft on July 23, 2020
Ellipse0934 (Contributor, Author)

The reason for the poor performance turned out to be a missing @inbounds. Performance is now close to the Blelloch version, albeit about 7% slower.
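
Generic illustration, not this PR's code: when the index is already guarded, the bounds checks Julia would otherwise emit inside a kernel are pure overhead (extra branches plus an exception path), which is what @inbounds removes:

using CUDA

# The guard already guarantees `i` is in range, so the checks on the
# load/store below would only add branches; @inbounds elides them.
function copy_kernel(output, input)
    i = (blockIdx().x - 1) * blockDim().x + threadIdx().x
    if i <= length(input)
        @inbounds output[i] = input[i]
    end
    return
end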

#current master

julia> @benchmark CUDA.@sync(CUDA.scan!(+, B, A; dims = 1)) setup=(A = CUDA.ones(Int128, 100_000_000); B = similar(A))
BenchmarkTools.Trial: 
  memory estimate:  17.81 KiB
  allocs estimate:  555
  --------------
  minimum time:     70.224 ms (0.00% GC)
  median time:      72.690 ms (0.00% GC)
  mean time:        73.293 ms (0.00% GC)
  maximum time:     77.850 ms (0.00% GC)
  --------------
  samples:          29
  evals/sample:     1


# new implementation

julia> @benchmark CUDA.@sync(CUDA.scan!(+, B, A; dims = 1)) setup=(A = CUDA.ones(Int128, 100_000_000); B = similar(A))
BenchmarkTools.Trial: 
  memory estimate:  20.39 KiB
  allocs estimate:  646
  --------------
  minimum time:     77.801 ms (0.00% GC)
  median time:      78.209 ms (0.00% GC)
  mean time:        79.301 ms (0.00% GC)
  maximum time:     85.532 ms (0.00% GC)
  --------------
  samples:          28
  evals/sample:     1

Is it worth keeping Blelloch for this minor improvement?

How do I write tests for the exclusive version? Since accumulate does not have an exclusive parameter and the current tests compare against a Base algorithm, should I write a CPU version in test/array.jl and test against that?
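
One shape such a reference could take (a sketch only; the helper name and the init keyword are hypothetical, and how the exclusive GPU path is invoked is left out):

# CPU reference: exclusive scan seeded with `init`, to compare GPU results against.
function exclusive_scan_reference(op, A::AbstractVector; init)
    B = similar(A)
    acc = init
    for i in eachindex(A)
        B[i] = acc
        acc = op(acc, A[i])
    end
    return B
end

# e.g. compare Array(<exclusive GPU scan of A>) against
# exclusive_scan_reference(+, A; init = 0)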

@Ellipse0934 marked this pull request as ready for review on August 2, 2020
Ellipse0934 (Contributor, Author) commented Aug 2, 2020

Turns out I had accidentally removed the @inbounds from the aggregate_partial_scan function. With it restored, the new implementation now performs much better than the current version in both cases.

julia> @benchmark CUDA.@sync(CUDA.scan!(+, B, A; dims = 1)) setup=(A = CUDA.ones(Int128, 100_000_000); B = similar(A))
BenchmarkTools.Trial: 
  memory estimate:  20.27 KiB
  allocs estimate:  638
  --------------
  minimum time:     47.482 ms (0.00% GC)
  median time:      54.718 ms (0.00% GC)
  mean time:        55.158 ms (0.00% GC) # 1.4x faster
  maximum time:     59.364 ms (0.00% GC)
  --------------
  samples:          35
  evals/sample:     1

julia> @benchmark CUDA.@sync(CUDA.scan!(+, B, A; dims = 1)) setup=(A = CUDA.ones(Float32, 100_000_000); B = similar(A))
BenchmarkTools.Trial: 
  memory estimate:  19.80 KiB
  allocs estimate:  608
  --------------
  minimum time:     17.683 ms (0.00% GC)
  median time:      19.859 ms (0.00% GC)
  mean time:        19.917 ms (0.00% GC)  # 2x faster
  maximum time:     22.436 ms (0.00% GC)
  --------------
  samples:          152
  evals/sample:     1

maleadt (Member) commented Aug 13, 2020

I'm seeing smaller, but nonetheless nice improvements :-) I'm a bit wary that not all corner cases are adequately tested (e.g. exclusive=true), but we can improve that as we go. I've pushed some NFC improvements and a rebase; let's run a final test and merge this:
bors r+

bors bot added a commit that referenced this pull request Aug 13, 2020
287: Prefix scan using warp primitives and fallback for non primitive types r=maleadt a=Ellipse0934



Co-authored-by: Ellipse0934 <Ellipse0934@github.com>
Co-authored-by: Tim Besard <tim.besard@gmail.com>
@maleadt added the cuda array (Stuff about CuArray.) and performance (How fast can we go?) labels on Aug 13, 2020
maleadt (Member) commented Aug 13, 2020

Actually, let's merge this manually when performance tests on master have finished.
bors r-

bors bot (Contributor) commented Aug 13, 2020

Canceled.

maleadt (Member) commented Aug 13, 2020

bors try

bors bot added a commit that referenced this pull request Aug 13, 2020
bors bot (Contributor) commented Aug 13, 2020

try

Build failed:

maleadt (Member) commented Aug 13, 2020

Well, that's interesting:

  Test threw exception
  Expression: testf((x->begin
            accumulate(+, x)
        end), rand(Int128, n))
  CUDA error: an illegal instruction was encountered (code 715, ERROR_ILLEGAL_INSTRUCTION)

@Ellipse0934 did you run into any of these?

Ellipse0934 (Contributor, Author) commented Aug 13, 2020

No, I tested on this branch again to be sure and all tests have passed successfully. My tests: https://gist.github.com/Ellipse0934/2db29bc0602641c4cccd58aadb64fd93

maleadt (Member) commented Aug 14, 2020

Well, there's definitely something up, although it looks like it only manifests on recent hardware (sm_70 or higher). Looks like kokkos/kokkos#1958, with cuda-memcheck pointing to the warp sync:

array      (2) |        started at 2020-08-14T11:04:40.734
      From worker 2:    ========= CUDA-MEMCHECK
      From worker 2:    ========= Illegal Instruction
      From worker 2:    =========     at 0x00000030 in __cuda_sm70_warpsync
      From worker 2:    =========     by thread (31,0,0) in block (0,0,0)
      From worker 2:    =========     Device Frame:/home/tbesard/Julia/pkg/LLVM/src/interop/base.jl:53:_Z25julia_partial_scan__205832__13CuDeviceArrayI6Int128Li1E6GlobalES0_IS1_Li1ES2_Ev16CartesianIndicesILi1E5TupleI5OneToI5Int64EEES3_ILi0ES4_ES3_ILi0ES4_ES3_ILi2ES4_IS5_IS6_ES5_IS6_EEEv3ValILitrueEES7_ILifalseEE (_Z25julia_partial_scan__205832__13CuDeviceArrayI6Int128Li1E6GlobalES0_IS1_Li1ES2_Ev16CartesianIndicesILi1E5TupleI5OneToI5Int64EEES3_ILi0ES4_ES3_ILi0ES4_ES3_ILi2ES4_IS5_IS6_ES5_IS6_EEEv3ValILitrueEES7_ILifalseEE : 0x3f60)

Running with synccheck:

      From worker 2:    ========= Barrier error detected. Invalid arguments
      From worker 2:    =========     at 0x00000030 in __cuda_sm70_warpsync
      From worker 2:    =========     by thread (31,0,0) in block (0,0,0)
      From worker 2:    =========     Device Frame:__cuda_sm70_warpsync (__cuda_sm70_warpsync : 0x30)
      From worker 2:    =========     Device Frame:/home/tbesard/Julia/pkg/LLVM/src/interop/base.jl:53:_Z25julia_partial_scan__205822__13CuDeviceArrayI6Int128Li1E6GlobalES0_IS1_Li1ES2_Ev16CartesianIndicesILi1E5TupleI5OneToI5Int64EEES3_ILi0ES4_ES3_ILi0ES4_ES3_ILi2ES4_IS5_IS6_ES5_IS6_EEEv3ValILitrueEES7_ILifalseEE (_Z25julia_partial_scan__205822__13CuDeviceArrayI6Int128Li1E6GlobalES0_IS1_Li1ES2_Ev16CartesianIndicesILi1E5TupleI5OneToI5Int64EEES3_ILi0ES4_ES3_ILi0ES4_ES3_ILi2ES4_IS5_IS6_ES5_IS6_EEEv3ValILitrueEES7_ILifalseEE : 0x3f60)

Review comment on:

    value = if shuffle
        scan_warp(op, value, lane)
    else
        cache[thread] = value
Ellipse0934 (Contributor, Author) commented Aug 14, 2020

Try adding a sync_warp() here. I didn't get any errors from synccheck/racecheck, so I purposely left it out even though it's wrong. After L116
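
Roughly the shape being suggested (a sketch, not the branch's exact code): the store into shared memory on the non-shuffle path needs a warp-level sync before any lane reads another lane's slot:

# inside partial_scan's shared-memory fallback path (sketch)
if !shuffle
    @inbounds cache[thread] = value
    sync_warp()     # fence: make the store visible to the rest of the warp
    # ... reads of other lanes' cache slots can now follow safely
end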

@maleadt force-pushed the master branch 2 times, most recently from 91db6b0 to 06fe10b on January 8, 2021
bjarthur (Contributor) commented Feb 8, 2021

I'd love to try this PR out, as findall is my current bottleneck. Are there any plans to resolve the conflicts? I have a variety of GPUs I could help test with (V100, 2080 Ti, RTX 8000, M6000, 980 Ti, 1080 Ti).
