Reductions on CLArrays seem to be almost 100x slower than the same reductions on Base arrays (this is with the pocl CPU backend):
julia> using OpenCL, pocl_jll, BenchmarkTools
julia> X = rand(Float32, 1000, 1000);
julia> X′ = CLArray(X);
julia> @benchmark sum(X; dims = 1)
BenchmarkTools.Trial: 10000 samples with 1 evaluation per sample.
Range (min … max): 58.380 μs … 922.617 μs ┊ GC (min … max): 0.00% … 0.00%
Time (median): 90.710 μs ┊ GC (median): 0.00%
Time (mean ± σ): 103.658 μs ± 35.143 μs ┊ GC (mean ± σ): 0.00% ± 0.00%
▄▃▁ ▃▇█▃
▁▁▁▂▄█████████▆▆▆▅▅▄▄▃▂▃▃▃▃▄▃▃▂▂▂▂▃▃▃▃▃▂▂▂▂▂▂▂▂▂▂▂▁▁▁▁▁▁▁▁▁▁▁ ▃
58.4 μs Histogram: frequency by time 212 μs <
Memory estimate: 4.02 KiB, allocs estimate: 3.
julia> @benchmark OpenCL.synchronize(sum(X′; dims = 1))
BenchmarkTools.Trial: 653 samples with 1 evaluation per sample.
Range (min … max): 5.585 ms … 12.908 ms ┊ GC (min … max): 0.00% … 0.00%
Time (median): 7.424 ms ┊ GC (median): 0.00%
Time (mean ± σ): 7.604 ms ± 1.126 ms ┊ GC (mean ± σ): 0.00% ± 0.00%
▁▁▁▂▃▁▄█ ▁▇▇▃▄▄▄▃▄▂▃▂▃▁▁▃▁
▃▂▃▅▆██████████████████████████▇▆█▇▄▇▇█▆▅▇▄▁▄▅▄▄▃▃▅▄▃▂▂▂▂▃ ▅
5.59 ms Histogram: frequency by time 10.6 ms <
Memory estimate: 22.52 KiB, allocs estimate: 247.
Is there any low-hanging fruit in terms of optimizations here?
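To put a number on the "almost 100x" figure, here is a small sketch that computes the slowdown for each statistic straight from the two reports above. The values are copied from the benchmark output and converted to microseconds; the variable names `cpu` and `cl` are just illustrative.

```julia
# Timings copied from the two benchmark reports, all in microseconds.
cpu = (min = 58.380, median = 90.710, mean = 103.658)   # sum(X; dims = 1)
cl  = (min = 5585.0, median = 7424.0, mean = 7604.0)    # sum(X′; dims = 1) on pocl

# Slowdown of the CLArray reduction relative to Base, per statistic.
for stat in keys(cpu)
    ratio = cl[stat] / cpu[stat]
    println(stat, ": ", round(ratio; digits = 1), "x")
end
```

So the gap is roughly 96x on the minimum, 82x on the median, and 73x on the mean. The CLArray call also reports 247 allocations against 3 for Base.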