Gather for CUDA support #8
Conversation
Commits:
- Update src/gather.jl (Co-authored-by: Carlo Lucibello <carlo.lucibello@gmail.com>)
- replace _check_dims as gather_check_dims
- fix
- update manifest.toml
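For context, `gather` collects entries of `src` at the positions given by `index`. A small CPU-only sketch of the semantics (illustrative values, not code from this PR):

```julia
using NNlib

src = Float32[3, 4, 5, 6, 7]
index = [1 3; 2 5]          # 2×2 integer index matrix

# output[k] == src[index[k]] for every position k of index
NNlib.gather(src, index)
# 2×2 Matrix{Float32}:
#  3.0  5.0
#  4.0  7.0
```

This PR provides a GPU-side implementation of that operation for CuArrays.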
I see that in ScatterNNlib the CUDA implementation is the same as the CPU one. Wasn't it performing scalar indexing, though? Did you compare performance with the implementation here?
Yeah, they have the same implementation. The key difference here is that the scalar indexing is moved into a CUDA kernel. The kernel is crucial for leveraging the GPU threads to accelerate the computation; the scalar indexing warning is triggered when a CuArray is indexed outside a kernel. I can surely provide some benchmarks comparing the two.
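A minimal, hypothetical sketch of that idea (`gather_kernel!` and `gather1d!` are made-up names, not the NNlibCUDA implementation): each GPU thread copies one output element, so the per-element indexing happens on the device rather than from the host.

```julia
using CUDA

# Each thread reads one index and copies one element of src into dst.
function gather_kernel!(dst, src, idx)
    i = threadIdx().x + (blockIdx().x - 1) * blockDim().x
    if i <= length(idx)
        @inbounds dst[i] = src[idx[i]]
    end
    return nothing
end

# Vector `src`, integer `idx`; `dst` has the same shape as `idx`.
function gather1d!(dst::CuArray, src::CuArray, idx::CuArray)
    threads = 256
    blocks = cld(length(idx), threads)
    @cuda threads=threads blocks=blocks gather_kernel!(dst, src, idx)
    return dst
end
```

For the 1000×100 index array used in the benchmarks below, this launches `cld(100_000, 256) == 391` blocks, and every thread performs exactly one read and one write.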
If I benchmark the CPU version but with CuArrays, the scalar operations warning is triggered:

```julia
julia> using NNlib
julia> using BenchmarkTools
julia> using CUDA
julia> T = Float32
Float32
julia> src = cu(T[3, 4, 5, 6, 7]);
julia> index = cu(rand([1,2,3,4,5], 1000, 100));
julia> output = cu(zeros(T, 1000, 100));
julia> NNlib.gather!(output, src, index);
┌ Warning: Performing scalar operations on GPU arrays: This is very slow, consider disallowing these operations with `allowscalar(false)`
└ @ GPUArrays ~/.julia/packages/GPUArrays/4n0iS/src/host/indexing.jl:64
julia> @benchmark NNlib.gather!($output, $src, $index)
BenchmarkTools.Trial:
memory estimate: 42.72 MiB
allocs estimate: 1000000
--------------
minimum time: 1.413 s (0.82% GC)
median time: 1.481 s (1.10% GC)
mean time: 1.498 s (1.79% GC)
maximum time: 1.616 s (4.65% GC)
--------------
samples: 4
evals/sample: 1
```

If we keep the same setting but use the CUDA version:

```julia
julia> using NNlib, NNlibCUDA
julia> using BenchmarkTools
julia> using CUDA
julia> T = Float32
Float32
julia> src = cu(T[3, 4, 5, 6, 7]);
julia> index = cu(rand([1,2,3,4,5], 1000, 100));
julia> output = cu(zeros(T, 1000, 100));
julia> NNlib.gather!(output, src, index);
julia> @benchmark NNlib.gather!($output, $src, $index)
BenchmarkTools.Trial:
memory estimate: 272 bytes
allocs estimate: 10
--------------
minimum time: 1.713 μs (0.00% GC)
median time: 2.633 μs (0.00% GC)
mean time: 2.600 μs (0.00% GC)
maximum time: 10.720 μs (0.00% GC)
--------------
samples: 10000
evals/sample: 10
```

The difference in scale is 1.5 s versus 2.6 μs, i.e. more than five orders of magnitude.
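One aside on reading these numbers (not something raised in the thread): CUDA kernel launches are asynchronous, so for tighter timings it is common to synchronize inside the benchmarked expression and to disable scalar indexing outright. A sketch, assuming the same setup as above:

```julia
using CUDA, BenchmarkTools
using NNlib, NNlibCUDA

# Turn accidental scalar indexing into an error instead of a silent slowdown.
CUDA.allowscalar(false)

src    = cu(Float32[3, 4, 5, 6, 7])
index  = cu(rand([1, 2, 3, 4, 5], 1000, 100))
output = cu(zeros(Float32, 1000, 100))

# Synchronize inside the timed expression so the measurement covers
# kernel execution, not just the asynchronous launch.
@benchmark CUDA.@sync(NNlib.gather!($output, $src, $index))
```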
I just don't understand why the gather GPU benchmark in ScatterNNlib shows good performance, given the code I see here.
Anyway, this PR looks good, merging it.
@yuehhua are there any more missing features preventing NNlib's scatter/gather from being used in GeometricFlux? |