
Gather for CUDA support #8

Merged
merged 1 commit into from
May 24, 2021
Conversation

yuehhua
Member

@yuehhua yuehhua commented May 3, 2021

No description provided.

Update src/gather.jl

Co-authored-by: Carlo Lucibello <carlo.lucibello@gmail.com>
replace _check_dims as gather_check_dims

fix

update manifest.toml
@CarloLucibello
Member

I see that in ScatterNNlib the CUDA implementation is the same as the CPU one. Wasn't it performing scalar indexing, though? Did you compare performance with the implementation here?

@yuehhua
Member Author

yuehhua commented May 23, 2021

Yeah, they have the same implementation. The key difference here is that the scalar indexing is moved inside a CUDA kernel. The kernel is crucial for leveraging CUDA threads to accelerate the computation. The scalar-indexing warning is triggered only when a CuArray is indexed outside a kernel. I can certainly provide some benchmarks comparing the two.
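To illustrate the point about moving scalar indexing inside a kernel, here is a minimal sketch using CUDA.jl's `@cuda` macro. The names `gather_kernel!` and `gather_gpu!` are hypothetical, for illustration only, and this is not the PR's actual implementation (which handles multi-dimensional `gather!`):

```julia
using CUDA

# Scalar indexing is legal *inside* a kernel: each GPU thread reads and
# writes a single element, so all elements are processed in parallel.
function gather_kernel!(dst, src, idx)
    i = threadIdx().x + (blockIdx().x - 1) * blockDim().x
    if i <= length(idx)
        @inbounds dst[i] = src[idx[i]]   # per-thread scalar read/write
    end
    return nothing
end

function gather_gpu!(dst::CuArray, src::CuArray, idx::CuArray)
    threads = 256
    blocks = cld(length(idx), threads)
    @cuda threads=threads blocks=blocks gather_kernel!(dst, src, idx)
    return dst
end
```

By contrast, calling the generic CPU `gather!` on CuArrays indexes the arrays one element at a time from the host, triggering a device-to-host round trip per element, which is what the scalar-indexing warning flags.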

@yuehhua
Member Author

yuehhua commented May 23, 2021

If I benchmark the CPU version but with CuArrays, the scalar-operations warning is triggered.

julia> using NNlib

julia> using BenchmarkTools

julia> using CUDA

julia> T = Float32
Float32

julia> src = cu(T[3, 4, 5, 6, 7]);

julia> index = cu(rand([1,2,3,4,5], 1000, 100));

julia> output = cu(zeros(T, 1000, 100));

julia> NNlib.gather!(output, src, index);
┌ Warning: Performing scalar operations on GPU arrays: This is very slow, consider disallowing these operations with `allowscalar(false)`
└ @ GPUArrays ~/.julia/packages/GPUArrays/4n0iS/src/host/indexing.jl:64

julia> @benchmark NNlib.gather!($output, $src, $index)
BenchmarkTools.Trial: 
  memory estimate:  42.72 MiB
  allocs estimate:  1000000
  --------------
  minimum time:     1.413 s (0.82% GC)
  median time:      1.481 s (1.10% GC)
  mean time:        1.498 s (1.79% GC)
  maximum time:     1.616 s (4.65% GC)
  --------------
  samples:          4
  evals/sample:     1

With the same setting but the CUDA version of gather, we get much better performance.

julia> using NNlib, NNlibCUDA

julia> using BenchmarkTools

julia> using CUDA

julia> T = Float32
Float32

julia> src = cu(T[3, 4, 5, 6, 7]);

julia> index = cu(rand([1,2,3,4,5], 1000, 100));

julia> output = cu(zeros(T, 1000, 100));

julia> NNlib.gather!(output, src, index);

julia> @benchmark NNlib.gather!($output, $src, $index)
BenchmarkTools.Trial: 
  memory estimate:  272 bytes
  allocs estimate:  10
  --------------
  minimum time:     1.713 μs (0.00% GC)
  median time:      2.633 μs (0.00% GC)
  mean time:        2.600 μs (0.00% GC)
  maximum time:     10.720 μs (0.00% GC)
  --------------
  samples:          10000
  evals/sample:     10

The difference in scale is 1.5 s versus 2.6 μs, roughly six orders of magnitude.
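A practical note on reproducing these numbers: the warning in the first benchmark can be escalated to a hard error so that any accidental fallback to the slow scalar path is caught immediately. This uses the real `CUDA.allowscalar` / `CUDA.@allowscalar` API from CUDA.jl (the variable names below are illustrative):

```julia
using CUDA

# Turn the scalar-indexing warning into an error, so a benchmark that
# silently falls back to element-wise host indexing fails fast instead.
CUDA.allowscalar(false)

# Opt back in locally when a scalar access is genuinely intended:
x = cu(Float32[1, 2, 3])
first_elem = CUDA.@allowscalar x[1]
```

With scalar indexing disallowed, the first benchmark above would error rather than run for 1.5 s, making it obvious which code path is being exercised.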

@CarloLucibello
Member

I just don't understand why the gather GPU benchmark in ScatterNNlib shows good performance, while the code I see here
https://github.com/yuehhua/ScatterNNlib.jl/blob/master/src/cuda/gather.jl
performs scalar indexing and so should be slow.

@CarloLucibello
Member

Anyway, this PR looks good, merging it.

@CarloLucibello CarloLucibello merged commit 2a4b220 into FluxML:master May 24, 2021
@CarloLucibello
Member

@yuehhua are there any more missing features preventing NNlib's scatter/gather from being used in GeometricFlux?
