
add gather #280

Merged: merged 3 commits, Mar 12, 2021

Conversation

@CarloLucibello (Member) commented Feb 24, 2021

easiest part of #255

(Two resolved review threads on src/gather.jl)
@yuehhua (Member) commented Feb 24, 2021

@chengchingwen You may want to check it.

@CarloLucibello (Member, Author)

This should be ready to go.

@CarloLucibello (Member, Author)

I'm going to merge this as soon as tests pass so that @yuehhua can continue his work (adding AD rules, adding cuda support).
I'll open a PR for a final api discussion on both gather and scatter

@DhairyaLGandhi (Member)

I'd expect to do a review of the implementation before merging these operations into NNlib.

@CarloLucibello (Member, Author)

> I'd expect to do a review of the implementation before merging these operations into NNlib.

I'll add placeholder comments in the new PR so that the implementation can be easily reviewed.

@DhairyaLGandhi (Member)

As I see it, the implementation is not following the TF semantics, is that right?

I'd rather have reviews before merging.

@DhairyaLGandhi (Member) commented Mar 5, 2021

There seems to be a lot of allocation going on; we could alternatively let Julia figure out the indexing for us:

function gather(arr, inds)
  # note: indexes arr along dim 2, so a vector src must be passed transposed (src')
  s = size(arr, 1)
  g1 = similar(arr, 2s, size(inds, 1))
  g2 = similar(g1)
  for k in 1:2, j in axes(inds, 1) # matrix case
    ii = inds[j, k]
    o1 = (k-1)*s
    o2 = (2-k)*s
    for i in axes(arr, 1)
      g = arr[i, ii]
      g1[i + o1, j] = g
      g2[i + o2, j] = g
    end
  end
  g1, g2
end

So this version would not allocate quite as much, and we can use column copies instead of row copies (both should be picking up contiguous memory if the Cartesian indices line up), which would be more Julian.

Also leads to a sizeable perf difference for small cases (and maybe for large ones). This is only generating the first half, but we can make it do the other as well of course.

julia> src = [3, 4, 5, 6, 7];

julia> index = [1 2 3 4;
                    4 2 1 3;
                    3 5 5 3][:,:,1:1]

julia> @btime gather($(src), $index);
  2.905 μs (37 allocations: 1.11 KiB)

# this one
julia> @btime gather($(src'), $index);
  131.193 ns (2 allocations: 256 bytes)

@CarloLucibello (Member, Author)

> As I see it, the implementation is not following the TF semantics, is that right?

I think it does follow the TF semantics
(https://www.tensorflow.org/api_docs/python/tf/gather),
although our implementation is less general (we only support their defaults for batch_dims and axis).
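For context, a minimal sketch of the semantics being discussed (hypothetical code, not the PR's implementation): with the default batch_dims and axis behaviour, each output slice is the slice of src along its last dimension selected by the corresponding entry of idx.

```julia
# Hypothetical sketch of the gather semantics under discussion, not NNlib's
# actual code: select slices of `src` along its last dimension using `idx`.
function gather_sketch(src::AbstractArray{T,N},
                       idx::AbstractArray{<:Integer}) where {T,N}
    colons = ntuple(_ -> Colon(), N - 1)  # keep all leading dimensions
    # output has size (size(src)[1:N-1]..., size(idx)...)
    return src[colons..., idx]
end
```

For example, `gather_sketch([3, 4, 5, 6, 7], [1 2; 4 5])` returns the matrix `[3 4; 6 7]`.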

> I'd rather have reviews before merging.

Yeah, ok, but as usual... please be quicker and avoid blocking other people's work all over the place.

@CarloLucibello (Member, Author) commented Mar 5, 2021

I don't understand what your implementation is doing (or why it returns two things), and I cannot reproduce your benchmark:

julia> function gather2(arr, inds)
  s = size(arr, 1)
  g1 = similar(arr, 2s, size(inds, 1))
  g2 = similar(g1)
  for k in 1:2, j in axes(inds, 1)
    ii = inds[j, k]
    o1 = (k-1)*s
    o2 = (2-k)*s
    for i in axes(arr, 1)
      g = arr[i, ii]
      g1[i + o1, j] = g
      g2[i + o2, j] = g
    end
  end
  g1, g2
end

julia> src = [3, 4, 5, 6, 7];

julia> index = [1 2 3 4;
                    4 2 1 3;
                    3 5 5 3][:,:,1:1];

julia> @btime gather($(src), $index);
  60.446 ns (1 allocation: 176 bytes)

julia> @btime gather2($(src), $index);
ERROR: BoundsError: attempt to access 5-element Vector{Int64} at index [1, 4]
Stacktrace:
  [1] getindex
    @ ./array.jl:802 [inlined]
  [2] gather2(arr::Vector{Int64}, inds::Array{Int64, 3})
    @ Main ./REPL[3]:10
  [3] var"##core#313"(src#310::Vector{Int64}, index#311::Array{Int64, 3})
    @ Main ~/.julia/packages/BenchmarkTools/eCEpo/src/execution.jl:371

I see just 1 allocation for the implementation here; are you benchmarking the right thing?
Also, your code doesn't appear to work in the general case of d-dimensional src and idx.
If you have anything working and actionable in the short term I can try to add it to this PR; otherwise, performance optimization can come later.

(Resolved review thread on src/gather.jl)
@DhairyaLGandhi (Member)

I mentioned in the snippet that it's for the matrix case, but it can easily be extended. Yes, I'm benchmarking the correct thing. It returns two things because that's how I needed it for a different operation; sorry, I should have removed that.

You'd notice you need to transpose src for this version.

@CarloLucibello (Member, Author) commented Mar 5, 2021

So I used the transpose, and on the previous example I get this:

julia> gather(src,index)
3×4×1 Array{Int64, 3}:
[:, :, 1] =
 3  4  5  6
 6  4  3  5
 5  7  7  5

julia> gather2(src',index)  #not sure how to interpret the output here
([3 6 5; 4 4 7], [4 4 7; 3 6 5])

julia> @btime gather($(src), $index);
  60.218 ns (1 allocation: 176 bytes)

julia> @btime gather2($(src'), $index);
  68.138 ns (2 allocations: 256 bytes)

@DhairyaLGandhi (Member) commented Mar 5, 2021

For the general case, plain Julia indexing (it might be better for longer-term use to benefit from Base improvements, if the perf isn't off, which I don't think it is):

function gatherweave2(arr, inds)
  g = arr[:, inds]
  v = reshape(g, length(g), 1) # vec(g)
  reshape(v, size(inds)...)
end

Huh, interesting, I definitely can't reproduce the 60 ns number; what Julia version are you on? More generally, I think this approach is fine. It would be good to remove the IntOrIntTuple type, though.

@CarloLucibello
Copy link
Member Author

> Huh, interesting, I definitely can't reproduce the 60 ns number - what Julia version are you on? More generally - I think this approach is fine

Yeah, it's Julia 1.6; they probably optimized the views.

> Be good to remove the IntOrIntTuple type, though.

It's used in a few places; it would be inconvenient to type out the union each time.

> For the general case - Julia indexing (it might be better for longer term use to benefit from Base improvements if the perf isn't off which I don't think it is)

I don't know why you have those reshapes there, but I can try to use indexing in gather when idx is an array of integers, instead of falling back to gather!.

@CarloLucibello (Member, Author)

On the usual small example, indexing and the current gather perform the same:

julia> function gather3(src::AbstractArray{Tsrc, Nsrc}, 
                 idx::AbstractArray{<:Integer}) where {Tsrc, Nsrc}
    colons = ntuple(i -> Colon(), Nsrc-1)
    return src[colons..., idx]
end

julia> @btime gather3($(src), $index);
  54.328 ns (1 allocation: 176 bytes)

julia> @btime gather($(src), $index);
  54.847 ns (1 allocation: 176 bytes)

I did a small tweak to gather that shaves off a few nanoseconds.

@DhairyaLGandhi (Member)

We could potentially allow Cartesian indices in idx as well.

@CarloLucibello (Member, Author)

> We could potentially allow Cartesian indices in idx as well.

We could. It would clutter the code a bit, though, since you can't splat idx[k]... for Cartesian indexes. I can't think of a concrete use case for it, so maybe we should postpone.
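As an illustration of the splatting point (hypothetical names, not code from this PR): a tuple index has to be splatted into getindex, while a CartesianIndex is accepted directly, which is why supporting both inline clutters the code.

```julia
# Hypothetical sketch: tuple indices need splatting, CartesianIndex does not.
slice_at(src, i::Integer) = src[:, i]
slice_at(src, t::Tuple) = src[:, t...]        # splat required
slice_at(src, I::CartesianIndex) = src[:, I]  # getindex accepts it as-is

A = reshape(collect(1:24), 2, 3, 4)
slice_at(A, (2, 3)) == slice_at(A, CartesianIndex(2, 3))  # true
```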

Looking at the TensorFlow implementation,
https://www.tensorflow.org/api_docs/python/tf/gather
the main missing things seem to be the axis and batch_dims keywords. Those are not hard to implement, but we already have everything we need for FluxGeometric and more. Since a few pieces, AD and GPU support, still have to fall into place, I'd say let's merge this PR as it is, then do AD and GPU, and once everything has landed, look at the big picture and generalize.

@CarloLucibello (Member, Author)

bump

@CarloLucibello (Member, Author)

@yuehhua I'm merging this. Could you file a PR with the AD rules and another one with the GPU kernels?
The GPU kernels can go in the newly created NNlibCUDA.jl in this repo (see #286).

@CarloLucibello merged commit cf8cba0 into master on Mar 12, 2021
@yuehhua (Member) commented Mar 31, 2021

> We could potentially allow Cartesian indices in idx as well.

@DhairyaLGandhi Great suggestion! For performance, we can transform idx from Integer or Tuple type into Cartesian indices internally; that avoids having to store each index as multiple separate integers. For the interface, I think it is still good to accept integer or tuple indices in user-facing calls, as that is more intuitive for users.

@DhairyaLGandhi (Member)

From an API POV, that's correct. But if we generalise to Cartesian indices internally, we can accept both and have a common implementation, which would be cleaner too.
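A sketch of that normalization (illustrative only; names are hypothetical): convert user-facing integer or tuple indices to CartesianIndex once, so the internal gather path handles a single index type.

```julia
# Hypothetical normalization step: accept Integer, Tuple, or CartesianIndex
# from the user and convert everything to CartesianIndex internally.
to_cartesian(i::Integer) = CartesianIndex(i)
to_cartesian(t::Tuple{Vararg{Integer}}) = CartesianIndex(t)
to_cartesian(I::CartesianIndex) = I

idx = [(1, 2) (2, 1); (2, 2) (1, 1)]
cidx = to_cartesian.(idx)  # Matrix{CartesianIndex{2}}, usable directly in getindex
```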

@yuehhua (Member) commented Mar 31, 2021

Oh! Sure! 😊 We can do that!
