
Use view for RNN gate slice extraction #1761

Merged
merged 3 commits into master from bc/gate-view on Jan 16, 2022

Conversation

@ToucheSir
Member

ToucheSir commented Nov 4, 2021

This was originally passed over in #907, but I don't find the argument in that PR particularly compelling as the return value is only ever used once. Any negative impact on caching is going to happen anyhow during the slice materialization, so we might as well just let the subsequent fused broadcasts handle said materialization for us while reducing allocations.
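For context, the change boils down to returning a lazy view instead of an eager slice. Roughly (an illustrative sketch with made-up helper names, not the exact diff):

# Gate-range indexing is unchanged; only the matrix method stops copying.
gateinds(h::Int, n::Int) = (1:h) .+ h*(n-1)                       # rows of the n-th gate
gate_before(x::AbstractMatrix, h, n) = x[gateinds(h, n), :]       # materializes a copy
gate_after(x::AbstractMatrix, h, n)  = view(x, gateinds(h, n), :) # lazy; the fused
# broadcast that consumes it, e.g. σ.(gate_after(g, o, 1)), does the materialization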

Pinging @jeremiedb, @sdobber and @mkschleg if they have any interesting benchmarks to run this on. Otherwise I'll try to get something working with https://github.com/FluxML/Flux.jl/blob/master/perf/bench_utils.jl locally.

PR Checklist

  • [x] Tests are added
  • [ ] Entry in NEWS.md
  • [N/A] Documentation, if applicable
  • [N/A] API changes require approval from a committer (different from the author, if applicable)

@sdobber
Contributor

sdobber commented Nov 4, 2021

@ToucheSir I'm happy to run some benchmarks on this change, but I'm out traveling at the moment, so it will be a few days before I can get to that.

@jeremiedb
Contributor

@ToucheSir I ran some tests on mid-size LSTMs: 2 layers, 64 neurons, batch size 256.

Although modest, the proposal looks to bring some performance benefits:

CPU
master:
68.950850 seconds (7.97 M allocations: 62.804 GiB, 34.83% gc time)

this PR:
63.452862 seconds (8.97 M allocations: 59.036 GiB, 31.14% gc time)

GPU
master:
12.102949 seconds (26.59 M allocations: 1.690 GiB, 3.34% gc time)

this PR:
11.820140 seconds (26.77 M allocations: 1.641 GiB, 3.08% gc time)

I'm therefore all in to adopt this change.

@ToucheSir
Member Author

ToucheSir commented Nov 6, 2021

Ok, my benchmarks are in too. The code, logs and raw data are all available at https://gist.github.com/ToucheSir/2fecbe99e5b304fed11e25e42c535cc3. Because there are so many dimensions to this, I've also created an interactive figure (partially for my own sanity) here for folks to peruse.

TL;DR: it's complicated. Edit: less complicated now; apparently Zygote hates the control flow introduced by @view, and switching to a plain view call changed things dramatically. New TL;DR: there is a measurable benefit on GPU, and CPU is a wash.

Here is the best-performing configuration (LSTM on GPU):
[figure: lstm_gpu]

And here is the worst (GRU on CPU):
[figure: gru_cpu]

Breaking things down further, almost all of the performance gains come from a much more efficient forward pass. This includes a noticeable reduction in both allocation count and memory allocated. Conversely, any performance loss seems to come from a slower backwards pass. However, I'm not sure why this PR would be slower there (sometimes, not always!). The allocations (count and KB) are identical across both versions, which makes sense given they use the exact same pullback. @mcabbott did you ever encounter this while working on the getindex or accum PRs for Zygote?

@mcabbott
Member

mcabbott commented Nov 6, 2021

I don't know why @view would slow down the backward pass; I would expect no change.

The backward is going to allocate a ton, since it'll make a full-size copy (mostly zero) every time you take a slice, and another to add each pair of those with accum. The second you could avoid with something like https://github.com/FluxML/Zygote.jl/pull/981/files#diff-cd0210083ce3136f79bee6ebca2bcca77f41a14f11b5a7a65ea1cc54803164c3R27-R34 (as an experiment, I mean! This is not safe in isolation.)

The first is harder. One way might be to replace individual calls to gate(x::AbstractMatrix, h, n) with one call that makes all of them. That could act like eachcol and allocate just one similar(x) to write everything into.

@ToucheSir
Member Author

ToucheSir commented Nov 6, 2021

x = rand(2, 3);

julia> Meta.@lower @view x[:, 2]
:($(Expr(:thunk, CodeInfo(
    @ none within `top-level scope'
1 ─      goto #3 if not true
2 ─ %2 = (view)(x, :, 2)
└──      return %2
3 ─      return false
))))

f(x) = sum(@view x[:, 1])

julia> @code_adjoint f(rand(3, 4))
Zygote.Adjoint(1: (%3, %4 :: Zygote.Context, %1, %2)
  br 3 unless true
2:
  %5 = Zygote._pullback(%4, view, %2, Main.:(:), 1)
  %6 = Base.getindex(%5, 1)
  %7 = Base.getindex(%5, 2)
  br 4 (%6, 1)
3:
  br 4 (false, 2)
4: (%8, %12 :: UInt8)
  %9 = Zygote._pullback(%4, Main.sum, %8)
  %10 = Base.getindex(%9, 1)
  %11 = Base.getindex(%9, 2)
  return %10, 1: (%1)
  %2 = @12 !== 0x01
  %3 = (@11)(%1)
  %4 = Zygote.gradindex(%3, 2)
  br 3 unless %2
  br 2
2:
  br 4 (nothing)
3:
  %5 = (@7)(%4)
  %6 = Zygote.gradindex(%5, 2)
  br 4 (%6)
4: (%7)
  %8 = Zygote.tuple(nothing, %7)
  return %8)

I'm not sure either, perhaps more work for the compiler? Profiling showed that the lion's share of time was spent in the generated Pullback function, but I thought that compilation would be cached?

One way might be to replace individual calls to gate(x::AbstractMatrix, h, n) with one call that makes all of them.

I was thinking about that yesterday as well. This monolithic kernel approach has been successfully done before, but it would require some reorganizing of the RNN cell internals to work. Having one in-place function and corresponding rrule (+ GPU methods) for all but the matmuls in https://github.com/FluxML/Flux.jl/blob/master/src/layers/recurrent.jl#L159-L165 would be very cool.
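To make that concrete, here is a very rough CPU-only sketch of what such a fused pointwise kernel could look like (hypothetical name, in-place, and missing the rrule and GPU method it would actually need):

using NNlib: sigmoid

# g = Wi*x .+ Wh*h .+ b has 4o rows (gate order: input, forget, cell, output);
# c is the previous cell state with o rows. Everything past the matmuls is pointwise.
function lstm_pointwise!(h′, c′, g::AbstractMatrix, c::AbstractMatrix, o::Int)
    @inbounds for j in axes(g, 2), i in 1:o
        input  = sigmoid(g[i, j])
        forget = sigmoid(g[i + o, j])
        cell   = tanh(g[i + 2o, j])
        output = sigmoid(g[i + 3o, j])
        c′[i, j] = forget * c[i, j] + input * cell
        h′[i, j] = output * tanh(c′[i, j])
    end
    return h′, c′
end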

@mcabbott
Member

mcabbott commented Nov 6, 2021

If I'm reading correctly, this gate thing takes non-overlapping blocks along the first dimension. So the simplest fused way might be just (untested!):

multigate(x::AbstractArray, h::Int) = ntuple(n -> gate(x,h,n), 4)

@adjoint multigate(x::AbstractArray, h::Int) =
  multigate(x, h), dy -> (vcat(dy...), nothing)

and then

  four = multigate(g, o)
  input = σ.(four[1])
  forget = σ.(four[2])
  cell = tanh.(four[3])
  output = σ.(four[4])

Agree there's scope for much more fusing. Like dense_bias_act, but more so.
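As a quick sanity check of that sketch (not from the thread; it assumes Flux's gate helper and the definitions above are in scope):

using Zygote

x = rand(Float32, 8, 5)   # four gates of height 2
dx, = gradient(x) do x
    a, b, c, d = multigate(x, 2)
    sum(a) + 2sum(b) + 3sum(c) + 4sum(d)
end
# dx should be 1 in rows 1:2, 2 in rows 3:4, 3 in rows 5:6 and 4 in rows 7:8,
# i.e. the single vcat in the custom adjoint reassembles the per-gate gradients.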

@mcabbott
Member

mcabbott commented Nov 7, 2021

Do the two matrix multiplications h = σ.(Wi*x .+ Wh*h .+ b) produce something the same size? Not e.g. matrix .+ vector I mean.

If so, then this is one way you could fuse things: https://github.com/FluxML/NNlib.jl/pull/346/files#diff-0e3febb41064ef9f892dd2b52c708e0497c548275cd543ec3f95cb947fa08b1cR6-R22

@ToucheSir
Member Author

Thanks, I'm running some benchmarks for multigate now.

There are actually even more opportunities for fusion in most RNN cells. For the LSTM and GRUv1, everything after the first 2 matmuls is pointwise ops, so combining them into one kernel is perfectly possible. That could then be made to operate in-place, with gradients provided by ForwardDiff or Enzyme.

@mcabbott
Member

mcabbott commented Nov 7, 2021

Here's a quick attempt at fusion for h′ = @. σ(output) * tanh(c′):

using Zygote, NNlib, BenchmarkTools
prodcast(f::Function, x::AbstractArray, g::Function, y::AbstractArray) = @. f(x) * g(y)
sumprodcast(f, x, g, y) = sum(prodcast(f, x, g, y))   # scalar wrapper used by the benchmarks below
a, b, c, d = (rand(100,100) for _ in 1:4);
@btime gradient(sumprodcast, tanh, $a, sigmoid, $b);  # 216.792 μs (79 allocations: 549.22 KiB)
@btime gradient(sumprodcast, identity, $a, tanh, $b); # 147.500 μs (75 allocations: 392.86 KiB)
@btime copy($a);                                       #   2.932 μs (2 allocations: 78.17 KiB)

import ChainRulesCore: rrule, derivatives_given_output, unthunk, NoTangent
derivatives_given_output(_, ::typeof(identity), _) = ((true,),)
function rrule(::typeof(prodcast), f, x, g, y)
    fx = f==identity ? x : f.(x)  # will need this to avoid re-computing tanh in the gradient
    gy = g==identity ? y : g.(y)
    size(x) == size(y) || throw("sizes must match")
    function uncast(Δraw)
        Δ = unthunk(Δraw)
        dx = first.(first.(derivatives_given_output.(fx, f, x))) .* gy .* Δ
        dy = if g==identity
            # first.(first.(derivatives_given_output.(gy, g, y))) .* fx .* Δ
            fx .* Δ
        else
            # we are free to overwrite gy, although should really check eltypes
            gy .= first.(first.(derivatives_given_output.(gy, g, y))) .* fx .* Δ
        end
        (NoTangent(), NoTangent(), dx, NoTangent(), dy)
    end
    fx .* gy, uncast
end
Zygote.refresh()
@btime gradient(sumprodcast, tanh, $a, sigmoid, $b);   # 213.916 μs (83 allocations: 314.94 KiB)  # -3 copies
@btime gradient(sumprodcast, identity, $a, tanh, $b);  # 149.625 μs (79 allocations: 236.73 KiB)

With tanh or sigmoid, the forward is expensive and the gradient cheap, so long as you still have the output, which is why this is less than completely fused. For relu you could do better.
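For reference, the gradient is cheap given the output because both derivatives can be written in terms of the output alone, which is what derivatives_given_output exploits. A tiny numerical check (assuming NNlib's sigmoid):

using NNlib: sigmoid

x, ϵ = 0.3, 1e-6
y = tanh(x);    (tanh(x + ϵ) - tanh(x - ϵ)) / 2ϵ ≈ 1 - y^2          # tanh′(x) = 1 - y^2
z = sigmoid(x); (sigmoid(x + ϵ) - sigmoid(x - ϵ)) / 2ϵ ≈ z*(1 - z)  # σ′(x) = y*(1 - y)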

With ForwardDiff, you could have a fused forward pass but I think you then end up storing an array of tuples, containing the sensitivities, which is more memory than storing fx. Maybe it can be fewer kernel launches, not sure.

If you fused this with multigate, then instead of dx and dy you could write directly into the view of the larger array.

@ToucheSir
Member Author

Ok, benchmarks with multigate are up, new vis here. Modulo a couple of anomalies**, this PR is faster across the board. Cf. the same two figures from the last post:

LSTM on GPU:
[figure: lstm_gpu]

GRU on CPU:
[figure: gru_cpu]

** there were a couple of runs where end-to-end performance on master was abnormally fast, way faster than the sum of forward and backwards passes separately. This machine is not exactly a noise-free benchmarking environment, so I don't think they're significant.

@ToucheSir
Member Author

With ForwardDiff, you could have a fused forward pass but I think you then end up storing an array of tuples, containing the sensitivities, which is more memory than storing fx. Maybe it can be fewer kernel launches, not sure.

Good to know, I assumed it could switch to an SoA layout but presumably that is not the case. In fairness, writing out the rrule by hand isn't particularly difficult either. I think the cost-benefit of retaining fused broadcasts is worth the extra complexity, so if there's enough interest I can look into filing a follow-up PR. Longer term it would be great if we could pull in Enzyme + KernelAbstractions for this, but that's a discussion for another thread ;)

@mcabbott
Member

mcabbott commented Nov 7, 2021

Even with hand-written fusion, my point is that either you run tanh twice (which is expensive) or you store its result somewhere until the backward pass (which costs memory). I think the ideal fusion story is still not as nice as the ordinary one without gradients. At least for expensive functions; for +, *, relu you don't need to make this choice.

What you could save with SoA might be the number of separate kernel launches, maybe; I'm not sure. On the CPU this is super easy with StructArrays, in fact.

But certainly another PR. Nice that this one gets you 10-20%.

@sdobber
Contributor

sdobber commented Nov 9, 2021

I know I'm late to the party - sorry for that. I ran some of the benchmarks from FluxBench, and got about 5% faster runtimes for backward passes on my CPU. I think that's pretty nice, considering that there is also other stuff going on in the models apart from calling RNNs. Unfortunately, I don't have access to a reasonable GPU to try it there.

| Timing [ms] | Fwd bc/gate-view | Bwd bc/gate-view | Fwd master | Bwd master | Fwd speedup | Bwd speedup |
|---|---|---|---|---|---|---|
| DARNN | 25.235 | 663.732 | 26.188 | 695.946 | 3.6 % | 4.6 % |
| LSTNet | 3.32 | 95.005 | 3.34 | 102.026 | 0.6 % | 6.9 % |
| TPA-LSTM | 5.872 | 356.226 | 7.377 | 375.483 | 20.4 % | 5.1 % |

@DhairyaLGandhi
Member

The activations are separate; we shouldn't switch those out for different versions by default, I'd imagine.

@ToucheSir
Member Author

ToucheSir commented Nov 10, 2021

Thanks all for the feedback and benchmarking, I think this is good to go.

@mcabbott re benchmarking harness, I just created a new file https://gist.github.com/ToucheSir/2fecbe99e5b304fed11e25e42c535cc3#file-rnn-jl under https://github.com/FluxML/Flux.jl/tree/master/perf and ran only those tests locally. A more robust approach would be to use FluxBench for this since it includes RNN models from FluxArchitectures, but I don't recall if that benchmarking can be made to run for a specific PR.

# AD-friendly helper for dividing monolithic RNN params into equally sized gates
multigate(x::AbstractArray, h, ::Val{N}) where N = ntuple(n -> gate(x,h,n), N)

@adjoint function multigate(x::AbstractArray, h, c)
Member

Do we strictly need this adjoint?

Member Author

It's the difference between being slower on certain configurations and being strictly faster across all configurations, c.f. before and after. The more calls to gate, the more pronounced the effect: note how GRU cells called gate 6-8 times and also regressed the most (on smaller input sizes) without multigate.

Member

The better answer would be to see what part of gate regressed and fix that instead.

@ToucheSir
Member Author

ToucheSir commented Nov 10, 2021

The change to gate itself was one line: https://github.com/FluxML/Flux.jl/pull/1761/files#diff-54816e9a4978b8c02648fdb29ebfd6d794452dbac8a28d0e84a5e2cc646a3fbfR4. Since view and getindex use the same adjoint, there's no reason the backwards pass should be slower (note the forwards pass was consistently faster). Thus the only explanations seem to be a benchmarking artifact (note how this shows up only at smaller input sizes) and/or Zygote's compiler being unhappy for some reason (from profiling, almost all of the non-BLAS, non-activation self time is spent in the generated Pullback in both cases).

However, what it did expose is that calling gate multiple times, regardless of whether it uses view or slicing, was inefficient, as the adjoint would allocate a full-sized buffer for the original array on every call. multigate resolves this by allocating only once, thus reducing both memory and (accumulation) compute by a factor of the number of gates. Even if gate weren't using view, this would be a worthwhile optimization.
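To illustrate the effect with a hypothetical micro-benchmark (illustrative names, not Flux internals):

using Zygote, BenchmarkTools

gateinds(h, n) = (1:h) .+ h*(n-1)
gateview(x, h, n) = view(x, gateinds(h, n), :)

# Four separate slice pullbacks: each allocates a zero array the size of the full
# parent, and the four full-size gradients are then pairwise accumulated.
fourslices(x, h) = sum(gateview(x, h, 1)) + sum(gateview(x, h, 2)) +
                   sum(gateview(x, h, 3)) + sum(gateview(x, h, 4))

x = rand(Float32, 4*64, 256)
@btime Zygote.gradient(x -> fourslices(x, 64), $x);

A multigate-style adjoint instead reassembles all gate gradients into one parent-sized buffer (a single vcat here), so both the allocation count and the accumulation work drop by roughly the number of gates.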

@ToucheSir
Member Author

Are we good to go here? Any more concerns/comments?

@mcabbott
Member

mcabbott commented Dec 13, 2021

bors r+

Error on 1.7 is now:

GroupedConvTranspose Layer GPU grad test: Error During Test at /var/lib/buildkite-agent/builds/gpuci-16/julialang/flux-dot-jl/test/cuda/layers.jl:49
  | Unexpected Pass
  | Expression: ≈(y_gpu, y_cpu, rtol = 0.001f0, atol = 0.001f0)

bors bot added a commit that referenced this pull request Dec 13, 2021
1761: Use view for RNN gate slice extraction r=mcabbott a=ToucheSir

Co-authored-by: Brian Chen <ToucheSir@users.noreply.github.com>
@bors
Contributor

bors bot commented Dec 13, 2021

Build failed:

@ToucheSir
Member Author

I was trying to investigate that in #1808, hope to get back to it in a few days.

@github-actions
Contributor

Once the build has completed, you can preview any updated documentation at this URL: https://fluxml.ai/Flux.jl/previews/PR1761/ in ~20 minutes


@mcabbott
Member

bors r+

@bors
Contributor

bors bot commented Jan 16, 2022

Build succeeded:

@bors bors bot merged commit 75e3771 into master Jan 16, 2022
@DhairyaLGandhi DhairyaLGandhi deleted the bc/gate-view branch January 16, 2022 20:41
@ToucheSir ToucheSir mentioned this pull request Feb 24, 2022