Slow code and excessive allocations when backpropagating broadcasting #592

Sleort · 2020-04-08T10:01:52Z

There seems to be a bug lurking somewhere in a recent update of Zygote and/or dependencies (I cannot really pinpoint it, but I noticed a significant slowdown of some code when coming back to a project after a few weeks absence). One (the?) issue seems to be in differentiating broadcasted expressions, even simple ones:

julia> using Zygote, BenchmarkTools

julia> function f(x)
          y = sin.(x)
          sum(y)
       end

julia> x = rand(Float32, 1000_000);

julia> @benchmark f(x)
BenchmarkTools.Trial: 
  memory estimate:  3.81 MiB
  allocs estimate:  3
  --------------
  minimum time:     5.135 ms (0.00% GC)
  median time:      5.323 ms (0.00% GC)
  mean time:        5.421 ms (1.51% GC)
  maximum time:     7.019 ms (13.66% GC)
  --------------
  samples:          922
  evals/sample:     1

julia> @benchmark gradient(f, x)
BenchmarkTools.Trial: 
  memory estimate:  68.67 MiB
  allocs estimate:  3000055   #Note the number of allocations here!
  --------------
  minimum time:     63.148 ms (2.50% GC)   #... The evaluation time is an order of magnitude larger than expected (see below)
  median time:      64.717 ms (3.63% GC)
  mean time:        64.495 ms (3.62% GC)
  maximum time:     69.714 ms (2.22% GC)
  --------------
  samples:          78
  evals/sample:     1

julia> @benchmark Zygote.pullback(f, x)
BenchmarkTools.Trial: 
  memory estimate:  15.26 MiB
  allocs estimate:  29
  --------------
  minimum time:     7.143 ms (0.00% GC)   #Note that just constructing the pullback itself has a reasonable (?) overhead
  median time:      7.348 ms (0.00% GC)
  mean time:        7.675 ms (4.31% GC)
  maximum time:     9.309 ms (14.22% GC)
  --------------
  samples:          651
  evals/sample:     1

julia> @benchmark cos.(rand(Float32, 1000_000)) #An approximate estimate of cost of evaluating the backpropagator
BenchmarkTools.Trial: 
  memory estimate:  7.63 MiB
  allocs estimate:  4
  --------------
  minimum time:     5.945 ms (0.00% GC)
  median time:      6.143 ms (0.00% GC)
  mean time:        6.323 ms (2.69% GC)
  maximum time:     7.929 ms (0.00% GC)
  --------------
  samples:          790
  evals/sample:     1

As you can see from the number of allocations and the execution time, there must be something wrong with the evaluation of the backpropagator.

I haven't been able to uncover the source of this issue, but if anyone could point me in the right direction, I'm happy to contribute in solving this.

Version info:

(@v1.4) pkg> st Zygote
Status `~/.julia/environments/v1.4/Project.toml`
  [e88e6eb3] Zygote v0.4.13 #master (https://github.com/FluxML/Zygote.jl.git)

Sleort · 2020-04-08T13:33:22Z

It seems that this issue originates in the poor performance of the general fallback for @adjoint broadcasted (this, or some of its dependencies must have changed recently?). If I define a custom adjoint, the problem goes away. For example:

julia> using Zygote, BenchmarkTools

julia> using Zygote: @adjoint

julia> using Base.Broadcast: broadcasted

julia> mysin(x) = sin(x)
mysin (generic function with 1 method)

julia> @adjoint function broadcasted(::typeof(mysin), x)
         mysin.(x), ȳ -> (nothing, ȳ .* cos.(x))
       end

julia> function f(x)
           y = sin.(x)
           sum(y)
       end
f (generic function with 1 method)

julia> function g(x) #calculates the same as f(x), but with my own adjoint
           y = mysin.(x)
           sum(y)
       end
g (generic function with 1 method)

julia> x = randn(Float32, 1000_000);

julia> @btime f($x);
  9.482 ms (2 allocations: 3.81 MiB)

julia> @btime gradient(f, $x);
  70.203 ms (3000047 allocations: 68.67 MiB)

julia> @btime g($x);
  9.501 ms (2 allocations: 3.81 MiB)

julia> @btime gradient(g, $x);
  18.903 ms (20 allocations: 7.63 MiB)  #This is what we want/naively expect!

I don't have a good understanding of the inner workings of the general fallback... Is it possible to resolve this issue, or must a custom broadcasting adjoint be implemented case-by-case?

(The fallback must have gotten worse, though, as this issue probably is the underlying reason for why #588 was needed now, but hasn't been needed/noticed before...)

DhairyaLGandhi · 2020-04-08T14:04:07Z

Maybe we can add a message to the general fallback saying that adding an adjoint might help with performance there

Sleort · 2020-04-09T05:19:44Z

Maybe we can add a message to the general fallback saying that adding an adjoint might help with performance there

Some questions:

Does that mean that whenever one needs good broadcasting performance, the advice is that one should add a custom adjoint, even when the broadcasted function itself has an adjoint? Is that the current recommendation?
Should I submit PRs to add new custom adjoints as I come by them, e.g. the sin example above? (That could quickly lead to a very long list of custom adjoints, which doesn't seem like a very elegant solution...)
This didn't seem as necessary a month or so ago. Has something changed, or did I just not notice before? (The performance regression of my (CuArrays centric) code could of course have other origins...)

CarloLucibello · 2020-04-09T08:31:14Z

I'm not aware of any recent change for the broadcast infrastructure. Could you perform the same benchmarks on some previous Zygote release?

If performance is the same (which is still an issue, but not a regression), maybe the slowdown you observe is related to CuArrays being updated to version 2 recently.

Sleort · 2020-04-09T08:36:52Z

If performance is the same (which is still an issue, but not a regression), maybe the slowdown you observe is related to CuArrays being updated to version 2 recently.

Yeah, I'm starting to suspect that this could be the main reason. Looking into it now.

Sleort · 2020-04-09T13:12:48Z

If performance is the same (which is still an issue, but not a regression), maybe the slowdown you observe is related to CuArrays being updated to version 2 recently.

Yeah, I'm starting to suspect that this could be the main reason. Looking into it now.

It seems that the main performance issue when dealing with CuArrays was just (more or less) solved by JuliaGPU/CUDAnative.jl#624.

The issue with having to construct custom broadcasting adjoints if maximum performance is desired, still stands, though. (But yeah, as you said @CarloLucibello, that's not really a regression of the Zygote code.)

DhairyaLGandhi · 2020-04-09T13:17:11Z

We don't always need custom adjoints for broadcasting. As you noticed, the change is due to how dependencies are updated. This is rather an indication that performance sensitive components should more or less be bound to having the necessary optimisations in place.

DhairyaLGandhi · 2020-04-09T13:19:06Z

Operations that are performance sensitive and don't perform well with current assumptions, would however need the custom adjoints. Many are even computed at compile time.

cossio · 2020-04-25T14:10:42Z

Related: #338, #336

cossio · 2021-10-01T12:42:37Z

Does #1001 close this?

ToucheSir · 2022-11-17T15:25:14Z

On v0.6.49:

julia> @benchmark gradient(f, $x)
BenchmarkTools.Trial: 508 samples with 1 evaluation.
 Range (min … max):  8.900 ms … 14.084 ms  ┊ GC (min … max): 0.00% … 7.95%
 Time  (median):     8.962 ms              ┊ GC (median):    0.00%
 Time  (mean ± σ):   9.836 ms ±  1.404 ms  ┊ GC (mean ± σ):  3.78% ± 5.96%

  █▃          ▁▃            ▃▂                             ▄  
  ██▆▅▄▁▁▁▁▁▁▁██▅▄▁▁▁▁▁▁▁▄▁▁██▅▄▁▁▁▁▄▁▁▁▁▁▁▁▁▁▁▄▄▁▁▁▁▁▁▁▄▁▇█ ▆
  8.9 ms       Histogram: log(frequency) by time     13.1 ms <

 Memory estimate: 15.26 MiB, allocs estimate: 29.

julia> @benchmark cos.($x)
BenchmarkTools.Trial: 933 samples with 1 evaluation.
 Range (min … max):  5.260 ms …   6.525 ms  ┊ GC (min … max): 0.00% … 19.23%
 Time  (median):     5.271 ms               ┊ GC (median):    0.00%
 Time  (mean ± σ):   5.351 ms ± 283.560 μs  ┊ GC (mean ± σ):  1.40% ±  4.39%

  █▄                                                       ▂   
  ███▇▇▇▆▅▃▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▆█▇ ▇
  5.26 ms      Histogram: log(frequency) by time      6.43 ms <

 Memory estimate: 3.81 MiB, allocs estimate: 2.

So I'd agree both the slowness and allocations have been addressed.

yebai mentioned this issue Apr 27, 2020

Simple benchmarking for Tracker, ReverseDiff, and Zygote TuringLang/Turing.jl#1140

Closed

mcabbott mentioned this issue Jun 19, 2021

Faster generic broadcasting #1001

Merged

mcabbott added the performance label Jul 22, 2022

ToucheSir closed this as completed Nov 17, 2022

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Slow code and excessive allocations when backpropagating broadcasting #592

Slow code and excessive allocations when backpropagating broadcasting #592

Sleort commented Apr 8, 2020

Sleort commented Apr 8, 2020

DhairyaLGandhi commented Apr 8, 2020

Sleort commented Apr 9, 2020

CarloLucibello commented Apr 9, 2020

Sleort commented Apr 9, 2020

Sleort commented Apr 9, 2020

DhairyaLGandhi commented Apr 9, 2020

DhairyaLGandhi commented Apr 9, 2020

cossio commented Apr 25, 2020

cossio commented Oct 1, 2021

ToucheSir commented Nov 17, 2022

Slow code and excessive allocations when backpropagating broadcasting #592

Slow code and excessive allocations when backpropagating broadcasting #592

Comments

Sleort commented Apr 8, 2020

Sleort commented Apr 8, 2020

DhairyaLGandhi commented Apr 8, 2020

Sleort commented Apr 9, 2020

CarloLucibello commented Apr 9, 2020

Sleort commented Apr 9, 2020

Sleort commented Apr 9, 2020

DhairyaLGandhi commented Apr 9, 2020

DhairyaLGandhi commented Apr 9, 2020

cossio commented Apr 25, 2020

cossio commented Oct 1, 2021

ToucheSir commented Nov 17, 2022