Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Slow code and excessive allocations when backpropagating broadcasting #592

Closed
Sleort opened this issue Apr 8, 2020 · 11 comments
Closed

Comments

@Sleort
Copy link
Contributor

Sleort commented Apr 8, 2020

There seems to be a bug lurking somewhere in a recent update of Zygote and/or dependencies (I cannot really pinpoint it, but I noticed a significant slowdown of some code when coming back to a project after a few weeks absence). One (the?) issue seems to be in differentiating broadcasted expressions, even simple ones:

julia> using Zygote, BenchmarkTools

julia> function f(x)
          y = sin.(x)
          sum(y)
       end

julia> x = rand(Float32, 1000_000);

julia> @benchmark f(x)
BenchmarkTools.Trial: 
  memory estimate:  3.81 MiB
  allocs estimate:  3
  --------------
  minimum time:     5.135 ms (0.00% GC)
  median time:      5.323 ms (0.00% GC)
  mean time:        5.421 ms (1.51% GC)
  maximum time:     7.019 ms (13.66% GC)
  --------------
  samples:          922
  evals/sample:     1

julia> @benchmark gradient(f, x)
BenchmarkTools.Trial: 
  memory estimate:  68.67 MiB
  allocs estimate:  3000055   #Note the number of allocations here!
  --------------
  minimum time:     63.148 ms (2.50% GC)   #... The evaluation time is an order of magnitude larger than expected (see below)
  median time:      64.717 ms (3.63% GC)
  mean time:        64.495 ms (3.62% GC)
  maximum time:     69.714 ms (2.22% GC)
  --------------
  samples:          78
  evals/sample:     1

julia> @benchmark Zygote.pullback(f, x)
BenchmarkTools.Trial: 
  memory estimate:  15.26 MiB
  allocs estimate:  29
  --------------
  minimum time:     7.143 ms (0.00% GC)   #Note that just constructing the pullback itself has a reasonable (?) overhead
  median time:      7.348 ms (0.00% GC)
  mean time:        7.675 ms (4.31% GC)
  maximum time:     9.309 ms (14.22% GC)
  --------------
  samples:          651
  evals/sample:     1

julia> @benchmark cos.(rand(Float32, 1000_000)) #An approximate estimate of cost of evaluating the backpropagator
BenchmarkTools.Trial: 
  memory estimate:  7.63 MiB
  allocs estimate:  4
  --------------
  minimum time:     5.945 ms (0.00% GC)
  median time:      6.143 ms (0.00% GC)
  mean time:        6.323 ms (2.69% GC)
  maximum time:     7.929 ms (0.00% GC)
  --------------
  samples:          790
  evals/sample:     1

As you can see from the number of allocations and the execution time, there must be something wrong with the evaluation of the backpropagator.

I haven't been able to uncover the source of this issue, but if anyone could point me in the right direction, I'm happy to contribute in solving this.


Version info:

(@v1.4) pkg> st Zygote
Status `~/.julia/environments/v1.4/Project.toml`
  [e88e6eb3] Zygote v0.4.13 #master (https://github.com/FluxML/Zygote.jl.git)
@Sleort
Copy link
Contributor Author

Sleort commented Apr 8, 2020

It seems that this issue originates in the poor performance of the general fallback for @adjoint broadcasted (this, or some of its dependencies must have changed recently?). If I define a custom adjoint, the problem goes away. For example:

julia> using Zygote, BenchmarkTools

julia> using Zygote: @adjoint

julia> using Base.Broadcast: broadcasted

julia> mysin(x) = sin(x)
mysin (generic function with 1 method)

julia> @adjoint function broadcasted(::typeof(mysin), x)
         mysin.(x), ȳ -> (nothing, ȳ .* cos.(x))
       end

julia> function f(x)
           y = sin.(x)
           sum(y)
       end
f (generic function with 1 method)

julia> function g(x) #calculates the same as f(x), but with my own adjoint
           y = mysin.(x)
           sum(y)
       end
g (generic function with 1 method)

julia> x = randn(Float32, 1000_000);

julia> @btime f($x);
  9.482 ms (2 allocations: 3.81 MiB)

julia> @btime gradient(f, $x);
  70.203 ms (3000047 allocations: 68.67 MiB)

julia> @btime g($x);
  9.501 ms (2 allocations: 3.81 MiB)

julia> @btime gradient(g, $x);
  18.903 ms (20 allocations: 7.63 MiB)  #This is what we want/naively expect!

I don't have a good understanding of the inner workings of the general fallback... Is it possible to resolve this issue, or must a custom broadcasting adjoint be implemented case-by-case?

(The fallback must have gotten worse, though, as this issue probably is the underlying reason for why #588 was needed now, but hasn't been needed/noticed before...)

@DhairyaLGandhi
Copy link
Member

Maybe we can add a message to the general fallback saying that adding an adjoint might help with performance there

@Sleort
Copy link
Contributor Author

Sleort commented Apr 9, 2020

Maybe we can add a message to the general fallback saying that adding an adjoint might help with performance there

Some questions:

  1. Does that mean that whenever one needs good broadcasting performance, the advice is that one should add a custom adjoint, even when the broadcasted function itself has an adjoint? Is that the current recommendation?
  2. Should I submit PRs to add new custom adjoints as I come by them, e.g. the sin example above? (That could quickly lead to a very long list of custom adjoints, which doesn't seem like a very elegant solution...)
  3. This didn't seem as necessary a month or so ago. Has something changed, or did I just not notice before? (The performance regression of my (CuArrays centric) code could of course have other origins...)

@CarloLucibello
Copy link
Member

I'm not aware of any recent change for the broadcast infrastructure. Could you perform the same benchmarks on some previous Zygote release?

If performance is the same (which is still an issue, but not a regression), maybe the slowdown you observe is related to CuArrays being updated to version 2 recently.

@Sleort
Copy link
Contributor Author

Sleort commented Apr 9, 2020

If performance is the same (which is still an issue, but not a regression), maybe the slowdown you observe is related to CuArrays being updated to version 2 recently.

Yeah, I'm starting to suspect that this could be the main reason. Looking into it now.

@Sleort
Copy link
Contributor Author

Sleort commented Apr 9, 2020

If performance is the same (which is still an issue, but not a regression), maybe the slowdown you observe is related to CuArrays being updated to version 2 recently.

Yeah, I'm starting to suspect that this could be the main reason. Looking into it now.

It seems that the main performance issue when dealing with CuArrays was just (more or less) solved by JuliaGPU/CUDAnative.jl#624.

The issue with having to construct custom broadcasting adjoints if maximum performance is desired, still stands, though. (But yeah, as you said @CarloLucibello, that's not really a regression of the Zygote code.)

@DhairyaLGandhi
Copy link
Member

We don't always need custom adjoints for broadcasting. As you noticed, the change is due to how dependencies are updated. This is rather an indication that performance sensitive components should more or less be bound to having the necessary optimisations in place.

@DhairyaLGandhi
Copy link
Member

Operations that are performance sensitive and don't perform well with current assumptions, would however need the custom adjoints. Many are even computed at compile time.

@cossio
Copy link
Contributor

cossio commented Apr 25, 2020

Related: #338, #336

@cossio
Copy link
Contributor

cossio commented Oct 1, 2021

Does #1001 close this?

@ToucheSir
Copy link
Member

On v0.6.49:

julia> @benchmark gradient(f, $x)
BenchmarkTools.Trial: 508 samples with 1 evaluation.
 Range (min … max):  8.900 ms … 14.084 ms  ┊ GC (min … max): 0.00% … 7.95%
 Time  (median):     8.962 ms              ┊ GC (median):    0.00%
 Time  (mean ± σ):   9.836 ms ±  1.404 ms  ┊ GC (mean ± σ):  3.78% ± 5.96%

  █▃          ▁▃            ▃▂                             ▄  
  ██▆▅▄▁▁▁▁▁▁▁██▅▄▁▁▁▁▁▁▁▄▁▁██▅▄▁▁▁▁▄▁▁▁▁▁▁▁▁▁▁▄▄▁▁▁▁▁▁▁▄▁▇█ ▆
  8.9 ms       Histogram: log(frequency) by time     13.1 ms <

 Memory estimate: 15.26 MiB, allocs estimate: 29.

julia> @benchmark cos.($x)
BenchmarkTools.Trial: 933 samples with 1 evaluation.
 Range (min … max):  5.260 ms …   6.525 ms  ┊ GC (min … max): 0.00% … 19.23%
 Time  (median):     5.271 ms               ┊ GC (median):    0.00%
 Time  (mean ± σ):   5.351 ms ± 283.560 μs  ┊ GC (mean ± σ):  1.40% ±  4.39%

  █▄                                                       ▂   
  ███▇▇▇▆▅▃▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▆█▇ ▇
  5.26 ms      Histogram: log(frequency) by time      6.43 ms <

 Memory estimate: 3.81 MiB, allocs estimate: 2.

So I'd agree both the slowness and allocations have been addressed.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

6 participants