-
-
Notifications
You must be signed in to change notification settings - Fork 213
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Slow code and excessive allocations when backpropagating broadcasting #592
Comments
It seems that this issue originates in the poor performance of the general fallback for julia> using Zygote, BenchmarkTools
julia> using Zygote: @adjoint
julia> using Base.Broadcast: broadcasted
julia> mysin(x) = sin(x)
mysin (generic function with 1 method)
julia> @adjoint function broadcasted(::typeof(mysin), x)
mysin.(x), ȳ -> (nothing, ȳ .* cos.(x))
end
julia> function f(x)
y = sin.(x)
sum(y)
end
f (generic function with 1 method)
julia> function g(x) #calculates the same as f(x), but with my own adjoint
y = mysin.(x)
sum(y)
end
g (generic function with 1 method)
julia> x = randn(Float32, 1000_000);
julia> @btime f($x);
9.482 ms (2 allocations: 3.81 MiB)
julia> @btime gradient(f, $x);
70.203 ms (3000047 allocations: 68.67 MiB)
julia> @btime g($x);
9.501 ms (2 allocations: 3.81 MiB)
julia> @btime gradient(g, $x);
18.903 ms (20 allocations: 7.63 MiB) #This is what we want/naively expect! I don't have a good understanding of the inner workings of the general fallback... Is it possible to resolve this issue, or must a custom broadcasting adjoint be implemented case-by-case? (The fallback must have gotten worse, though, as this issue probably is the underlying reason for why #588 was needed now, but hasn't been needed/noticed before...) |
Maybe we can add a message to the general fallback saying that adding an adjoint might help with performance there |
Some questions:
|
I'm not aware of any recent change for the broadcast infrastructure. Could you perform the same benchmarks on some previous Zygote release? If performance is the same (which is still an issue, but not a regression), maybe the slowdown you observe is related to CuArrays being updated to version 2 recently. |
Yeah, I'm starting to suspect that this could be the main reason. Looking into it now. |
It seems that the main performance issue when dealing with CuArrays was just (more or less) solved by JuliaGPU/CUDAnative.jl#624. The issue with having to construct custom broadcasting adjoints if maximum performance is desired, still stands, though. (But yeah, as you said @CarloLucibello, that's not really a regression of the Zygote code.) |
We don't always need custom adjoints for broadcasting. As you noticed, the change is due to how dependencies are updated. This is rather an indication that performance sensitive components should more or less be bound to having the necessary optimisations in place. |
Operations that are performance sensitive and don't perform well with current assumptions, would however need the custom adjoints. Many are even computed at compile time. |
Does #1001 close this? |
On v0.6.49:
So I'd agree both the slowness and allocations have been addressed. |
There seems to be a bug lurking somewhere in a recent update of Zygote and/or dependencies (I cannot really pinpoint it, but I noticed a significant slowdown of some code when coming back to a project after a few weeks absence). One (the?) issue seems to be in differentiating broadcasted expressions, even simple ones:
As you can see from the number of allocations and the execution time, there must be something wrong with the evaluation of the backpropagator.
I haven't been able to uncover the source of this issue, but if anyone could point me in the right direction, I'm happy to contribute in solving this.
Version info:
The text was updated successfully, but these errors were encountered: