This example from Discourse shows a slowdown when broadcasting a moderately complicated expression directly, compared to broadcasting a function containing the same expression:
using BenchmarkTools

arrayfun!(C, A, B) = @. C = A^2 + B^2 + A * B + A / B - A * B - A / B + A * B + A / B - A * B - A / B
scalarfun(A::Real, B::Real) = A^2 + B^2 + A * B + A / B - A * B - A / B + A * B + A / B - A * B - A / B
let N = 151
    A, B, C1, C2 = (rand(N, N, N) .+ 1 for _ in 1:4)
    @btime arrayfun!($C1, $A, $B)
    @btime $C2 .= scalarfun.($A, $B)
    C1 ≈ C2
end
# 17.306 ms (11 allocations: 352 bytes)
# 5.900 ms (0 allocations: 0 bytes)
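For reference, here is a rough sketch of how differently the two forms are represented before materialization, using the internal Base.Broadcast.broadcasted helper (the tiny arrays and the simplified a^2 + b^2 body are only for illustration, not taken from the benchmark above): the @.-style expression builds a nested tree of Broadcasted nodes, while the scalarfun-style call is a single node.

using Base.Broadcast: broadcasted

a, b = rand(2, 2) .+ 1, rand(2, 2) .+ 1
# roughly what @. a^2 + b^2 builds before materialize: a tree of Broadcasted nodes
nested = broadcasted(+, broadcasted(^, a, 2), broadcasted(^, b, 2))
# roughly what scalarfun.(a, b) builds: one node holding the whole kernel
flat = broadcasted((x, y) -> x^2 + y^2, a, b)
typeof(nested)   # Broadcasted wrapping two inner Broadcasted arguments
typeof(flat)     # a single Broadcasted over a and b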
The effect seems fairly robust; it's not particular to 3D arrays, nor to A^2. Replacing @. with explicit dots (.+, .* etc.) helps a bit (which, according to #29120, removes the n-ary +; here n<=4):
arrayfun!(C, A, B) = C .= A.^2 .+ B.^2 .+ A .* B .+ A ./ B .- A .* B .- A ./ B .+ A .* B .+ A ./ B .- A .* B .- A ./ B
# 17.345 ms (0 allocations: 0 bytes)
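One way to compare what the two spellings actually lower to is Meta.@lower, which works purely syntactically and does not need the arrays to exist; this is only a diagnostic sketch (with a shortened expression), and the lowered output is not reproduced here:

# how the @. spelling lowers:
Meta.@lower @. C = A * B + A / B + B^2
# how the explicitly dotted spelling lowers (cf. #29120):
Meta.@lower C .= A .* B .+ A ./ B .+ B .^ 2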
Simpler expressions also show the slowdown, but without the allocations:
arrayfun!(C, A, B) = @. C = A^2 + B^2 + A * B + A / B
scalarfun(A::Real, B::Real) = A^2 + B^2 + A * B + A / B
# 3.148 ms (0 allocations: 0 bytes)
# 971.000 μs (0 allocations: 0 bytes)
Even simpler expressions like arrayfun!(C, A, B) = @. C = A^2 + B^2
show no slowdown at all.
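For completeness, a sketch of the same benchmark harness for that simplest case; the names arrayfun2! and scalarfun2 are introduced here only to avoid redefining the methods above, and the timings are intentionally not filled in:

arrayfun2!(C, A, B) = @. C = A^2 + B^2
scalarfun2(A::Real, B::Real) = A^2 + B^2
let N = 151
    A, B, C1, C2 = (rand(N, N, N) .+ 1 for _ in 1:4)
    @btime arrayfun2!($C1, $A, $B)
    @btime $C2 .= scalarfun2.($A, $B)
    C1 ≈ C2
end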