Description
Consider the situation where you have a function that does two things and returns both outputs, but in the outer function you only use one of them.
We would like the optimizer to remove the work being done for the part you don't use.
It can do that on nightly, but not when the unused work is a subpart of a function being called (not even an inlined one); it only works if the function being eliminated is called directly.
Consider the following code:
```julia
using BenchmarkTools

# Effect inference can't seem to work out that this doesn't throw.
# It does work out that it is effect_free and terminates_globally.
Base.@assume_effects :nothrow foo(x::Int) = [x]

bar(x) = foo(x), 2x

function bar_last_manual(x)
    _ = foo(x)
    return 2x
end
@ballocated bar_last_manual(1)
@code_warntype optimize=true bar_last_manual(1)

function bar_last_inlined(x)
    _, ret = @inline bar(x)
    return ret
end
@ballocated bar_last_inlined(1)
@code_warntype optimize=true bar_last_inlined(1)
```
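As an aside, the `:nothrow` override above can be compared against what effect inference concludes on its own. A quick way to inspect this (using `Base.infer_effects`, which is an internal, version-dependent API) on a plain version of `foo` without the override:

```julia
# Check what effect inference concludes for a plain version of foo,
# without the @assume_effects override. Base.infer_effects is an
# internal compiler API, so its exact display may vary across versions.
plain_foo(x::Int) = [x]

effects = Base.infer_effects(plain_foo, (Int,))
show(effects)  # expect effect_free and terminates to be set, but not nothrow
```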
We can see that `bar_last_manual` behaves as expected: the allocation is eliminated, leaving just the multiplication.
```julia
julia> @ballocated bar_last_manual(1)
0

julia> @code_warntype optimize=true bar_last_manual(1)
MethodInstance for bar_last_manual(::Int64)
  from bar_last_manual(x) @ Main REPL[18]:1
Arguments
  #self#::Core.Const(bar_last_manual)
  x::Int64
Body::Int64
1 ─ %1 = Base.mul_int(2, x)::Int64
└──      return %1
```
On the other hand, if we reach `foo` via an inlined call to `bar`, the allocation is not eliminated:
```julia
julia> @ballocated bar_last_inlined(1)
64

julia> @code_warntype optimize=true bar_last_inlined(1)
MethodInstance for bar_last_inlined(::Int64)
  from bar_last_inlined(x) @ Main REPL[21]:1
Arguments
  #self#::Core.Const(bar_last_inlined)
  x::Int64
Locals
  val::Tuple{Vector{Int64}, Int64}
  @_4::Int64
  ret::Int64
Body::Int64
1 ─ %1 = $(Expr(:foreigncall, :(:jl_alloc_array_1d), Vector{Int64}, svec(Any, Int64), 0, :(:ccall), Vector{Int64}, 1, 1))::Vector{Int64}
└──      goto #3 if not true
2 ─      Base.arrayset(false, %1, x, 1)
3 ┄      goto #4
4 ─      goto #5
5 ─ %6 = Base.mul_int(2, x)::Int64
└──      goto #6
6 ─      return %6
```
This is sad.
Solving this would make batched forward-mode in Diffractor much easier, because it would mean we can effectively run just the pushforward part of the frules when we don't use the primal output.
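For concreteness, here is a sketch of the AD pattern this would enable. The names `primal_work`, `my_frule`, and `pushforward_only` are made up for illustration; this is not Diffractor's actual code, just the same shape as the `bar_last_inlined` example above:

```julia
# Hypothetical sketch of the Diffractor motivation. An frule-style
# function computes the primal and the tangent together; when only the
# tangent is needed, we would like the primal-only work (the allocation
# in primal_work) to be eliminated, exactly as in bar_last_inlined above.
Base.@assume_effects :nothrow primal_work(x::Int) = [x]  # allocates

# Returns (primal, tangent), like an frule for y = 2x.
my_frule(x, dx) = primal_work(x), 2dx

function pushforward_only(x, dx)
    _, tangent = @inline my_frule(x, dx)  # primal is discarded...
    return tangent                        # ...but its allocation currently remains
end
```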