more type-stable type-inference #41697
Conversation
(this PR is the final output of my demo at [our workshop](https://github.com/aviatesk/juliacon2021-workshop-pkgdev))

This PR eliminates many of the runtime dispatches within our type-inference routine that are reported by the following JET analysis:

```julia
using JETTest
const CC = Core.Compiler

function function_filter(@nospecialize(ft))
    ft === typeof(CC.isprimitivetype) && return false
    ft === typeof(CC.ismutabletype) && return false
    ft === typeof(CC.isbitstype) && return false
    ft === typeof(CC.widenconst) && return false
    ft === typeof(CC.widenconditional) && return false
    ft === typeof(CC.widenwrappedconditional) && return false
    ft === typeof(CC.maybe_extract_const_bool) && return false
    ft === typeof(CC.ignorelimited) && return false
    return true
end

function frame_filter((; linfo) = sv)
    meth = linfo.def
    isa(meth, Method) || return true
    return occursin("compiler/", string(meth.file))
end

report_dispatch(CC.typeinf, (CC.NativeInterpreter, CC.InferenceState);
                function_filter, frame_filter)
```

> on master

```
═════ 137 possible errors found ═════
...
```

> on this PR

```
═════ 51 possible errors found ═════
...
```

And it seems like this PR makes the JIT slightly faster:

> on master

```julia
~/julia/julia master
❯ ./usr/bin/julia -e '@time using Plots; @time plot(rand(10,3));'
  3.659865 seconds (7.19 M allocations: 497.982 MiB, 3.94% gc time, 0.39% compilation time)
  2.696410 seconds (3.62 M allocations: 202.905 MiB, 7.49% gc time, 56.39% compilation time)
```

> on this PR

```julia
~/julia/julia avi/jetdemo* 7s
❯ ./usr/bin/julia -e '@time using Plots; @time plot(rand(10,3));'
  3.396974 seconds (7.16 M allocations: 491.442 MiB, 4.80% gc time, 0.28% compilation time)
  2.591130 seconds (3.48 M allocations: 196.026 MiB, 7.29% gc time, 56.72% compilation time)
```
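As a side note, here is a minimal, hypothetical sketch of what this kind of JETTest dispatch analysis reports on a toy function; the `Box` struct and `addone` function below are made up for illustration and are not part of this PR:

```julia
using JETTest

struct Box
    x::Any            # abstractly-typed field: accesses return `Any`
end

addone(b::Box) = b.x + 1   # `+` on an `Any` value cannot be statically resolved

# analogous to the `report_dispatch(CC.typeinf, ...)` call above;
# this reports the runtime dispatch on `+` inside `addone`
report_dispatch(addone, (Box,))
```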
@@ -85,9 +85,12 @@ function abstract_call_gf_by_type(interp::AbstractInterpreter, @nospecialize(f),
push!(edges, edge)
end
this_argtypes = isa(matches, MethodMatches) ? argtypes : matches.applicable_argtypes[i]
const_rt, const_result = abstract_call_method_with_const_args(interp, result, f, this_argtypes, match, sv, false)
This change is less type stable, so it seems harder to optimize. Why is it better?
The main reason is that our optimizer will emit somewhat complex code when we're splatting a union-typed tuple:
julia> code_typed((Union{Tuple{Int,Nothing}, Tuple{Integer,String}},)) do tpl
           if getfield(tpl, 2) !== nothing
               a, b = tpl
               return a, b
           end
           return a, b
       end |> first
CodeInfo(
1 ─ %1 = Main.getfield(tpl, 2)::Union{Nothing, String}
│ %2 = (%1 === Main.nothing)::Bool
│ %3 = Core.Intrinsics.not_int(%2)::Bool
└── goto #9 if not %3
2 ─ %5 = (isa)(tpl, Tuple{Int64, Nothing})::Bool
└── goto #4 if not %5
3 ─ %7 = π (tpl, Tuple{Int64, Nothing})
│ %8 = Base.getfield(%7, 1)::Union{Nothing, Int64}
│ %9 = Base.add_int(1, 1)::Int64
│ %10 = Core.tuple(%8, %9)::Tuple{Union{Nothing, Int64}, Int64}
└── goto #5
4 ─ %12 = Base.indexed_iterate(tpl, 1)::Tuple{Integer, Int64}
└── goto #5
5 ┄ %14 = φ (#3 => %10, #4 => %12)::Tuple{Integer, Int64}
│ %15 = Core.getfield(%14, 1)::Integer
│ %16 = (isa)(tpl, Tuple{Int64, Nothing})::Bool
└── goto #7 if not %16
6 ─ %18 = π (tpl, Tuple{Int64, Nothing})
│ %19 = Base.getfield(%18, 2)::Union{Nothing, Int64}
│ %20 = Base.add_int(2, 1)::Int64
│ %21 = Core.tuple(%19, %20)::Tuple{Union{Nothing, Int64}, Int64}
└── goto #8
7 ─ %23 = Base.indexed_iterate(tpl, 2, 2)::Union{Tuple{Nothing, Int64}, Tuple{String, Int64}}
└── goto #8
8 ┄ %25 = φ (#6 => %21, #7 => %23)::Union{Tuple{Nothing, Int64}, Tuple{String, Int64}}
│ %26 = Core.getfield(%25, 1)::Union{Nothing, String}
│ %27 = Core.tuple(%15, %26)::Tuple{Integer, Union{Nothing, String}}
└── return %27
9 ─ $(Expr(:throw_undef_if_not, :a, false))::Any
└── unreachable
) => Tuple{Integer, Union{Nothing, String}}
julia> code_typed((Union{Nothing, Tuple{Integer,String}},)) do tpl
           if tpl !== nothing
               a, b = tpl
               return a, b
           end
           return nothing
       end |> first
CodeInfo(
1 ─ %1 = (tpl === Main.nothing)::Bool
│ %2 = Core.Intrinsics.not_int(%1)::Bool
└── goto #3 if not %2
2 ─ %4 = π (tpl, Tuple{Integer, String})
│ %5 = Base.getfield(%4, 1)::Integer
│ %6 = π (tpl, Tuple{Integer, String})
│ %7 = Base.getfield(%6, 2)::String
│ %8 = Core.tuple(%5, %7)::Tuple{Integer, String}
└── return %8
3 ─ return Main.nothing
) => Union{Nothing, Tuple{Integer, String}}
So we can get a performance improvement if we branch on `isnothing` and avoid the union-typed tuple:
julia> using BenchmarkTools
julia> function f1(ts)
           local a, b
           for t in ts
               if getfield(t, 2) !== nothing
                   a, b = t
               end
           end
           return a, b
       end
f1 (generic function with 1 method)
julia> function f2(ts)
           local a, b
           for t in ts
               if t !== nothing
                   a, b = t
               end
           end
           return a, b
       end
f2 (generic function with 1 method)
julia> let ts = subtypes(Any)
           ts1 = Union{Tuple{DataType,Nothing},Tuple{Any,Char}}[]
           ts2 = Union{Nothing,Tuple{Any,Char}}[]
           for _ in 1:1000
               if rand(Bool)
                   push!(ts1, (Any, nothing)); push!(ts2, nothing)
               else
                   t = rand(ts), rand(Char)
                   push!(ts1, t); push!(ts2, t)
               end
           end
           display(@benchmark(f1($ts1)))
           ts2 = Union{Nothing,Tuple{Any,Char}}[
               rand(Bool) ? nothing : (rand(ts), rand(Char)) for _ in 1:1000]
           display(@benchmark(f2($ts2)))
       end
BenchmarkTools.Trial: 10000 samples with 1 evaluation.
Range (min … max): 123.927 μs … 4.081 ms ┊ GC (min … max): 0.00% … 95.08%
Time (median): 130.366 μs ┊ GC (median): 0.00%
Time (mean ± σ): 150.662 μs ± 107.727 μs ┊ GC (mean ± σ): 2.69% ± 3.84%
▇█▇▅▄▃▂▁▂▂▂▂▂▂▁▁ ▁▂▂▁▁▁ ▁ ▂
█████████████████████████████████▆█▇█▇▆▇▇▇▆▇▇▆▆▆▆▆▇▅▆▅▅▅▅▅▄▅▅ █
124 μs Histogram: log(frequency) by time 308 μs <
Memory estimate: 82.19 KiB, allocs estimate: 3347.
BenchmarkTools.Trial: 10000 samples with 1 evaluation.
Range (min … max): 17.817 μs … 145.613 μs ┊ GC (min … max): 0.00% … 0.00%
Time (median): 18.958 μs ┊ GC (median): 0.00%
Time (mean ± σ): 21.272 μs ± 8.474 μs ┊ GC (mean ± σ): 0.00% ± 0.00%
▆█▆▄▁▁ ▁
███████▇▇▇▇▇▇▇▇▇▇▇▇▆▆▆▆▅▅▅▆▄▅▄▄▅▆▅▆▆▆▆▇▇▇▆▆▆▇▇▇▇▆▆▃▃▃▄▄▂▅▆▅▅ █
17.8 μs Histogram: log(frequency) by time 57.6 μs <
Memory estimate: 8.08 KiB, allocs estimate: 516.
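To distill the pattern the benchmark compares, here is a hypothetical sketch (the `lookup_*` and `use_nothing_style` names are made up, not code from this PR): returning `nothing` itself on failure, rather than a tuple that merely contains `nothing`, lets callers guard with `=== nothing` and then destructure a concretely-typed tuple:

```julia
# union-of-tuples style: callers must destructure a union-typed tuple
lookup_union_tuple(x) = x > 0 ? (x, "hit") : (Any, nothing)

# `nothing`-on-failure style: callers can branch before destructuring
lookup_nothing(x) = x > 0 ? (x, "hit") : nothing

function use_nothing_style(x)
    r = lookup_nothing(x)
    r === nothing && return nothing
    a, b = r   # after the `=== nothing` check, `r` is a plain Tuple{Int, String}
    return (a, b)
end
```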
Ah, that is not what I expected that code to infer to, as I was expecting `Tuple{Any,Union{Char,Nothing}}`. That particular tuple we could then potentially add codegen optimizations for, so that we don't box it immediately.
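For reference, an illustrative check (added for clarity, not from the thread): the union-of-tuples element type from the benchmark above is strictly more precise than the tuple-of-unions form expected here:

```julia
julia> Union{Tuple{DataType,Nothing}, Tuple{Any,Char}} <: Tuple{Any, Union{Char,Nothing}}
true

julia> Tuple{Any, Union{Char,Nothing}} <: Union{Tuple{DataType,Nothing}, Tuple{Any,Char}}
false
```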
Yes, I agree that special handling of union-typed tuples in the optimizer would be worth doing.
For now I want to have this change in (currently the return type of abstract_call_method_with_const_args
is Union{Tuple{DataType, Nothing}, Tuple{Any, InferenceResult}}).
Ah, I see that. I did not expect that inference could handle this so accurately:
julia> Base.return_types(Core.Compiler.abstract_call_method_with_const_args)
1-element Vector{Any}:
Union{Tuple{Any, Core.Compiler.InferenceResult}, Tuple{DataType, Nothing}}
Yes, do you think we can move this PR forward?