Skip to content

Significant performance regression when precompiled #50749

Closed
@timholy

Description

@timholy

Using the teh/pc branch of my fork of VectorizedRNG seems to show worse performance with a precompile workload than without. The teh/pc branch has that workload commented-out, and I get (on an old CPU):

julia> using VectorizedRNG, Random

julia> drng = Random.default_rng(); lrng = local_rng();

julia> x64 = Vector{Float64}(undef, 255);

julia> using BenchmarkTools

julia> @bprofile rand!($lrng, $x64) samples=10000 evals=1
BenchmarkTools.Trial: 10000 samples with 1 evaluation.
 Range (min  max):  134.000 ns  706.949 μs  ┊ GC (min  max): 0.00%  0.00%
 Time  (median):     139.000 ns               ┊ GC (median):    0.00%
 Time  (mean ± σ):   229.106 ns ±   7.303 μs  ┊ GC (mean ± σ):  0.00% ± 0.00%
 Memory estimate: 0 bytes, allocs estimate: 0.

The key thing is 139ns median run time. (I deleted the plot of the histogram since that doesn't transfer well to github.) Whereas when I uncomment the precompile block and use the same commands, I get

julia> @bprofile rand!($lrng, $x64) samples=10000 evals=1
BenchmarkTools.Trial: 10000 samples with 1 evaluation.
 Range (min  max):  1.594 μs  967.359 μs  ┊ GC (min  max): 0.00%  0.00%
 Time  (median):     1.615 μs               ┊ GC (median):    0.00%
 Time  (mean ± σ):   2.547 μs ±  24.092 μs  ┊ GC (mean ± σ):  0.00% ± 0.00%
 Memory estimate: 0 bytes, allocs estimate: 0.

i.e., seemingly more than an order of magnitude slower.

I say seemingly because there's something quite interesting about the profile. Here's a snippet:

Fast version:

 ╎    ╎    ╎    ╎    ╎    ╎    ╎    ╎    ╎ 13 /home/tim/src/julia-branch/src/interpreter.c:774; jl_interpret_toplevel_thunk
 ╎    ╎    ╎    ╎    ╎    ╎    ╎    ╎    ╎  13 /home/tim/src/julia-branch/src/interpreter.c:543; eval_body
 ╎    ╎    ╎    ╎    ╎    ╎    ╎    ╎    ╎   13 /home/tim/src/julia-branch/src/interpreter.c:488; eval_body
 ╎    ╎    ╎    ╎    ╎    ╎    ╎    ╎    ╎    13 /home/tim/src/julia-branch/src/interpreter.c:222; eval_value
 ╎    ╎    ╎    ╎    ╎    ╎    ╎    ╎    ╎     13 /home/tim/src/julia-branch/src/interpreter.c:125; do_call
 ╎    ╎    ╎    ╎    ╎    ╎    ╎    ╎    ╎    ╎ 13 /home/tim/src/julia-branch/src/julia.h:1969; jl_apply
 ╎    ╎    ╎    ╎    ╎    ╎    ╎    ╎    ╎    ╎  13 /home/tim/src/julia-branch/src/gf.c:3070; ijl_apply_generic
 ╎    ╎    ╎    ╎    ╎    ╎    ╎    ╎    ╎    ╎   13 /home/tim/src/julia-branch/src/gf.c:2876; _jl_invoke
 ╎    ╎    ╎    ╎    ╎    ╎    ╎    ╎    ╎    ╎    13 /home/tim/src/julia-branch/src/gf.c:2867; invoke_codeinst
 ╎    ╎    ╎    ╎    ╎    ╎    ╎    ╎    ╎    ╎     13 [unknown stackframe]
 ╎    ╎    ╎    ╎    ╎    ╎    ╎    ╎    ╎    ╎    ╎ 13 @BenchmarkTools/src/execution.jl:117; run(b::BenchmarkTools.Benchmark)
 ╎    ╎    ╎    ╎    ╎    ╎    ╎    ╎    ╎    ╎    ╎  13 @BenchmarkTools/src/execution.jl:117; run
 ╎    ╎    ╎    ╎    ╎    ╎    ╎    ╎    ╎    ╎    ╎   13 @BenchmarkTools/src/execution.jl:117; run(b::BenchmarkTools.Benchmark, p::BenchmarkTools.Parameters; progressid::Nothing, nleaves::Fl...
 ╎    ╎    ╎    ╎    ╎    ╎    ╎    ╎    ╎    ╎    ╎    13 @BenchmarkTools/src/execution.jl:34; run_result
 ╎    ╎    ╎    ╎    ╎    ╎    ╎    ╎    ╎    ╎    ╎     13 @BenchmarkTools/src/execution.jl:34; #run_result#45
 ╎    ╎    ╎    ╎    ╎    ╎    ╎    ╎    ╎    ╎    ╎    ╎ 13 @Base/essentials.jl:884; invokelatest
 ╎    ╎    ╎    ╎    ╎    ╎    ╎    ╎    ╎    ╎    ╎    ╎  13 @Base/essentials.jl:887; #invokelatest#2
 ╎    ╎    ╎    ╎    ╎    ╎    ╎    ╎    ╎    ╎    ╎    ╎   13 /home/tim/src/julia-branch/src/builtins.c:812; jl_f__call_latest
 ╎    ╎    ╎    ╎    ╎    ╎    ╎    ╎    ╎    ╎    ╎    ╎    13 /home/tim/src/julia-branch/src/julia.h:1969; jl_apply
 ╎    ╎    ╎    ╎    ╎    ╎    ╎    ╎    ╎    ╎    ╎    ╎     13 /home/tim/src/julia-branch/src/gf.c:3070; ijl_apply_generic
 ╎    ╎    ╎    ╎    ╎    ╎    ╎    ╎    ╎    ╎    ╎    ╎    ╎ 13 /home/tim/src/julia-branch/src/gf.c:2876; _jl_invoke
 ╎    ╎    ╎    ╎    ╎    ╎    ╎    ╎    ╎    ╎    ╎    ╎    ╎  13 /home/tim/src/julia-branch/src/gf.c:2867; invoke_codeinst
 ╎    ╎    ╎    ╎    ╎    ╎    ╎    ╎    ╎    ╎    ╎    ╎    ╎   13 [unknown stackframe]
 ╎    ╎    ╎    ╎    ╎    ╎    ╎    ╎    ╎    ╎    ╎    ╎    ╎    13 @BenchmarkTools/src/execution.jl:92; _run(b::BenchmarkTools.Benchmark, p::BenchmarkTools.Parameters)
 ╎    ╎    ╎    ╎    ╎    ╎    ╎    ╎    ╎    ╎    ╎    ╎    ╎     13 @BenchmarkTools/src/execution.jl:105; _run(b::BenchmarkTools.Benchmark, p::BenchmarkTools.Parameters; verbose::Bool, pad::Stri...

whereas the slow version is

 ╎    ╎    ╎    ╎    ╎    ╎    ╎    ╎    ╎ 28 /home/tim/src/julia-branch/src/interpreter.c:774; jl_interpret_toplevel_thunk
 ╎    ╎    ╎    ╎    ╎    ╎    ╎    ╎    ╎  28 /home/tim/src/julia-branch/src/interpreter.c:543; eval_body
 ╎    ╎    ╎    ╎    ╎    ╎    ╎    ╎    ╎   28 /home/tim/src/julia-branch/src/interpreter.c:488; eval_body
 ╎    ╎    ╎    ╎    ╎    ╎    ╎    ╎    ╎    28 /home/tim/src/julia-branch/src/interpreter.c:222; eval_value
 ╎    ╎    ╎    ╎    ╎    ╎    ╎    ╎    ╎     28 /home/tim/src/julia-branch/src/interpreter.c:125; do_call
 ╎    ╎    ╎    ╎    ╎    ╎    ╎    ╎    ╎    ╎ 28 /home/tim/src/julia-branch/src/julia.h:1969; jl_apply
 ╎    ╎    ╎    ╎    ╎    ╎    ╎    ╎    ╎    ╎  28 /home/tim/src/julia-branch/src/gf.c:3070; ijl_apply_generic
 ╎    ╎    ╎    ╎    ╎    ╎    ╎    ╎    ╎    ╎   28 /home/tim/src/julia-branch/src/gf.c:2876; _jl_invoke
 ╎    ╎    ╎    ╎    ╎    ╎    ╎    ╎    ╎    ╎    28 /home/tim/src/julia-branch/src/gf.c:2867; invoke_codeinst
 ╎    ╎    ╎    ╎    ╎    ╎    ╎    ╎    ╎    ╎     28 [unknown stackframe]
 ╎    ╎    ╎    ╎    ╎    ╎    ╎    ╎    ╎    ╎    ╎ 28 @BenchmarkTools/src/execution.jl:117; run(b::BenchmarkTools.Benchmark)
 ╎    ╎    ╎    ╎    ╎    ╎    ╎    ╎    ╎    ╎    ╎  28 @BenchmarkTools/src/execution.jl:117; run
 ╎    ╎    ╎    ╎    ╎    ╎    ╎    ╎    ╎    ╎    ╎   28 @BenchmarkTools/src/execution.jl:117; run(b::BenchmarkTools.Benchmark, p::BenchmarkTools.Parameters; progressid::Nothing, nleaves::Fl...
 ╎    ╎    ╎    ╎    ╎    ╎    ╎    ╎    ╎    ╎    ╎    28 @BenchmarkTools/src/execution.jl:34; run_result
 ╎    ╎    ╎    ╎    ╎    ╎    ╎    ╎    ╎    ╎    ╎     28 @BenchmarkTools/src/execution.jl:34; #run_result#45
 ╎    ╎    ╎    ╎    ╎    ╎    ╎    ╎    ╎    ╎    ╎    ╎ 28 @Base/essentials.jl:884; invokelatest
 ╎    ╎    ╎    ╎    ╎    ╎    ╎    ╎    ╎    ╎    ╎    ╎  28 @Base/essentials.jl:887; #invokelatest#2
 ╎    ╎    ╎    ╎    ╎    ╎    ╎    ╎    ╎    ╎    ╎    ╎   28 /home/tim/src/julia-branch/src/builtins.c:812; jl_f__call_latest
 ╎    ╎    ╎    ╎    ╎    ╎    ╎    ╎    ╎    ╎    ╎    ╎    28 /home/tim/src/julia-branch/src/julia.h:1969; jl_apply
 ╎    ╎    ╎    ╎    ╎    ╎    ╎    ╎    ╎    ╎    ╎    ╎     28 /home/tim/src/julia-branch/src/gf.c:3070; ijl_apply_generic
 ╎    ╎    ╎    ╎    ╎    ╎    ╎    ╎    ╎    ╎    ╎    ╎    ╎ 28 /home/tim/src/julia-branch/src/gf.c:2876; _jl_invoke
 ╎    ╎    ╎    ╎    ╎    ╎    ╎    ╎    ╎    ╎    ╎    ╎    ╎  28 /home/tim/src/julia-branch/src/gf.c:2867; invoke_codeinst
 ╎    ╎    ╎    ╎    ╎    ╎    ╎    ╎    ╎    ╎    ╎    ╎    ╎   28 [unknown stackframe]
 ╎    ╎    ╎    ╎    ╎    ╎    ╎    ╎    ╎    ╎    ╎    ╎    ╎    28 @BenchmarkTools/src/execution.jl:92; _run(b::BenchmarkTools.Benchmark, p::BenchmarkTools.Parameters)
 ╎    ╎    ╎    ╎    ╎    ╎    ╎    ╎    ╎    ╎    ╎    ╎    ╎     28 @BenchmarkTools/src/execution.jl:105; _run(b::BenchmarkTools.Benchmark, p::BenchmarkTools.Parameters; verbose::Bool, pad::Stri...
 ╎    ╎    ╎    ╎    ╎    ╎    ╎    ╎    ╎    ╎    ╎    ╎    ╎    ╎ 25 /home/tim/src/julia-branch/src/gf.c:3070; ijl_apply_generic
 ╎    ╎    ╎    ╎    ╎    ╎    ╎    ╎    ╎    ╎    ╎    ╎    ╎    ╎  25 /home/tim/src/julia-branch/src/gf.c:2876; _jl_invoke
 ╎    ╎    ╎    ╎    ╎    ╎    ╎    ╎    ╎    ╎    ╎    ╎    ╎    ╎   25 /home/tim/src/julia-branch/src/gf.c:2867; invoke_codeinst
 ╎    ╎    ╎    ╎    ╎    ╎    ╎    ╎    ╎    ╎    ╎    ╎    ╎    ╎    18 [unknown stackframe]
 ╎    ╎    ╎    ╎    ╎    ╎    ╎    ╎    ╎    ╎    ╎    ╎    ╎    ╎     16 @BenchmarkTools/src/execution.jl:495; var"##sample#236"(::Tuple{VectorizedRNG.Xoshiro{2}, Vector{Float64}}, __params::Benc...
 ╎    ╎    ╎    ╎    ╎    ╎    ╎    ╎    ╎    ╎    ╎    ╎    ╎    ╎    ╎ 16 @BenchmarkTools/src/execution.jl:489; var"##core#235"(lrng#233::VectorizedRNG.Xoshiro{2}, x64#234::Vector{Float64})
 ╎    ╎    ╎    ╎    ╎    ╎    ╎    ╎    ╎    ╎    ╎    ╎    ╎    ╎    ╎  16 @VectorizedRNG/src/api.jl:366; rand!(rng::VectorizedRNG.Xoshiro{2}, x::Vector{Float64}, α::Static.StaticInt{0}, β::...
 ╎    ╎    ╎    ╎    ╎    ╎    ╎    ╎    ╎    ╎    ╎    ╎    ╎    ╎    ╎   14 @VectorizedRNG/src/api.jl:366; rand!(rng::VectorizedRNG.Xoshiro{2}, x::Vector{Float64}, α::Static.StaticInt{0}, β:...
 ╎    ╎    ╎    ╎    ╎    ╎    ╎    ╎    ╎    ╎    ╎    ╎    ╎    ╎    ╎    12 @VectorizedRNG/src/api.jl:314; samplevector!
 ╎    ╎    ╎    ╎    ╎    ╎    ╎    ╎    ╎    ╎    ╎    ╎    ╎    ╎    ╎     8  @VectorizedRNG/src/api.jl:216; random_uniform
 ╎    ╎    ╎    ╎    ╎    ╎    ╎    ╎    ╎    ╎    ╎    ╎    ╎    ╎    ╎    ╎ 8  @VectorizedRNG/src/api.jl:34; random_uniform
 ╎    ╎    ╎    ╎    ╎    ╎    ╎    ╎    ╎    ╎    ╎    ╎    ╎    ╎    ╎    ╎  7  @VectorizedRNG/src/masks.jl:66; floatbitmask
 ╎    ╎    ╎    ╎    ╎    ╎    ╎    ╎    ╎    ╎    ╎    ╎    ╎    ╎    ╎    ╎   6  @VectorizedRNG/src/masks.jl:34; setbits
 ╎    ╎    ╎    ╎    ╎    ╎    ╎    ╎    ╎    ╎    ╎    ╎    ╎    ╎    ╎    ╎    6  @VectorizedRNG/src/masks.jl:33; setbits

There is a puzzle here: if the two differ by an order of magnitude, why is the number of samples approximately the same? (At top-level, the slow one has 30 samples and the fast one 28.) Perhaps samples=10000 evals=1 is insufficient to ensure that they are running the same workload, but it's not entirely clear.

CC @chriselrod

Metadata

Metadata

Assignees

No one assigned

    Labels

    compiler:precompilationPrecompilation of modulesperformanceMust go fasterregressionRegression in behavior compared to a previous version

    Type

    No type

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions