Description
The teh/pc branch of my fork of VectorizedRNG seems to show worse performance with a precompile workload than without one. The teh/pc branch has that workload commented out, and in that state I get (on an old CPU):
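For context, a precompile workload in a Julia package typically looks something like the following. This is a hypothetical sketch only (the actual block in the teh/pc branch may differ in both the calls it exercises and whether it uses PrecompileTools at all):

```julia
# Hypothetical sketch of a precompile workload; the real commented-out
# block in the teh/pc branch may look different.
using PrecompileTools  # assumption: the workload uses PrecompileTools

@setup_workload begin
    x = Vector{Float64}(undef, 255)
    @compile_workload begin
        # Exercise the hot path so compiled code is cached at precompile time.
        rand!(local_rng(), x)
    end
end
```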
julia> using VectorizedRNG, Random
julia> drng = Random.default_rng(); lrng = local_rng();
julia> x64 = Vector{Float64}(undef, 255);
julia> using BenchmarkTools
julia> @bprofile rand!($lrng, $x64) samples=10000 evals=1
BenchmarkTools.Trial: 10000 samples with 1 evaluation.
Range (min … max): 134.000 ns … 706.949 μs ┊ GC (min … max): 0.00% … 0.00%
Time (median): 139.000 ns ┊ GC (median): 0.00%
Time (mean ± σ): 229.106 ns ± 7.303 μs ┊ GC (mean ± σ): 0.00% ± 0.00%
Memory estimate: 0 bytes, allocs estimate: 0.
The key figure is the 139 ns median run time. (I deleted the plot of the histogram since that doesn't transfer well to GitHub.) Whereas when I uncomment the precompile block and run the same commands, I get
julia> @bprofile rand!($lrng, $x64) samples=10000 evals=1
BenchmarkTools.Trial: 10000 samples with 1 evaluation.
Range (min … max): 1.594 μs … 967.359 μs ┊ GC (min … max): 0.00% … 0.00%
Time (median): 1.615 μs ┊ GC (median): 0.00%
Time (mean ± σ): 2.547 μs ± 24.092 μs ┊ GC (mean ± σ): 0.00% ± 0.00%
Memory estimate: 0 bytes, allocs estimate: 0.
i.e., seemingly more than an order of magnitude slower.
I say seemingly because there's something quite interesting about the profile. Here's a snippet:
Fast version:
╎ ╎ ╎ ╎ ╎ ╎ ╎ ╎ ╎ 13 /home/tim/src/julia-branch/src/interpreter.c:774; jl_interpret_toplevel_thunk
╎ ╎ ╎ ╎ ╎ ╎ ╎ ╎ ╎ 13 /home/tim/src/julia-branch/src/interpreter.c:543; eval_body
╎ ╎ ╎ ╎ ╎ ╎ ╎ ╎ ╎ 13 /home/tim/src/julia-branch/src/interpreter.c:488; eval_body
╎ ╎ ╎ ╎ ╎ ╎ ╎ ╎ ╎ 13 /home/tim/src/julia-branch/src/interpreter.c:222; eval_value
╎ ╎ ╎ ╎ ╎ ╎ ╎ ╎ ╎ 13 /home/tim/src/julia-branch/src/interpreter.c:125; do_call
╎ ╎ ╎ ╎ ╎ ╎ ╎ ╎ ╎ ╎ 13 /home/tim/src/julia-branch/src/julia.h:1969; jl_apply
╎ ╎ ╎ ╎ ╎ ╎ ╎ ╎ ╎ ╎ 13 /home/tim/src/julia-branch/src/gf.c:3070; ijl_apply_generic
╎ ╎ ╎ ╎ ╎ ╎ ╎ ╎ ╎ ╎ 13 /home/tim/src/julia-branch/src/gf.c:2876; _jl_invoke
╎ ╎ ╎ ╎ ╎ ╎ ╎ ╎ ╎ ╎ 13 /home/tim/src/julia-branch/src/gf.c:2867; invoke_codeinst
╎ ╎ ╎ ╎ ╎ ╎ ╎ ╎ ╎ ╎ 13 [unknown stackframe]
╎ ╎ ╎ ╎ ╎ ╎ ╎ ╎ ╎ ╎ ╎ 13 @BenchmarkTools/src/execution.jl:117; run(b::BenchmarkTools.Benchmark)
╎ ╎ ╎ ╎ ╎ ╎ ╎ ╎ ╎ ╎ ╎ 13 @BenchmarkTools/src/execution.jl:117; run
╎ ╎ ╎ ╎ ╎ ╎ ╎ ╎ ╎ ╎ ╎ 13 @BenchmarkTools/src/execution.jl:117; run(b::BenchmarkTools.Benchmark, p::BenchmarkTools.Parameters; progressid::Nothing, nleaves::Fl...
╎ ╎ ╎ ╎ ╎ ╎ ╎ ╎ ╎ ╎ ╎ 13 @BenchmarkTools/src/execution.jl:34; run_result
╎ ╎ ╎ ╎ ╎ ╎ ╎ ╎ ╎ ╎ ╎ 13 @BenchmarkTools/src/execution.jl:34; #run_result#45
╎ ╎ ╎ ╎ ╎ ╎ ╎ ╎ ╎ ╎ ╎ ╎ 13 @Base/essentials.jl:884; invokelatest
╎ ╎ ╎ ╎ ╎ ╎ ╎ ╎ ╎ ╎ ╎ ╎ 13 @Base/essentials.jl:887; #invokelatest#2
╎ ╎ ╎ ╎ ╎ ╎ ╎ ╎ ╎ ╎ ╎ ╎ 13 /home/tim/src/julia-branch/src/builtins.c:812; jl_f__call_latest
╎ ╎ ╎ ╎ ╎ ╎ ╎ ╎ ╎ ╎ ╎ ╎ 13 /home/tim/src/julia-branch/src/julia.h:1969; jl_apply
╎ ╎ ╎ ╎ ╎ ╎ ╎ ╎ ╎ ╎ ╎ ╎ 13 /home/tim/src/julia-branch/src/gf.c:3070; ijl_apply_generic
╎ ╎ ╎ ╎ ╎ ╎ ╎ ╎ ╎ ╎ ╎ ╎ ╎ 13 /home/tim/src/julia-branch/src/gf.c:2876; _jl_invoke
╎ ╎ ╎ ╎ ╎ ╎ ╎ ╎ ╎ ╎ ╎ ╎ ╎ 13 /home/tim/src/julia-branch/src/gf.c:2867; invoke_codeinst
╎ ╎ ╎ ╎ ╎ ╎ ╎ ╎ ╎ ╎ ╎ ╎ ╎ 13 [unknown stackframe]
╎ ╎ ╎ ╎ ╎ ╎ ╎ ╎ ╎ ╎ ╎ ╎ ╎ 13 @BenchmarkTools/src/execution.jl:92; _run(b::BenchmarkTools.Benchmark, p::BenchmarkTools.Parameters)
╎ ╎ ╎ ╎ ╎ ╎ ╎ ╎ ╎ ╎ ╎ ╎ ╎ 13 @BenchmarkTools/src/execution.jl:105; _run(b::BenchmarkTools.Benchmark, p::BenchmarkTools.Parameters; verbose::Bool, pad::Stri...
whereas the slow version is:
╎ ╎ ╎ ╎ ╎ ╎ ╎ ╎ ╎ 28 /home/tim/src/julia-branch/src/interpreter.c:774; jl_interpret_toplevel_thunk
╎ ╎ ╎ ╎ ╎ ╎ ╎ ╎ ╎ 28 /home/tim/src/julia-branch/src/interpreter.c:543; eval_body
╎ ╎ ╎ ╎ ╎ ╎ ╎ ╎ ╎ 28 /home/tim/src/julia-branch/src/interpreter.c:488; eval_body
╎ ╎ ╎ ╎ ╎ ╎ ╎ ╎ ╎ 28 /home/tim/src/julia-branch/src/interpreter.c:222; eval_value
╎ ╎ ╎ ╎ ╎ ╎ ╎ ╎ ╎ 28 /home/tim/src/julia-branch/src/interpreter.c:125; do_call
╎ ╎ ╎ ╎ ╎ ╎ ╎ ╎ ╎ ╎ 28 /home/tim/src/julia-branch/src/julia.h:1969; jl_apply
╎ ╎ ╎ ╎ ╎ ╎ ╎ ╎ ╎ ╎ 28 /home/tim/src/julia-branch/src/gf.c:3070; ijl_apply_generic
╎ ╎ ╎ ╎ ╎ ╎ ╎ ╎ ╎ ╎ 28 /home/tim/src/julia-branch/src/gf.c:2876; _jl_invoke
╎ ╎ ╎ ╎ ╎ ╎ ╎ ╎ ╎ ╎ 28 /home/tim/src/julia-branch/src/gf.c:2867; invoke_codeinst
╎ ╎ ╎ ╎ ╎ ╎ ╎ ╎ ╎ ╎ 28 [unknown stackframe]
╎ ╎ ╎ ╎ ╎ ╎ ╎ ╎ ╎ ╎ ╎ 28 @BenchmarkTools/src/execution.jl:117; run(b::BenchmarkTools.Benchmark)
╎ ╎ ╎ ╎ ╎ ╎ ╎ ╎ ╎ ╎ ╎ 28 @BenchmarkTools/src/execution.jl:117; run
╎ ╎ ╎ ╎ ╎ ╎ ╎ ╎ ╎ ╎ ╎ 28 @BenchmarkTools/src/execution.jl:117; run(b::BenchmarkTools.Benchmark, p::BenchmarkTools.Parameters; progressid::Nothing, nleaves::Fl...
╎ ╎ ╎ ╎ ╎ ╎ ╎ ╎ ╎ ╎ ╎ 28 @BenchmarkTools/src/execution.jl:34; run_result
╎ ╎ ╎ ╎ ╎ ╎ ╎ ╎ ╎ ╎ ╎ 28 @BenchmarkTools/src/execution.jl:34; #run_result#45
╎ ╎ ╎ ╎ ╎ ╎ ╎ ╎ ╎ ╎ ╎ ╎ 28 @Base/essentials.jl:884; invokelatest
╎ ╎ ╎ ╎ ╎ ╎ ╎ ╎ ╎ ╎ ╎ ╎ 28 @Base/essentials.jl:887; #invokelatest#2
╎ ╎ ╎ ╎ ╎ ╎ ╎ ╎ ╎ ╎ ╎ ╎ 28 /home/tim/src/julia-branch/src/builtins.c:812; jl_f__call_latest
╎ ╎ ╎ ╎ ╎ ╎ ╎ ╎ ╎ ╎ ╎ ╎ 28 /home/tim/src/julia-branch/src/julia.h:1969; jl_apply
╎ ╎ ╎ ╎ ╎ ╎ ╎ ╎ ╎ ╎ ╎ ╎ 28 /home/tim/src/julia-branch/src/gf.c:3070; ijl_apply_generic
╎ ╎ ╎ ╎ ╎ ╎ ╎ ╎ ╎ ╎ ╎ ╎ ╎ 28 /home/tim/src/julia-branch/src/gf.c:2876; _jl_invoke
╎ ╎ ╎ ╎ ╎ ╎ ╎ ╎ ╎ ╎ ╎ ╎ ╎ 28 /home/tim/src/julia-branch/src/gf.c:2867; invoke_codeinst
╎ ╎ ╎ ╎ ╎ ╎ ╎ ╎ ╎ ╎ ╎ ╎ ╎ 28 [unknown stackframe]
╎ ╎ ╎ ╎ ╎ ╎ ╎ ╎ ╎ ╎ ╎ ╎ ╎ 28 @BenchmarkTools/src/execution.jl:92; _run(b::BenchmarkTools.Benchmark, p::BenchmarkTools.Parameters)
╎ ╎ ╎ ╎ ╎ ╎ ╎ ╎ ╎ ╎ ╎ ╎ ╎ 28 @BenchmarkTools/src/execution.jl:105; _run(b::BenchmarkTools.Benchmark, p::BenchmarkTools.Parameters; verbose::Bool, pad::Stri...
╎ ╎ ╎ ╎ ╎ ╎ ╎ ╎ ╎ ╎ ╎ ╎ ╎ ╎ 25 /home/tim/src/julia-branch/src/gf.c:3070; ijl_apply_generic
╎ ╎ ╎ ╎ ╎ ╎ ╎ ╎ ╎ ╎ ╎ ╎ ╎ ╎ 25 /home/tim/src/julia-branch/src/gf.c:2876; _jl_invoke
╎ ╎ ╎ ╎ ╎ ╎ ╎ ╎ ╎ ╎ ╎ ╎ ╎ ╎ 25 /home/tim/src/julia-branch/src/gf.c:2867; invoke_codeinst
╎ ╎ ╎ ╎ ╎ ╎ ╎ ╎ ╎ ╎ ╎ ╎ ╎ ╎ 18 [unknown stackframe]
╎ ╎ ╎ ╎ ╎ ╎ ╎ ╎ ╎ ╎ ╎ ╎ ╎ ╎ 16 @BenchmarkTools/src/execution.jl:495; var"##sample#236"(::Tuple{VectorizedRNG.Xoshiro{2}, Vector{Float64}}, __params::Benc...
╎ ╎ ╎ ╎ ╎ ╎ ╎ ╎ ╎ ╎ ╎ ╎ ╎ ╎ ╎ 16 @BenchmarkTools/src/execution.jl:489; var"##core#235"(lrng#233::VectorizedRNG.Xoshiro{2}, x64#234::Vector{Float64})
╎ ╎ ╎ ╎ ╎ ╎ ╎ ╎ ╎ ╎ ╎ ╎ ╎ ╎ ╎ 16 @VectorizedRNG/src/api.jl:366; rand!(rng::VectorizedRNG.Xoshiro{2}, x::Vector{Float64}, α::Static.StaticInt{0}, β::...
╎ ╎ ╎ ╎ ╎ ╎ ╎ ╎ ╎ ╎ ╎ ╎ ╎ ╎ ╎ 14 @VectorizedRNG/src/api.jl:366; rand!(rng::VectorizedRNG.Xoshiro{2}, x::Vector{Float64}, α::Static.StaticInt{0}, β:...
╎ ╎ ╎ ╎ ╎ ╎ ╎ ╎ ╎ ╎ ╎ ╎ ╎ ╎ ╎ 12 @VectorizedRNG/src/api.jl:314; samplevector!
╎ ╎ ╎ ╎ ╎ ╎ ╎ ╎ ╎ ╎ ╎ ╎ ╎ ╎ ╎ 8 @VectorizedRNG/src/api.jl:216; random_uniform
╎ ╎ ╎ ╎ ╎ ╎ ╎ ╎ ╎ ╎ ╎ ╎ ╎ ╎ ╎ ╎ 8 @VectorizedRNG/src/api.jl:34; random_uniform
╎ ╎ ╎ ╎ ╎ ╎ ╎ ╎ ╎ ╎ ╎ ╎ ╎ ╎ ╎ ╎ 7 @VectorizedRNG/src/masks.jl:66; floatbitmask
╎ ╎ ╎ ╎ ╎ ╎ ╎ ╎ ╎ ╎ ╎ ╎ ╎ ╎ ╎ ╎ 6 @VectorizedRNG/src/masks.jl:34; setbits
╎ ╎ ╎ ╎ ╎ ╎ ╎ ╎ ╎ ╎ ╎ ╎ ╎ ╎ ╎ ╎ 6 @VectorizedRNG/src/masks.jl:33; setbits
There is a puzzle here: if the two versions differ in run time by an order of magnitude, the slower one should accumulate roughly ten times as many profile samples, yet the counts are approximately the same. (At top-level, the slow one has 30 samples and the fast one 28.) Perhaps samples=10000 evals=1 is insufficient to ensure that they are running the same workload, but it's not entirely clear.
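One way to rule out BenchmarkTools' adaptive sampling as the confound would be to time a fixed amount of work directly in both configurations. A sketch (using only Base's @elapsed, so the two runs are guaranteed to execute identical workloads):

```julia
# Sketch: measure a fixed number of calls directly, bypassing
# BenchmarkTools' sampling, so both configurations do identical work.
using VectorizedRNG, Random

x64 = Vector{Float64}(undef, 255)
lrng = local_rng()
rand!(lrng, x64)  # warm up / force compilation

t = @elapsed for _ in 1:10_000
    rand!(lrng, x64)
end
println("mean time per call: ", t / 10_000 * 1e9, " ns")
```

If the fast and slow configurations still differ by an order of magnitude under this measurement, the regression is in the generated code itself rather than in how the benchmark harness sampled it.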
CC @chriselrod