Performance Optimization #138
Conversation
Benchmarks on master:
Benchmarks on this PR:
The allocations were only reduced a little. There must be some other places where memory is allocated.
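(Aside, not from the thread: one way to locate such allocation sites, assuming Julia ≥ 1.8, is the allocation profiler; `my_workload` below is a placeholder for the real code.)

```julia
using Profile

my_workload() = sum(Any[rand() for _ in 1:10_000])   # placeholder workload

# Sample a fraction of allocations while running the workload.
Profile.Allocs.@profile sample_rate=0.01 my_workload()
results = Profile.Allocs.fetch()
# Inspect `results.allocs` directly, or visualize with PProf.jl:
# using PProf; PProf.Allocs.pprof(results)
```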
Hmm, the benchmark results are not very convincing 😥
Indeed, maybe we are not on the right track...
I've looked further into the contents of `inputs = map(x -> val(_lookup(tf, x)), instr.input)`, which is one of the allocation culprits. It turns out that the eltypes of `inputs` can be `DataType`, `Box`, `Matrix`, and many more. This means `inputs` becomes a vector of pointers, and every element has to be heap-allocated. Even if we rewrite things so that the vector itself is preallocated, we only save the allocations for the vector; we do not save the allocations for the elements the pointers point to, and those cost much more since they are bigger.
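A minimal sketch (mine, not from the PR) of the effect described above: with an abstract element type, every element is a separately allocated heap box, whereas a concretely typed vector stores its elements inline in a single buffer.

```julia
using BenchmarkTools

make_boxed(n)  = Any[Float64(i) for i in 1:n]   # Vector{Any}: one heap box per element
make_inline(n) = [Float64(i) for i in 1:n]      # Vector{Float64}: one flat buffer

@btime make_boxed(1_000);    # roughly a thousand allocations: the buffer plus a box per element
@btime make_inline(1_000);   # a single allocation for the buffer
```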
I think the elements are already stored in …
Memory profile result: https://paste.debian.net/1237788/
Thanks, @KDr2 - can you list the hotspots and try to fix them?
I tried to inspect `TypedFunction` (the hand-written replacement for `FunctionWrapper`):

```julia
# const TypedFunction = FunctionWrapper
struct TypedFunction{OT, IT<:Tuple}
    func::Function
    retval::Base.RefValue{OT}
    TypedFunction{OT, IT}(f::Function) where {OT, IT<:Tuple} = new{OT, IT}(f, Ref{OT}())
end

function (f::TypedFunction{OT, IT})(args...) where {OT, IT<:Tuple}
    output = f.func(args...)
    # 1. conversion/type-assertion -> allocation
    retv = f.retval
    retv[] = OT === Nothing ? nothing : output
    return retv[]
end

getter(bindings::Dict, v::Symbol) = bindings[v]
setter(bindings::Dict, v::Symbol, c) = bindings[v] = c

bs = Dict(:a => 1, :b => "x")

gint = TypedFunction{Int, Tuple{Dict, Symbol}}(getter)
gstr = TypedFunction{String, Tuple{Dict, Symbol}}(getter)
sint = TypedFunction{Int, Tuple{Dict, Symbol, Int}}(setter)
sstr = TypedFunction{String, Tuple{Dict, Symbol, String}}(setter)

using BenchmarkTools
@btime gint(bs, :a)
@btime gstr(bs, :b)
@btime sint(bs, :a, 1)
@btime sstr(bs, :b, "y")
```

Both …
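As a hedged aside (not the PR's code): the `conversion/type-assertion -> allocation` comment above points at two instability sources, the abstractly typed `func::Function` field and the untyped assignment into the `Ref`. A sketch of one way to remove both, using a hypothetical `TypedFunction2` that stores the callable concretely (the `Nothing` special case is omitted for brevity):

```julia
struct TypedFunction2{F, OT, IT<:Tuple}
    func::F                       # concrete callable type: the call can be inferred
    retval::Base.RefValue{OT}
end

# Hypothetical helper; infers F from the callable itself.
typed_function(f::F, ::Type{OT}, ::Type{IT}) where {F, OT, IT<:Tuple} =
    TypedFunction2{F, OT, IT}(f, Ref{OT}())

function (f::TypedFunction2{F, OT, IT})(args...) where {F, OT, IT<:Tuple}
    f.retval[] = f.func(args...)::OT   # assertion keeps the Ref assignment concrete
    return f.retval[]
end

bs2 = Dict(:a => 1, :b => "x")
g2 = typed_function((d, v) -> d[v], Int, Tuple{Dict, Symbol})
g2(bs2, :a)   # 1; no dynamic dispatch on the stored callable
```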
The latest commit eliminates allocations from …
Thanks @KDr2 - nice improvements. I included the results below. Allocations for … Results on master and with e1529f1 (below):

```
benchmarking rosenbrock...
Run Original Function: 667.309 μs (16 allocations: 6.10 MiB)
Run TapedFunction: 845.212 μs (114 allocations: 6.11 MiB)
Run TapedFunction (compiled): 780.911 μs (143 allocations: 6.11 MiB)
Run TapedTask: #produce=1; 1.150 ms (234 allocations: 6.12 MiB)
benchmarking ackley...
Run Original Function: 2.042 ms (0 allocations: 0 bytes)
Run TapedFunction: 915.595 ms (6798568 allocations: 205.97 MiB)
Run TapedFunction (compiled): 856.984 ms (8798595 allocations: 367.72 MiB)
Run TapedTask: #produce=100000; 2.121 s (7298712 allocations: 216.66 MiB)
benchmarking matrix_test...
Run Original Function: 152.102 μs (16 allocations: 469.22 KiB)
Run TapedFunction: 187.503 μs (136 allocations: 473.12 KiB)
Run TapedFunction (compiled): 193.502 μs (168 allocations: 476.05 KiB)
Run TapedTask: #produce=1; 470.607 μs (268 allocations: 483.55 KiB)
benchmarking neural_net...
Run Original Function: 706.436 ns (4 allocations: 576 bytes)
Run TapedFunction: 5.117 μs (21 allocations: 1.19 KiB)
Run TapedFunction (compiled): 4.414 μs (31 allocations: 2.02 KiB)
Run TapedTask: #produce=1; 185.902 μs (113 allocations: 6.56 KiB)
```
The latest results: …
Time consumption and memory allocation per instruction: https://paste.debian.net/1241862/.
It seems that this change (which impacts vector bindings) brings a small performance decrease in the Turing.jl unit tests, even though the work is done at compile time; maybe a large number of TapedFunctions are generated during the tests. BTW, docs are added.
Many thanks, @KDr2, for the hard work! Also thanks to @devmotion and @rikhuijzer for the comments!
The first attempt uses a generated function to run instructions; a sketch of the idea follows.
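A hedged sketch of that idea (`Instruction` and `run_instruction` are illustrative stand-ins, not the package's actual types): a generated function can unroll the input lookups at compile time, so no intermediate `inputs` vector is allocated.

```julia
struct Instruction{F, N}
    func::F
    input::NTuple{N, Int}   # slot indices into the tape's bindings
    output::Int
end

@generated function run_instruction(bindings::Vector{Any}, instr::Instruction{F, N}) where {F, N}
    # Unroll `bindings[instr.input[1]], ..., bindings[instr.input[N]]` at compile time.
    args = [:(bindings[instr.input[$i]]) for i in 1:N]
    return quote
        bindings[instr.output] = instr.func($(args...))
        nothing
    end
end

b   = Any[nothing, 1.0, 2.0, nothing]
ins = Instruction(+, (2, 3), 4)
run_instruction(b, ins)   # b[4] == 3.0, with no temporary inputs vector
```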