Performance Optimization #138
Conversation
Benchmarks on master:
Benchmarks on this PR:
The allocations were only reduced a little. There must be some other places where memory is allocated.
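(Aside, not from the thread: one way to locate such allocation sites, assuming Julia ≥ 1.8, is the allocation profiler; `my_workload` below is a placeholder for the real code.)

```julia
using Profile

my_workload() = sum(Any[rand() for _ in 1:10_000])   # placeholder workload

# Sample a fraction of allocations while running the workload.
Profile.Allocs.@profile sample_rate=0.01 my_workload()
results = Profile.Allocs.fetch()
# Inspect `results.allocs` directly, or visualize with PProf.jl:
# using PProf; PProf.Allocs.pprof(results)
```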
Hmm, the benchmark results are not very convincing 😥
Indeed, maybe we are not on the right track...
I've looked further into the contents of `inputs = map(x -> val(_lookup(tf, x)), instr.input)`, which is one of the allocation culprits. It turns out that the eltypes of `inputs` can be `DataType`, `Box`, `Matrix`, and many more. This means `inputs` becomes a vector of pointers, and every element has to be heap-allocated. Even if we rewrite things so that the vector itself is preallocated, we only save the allocations for the vector; we do not save the allocations for the elements the pointers point to, and those cost much more since they are bigger.
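A minimal sketch (mine, not from the PR) of the effect described above: with an abstract element type, every element is a separately allocated heap box, whereas a concretely typed vector stores its elements inline in a single buffer.

```julia
using BenchmarkTools

make_boxed(n)  = Any[Float64(i) for i in 1:n]   # Vector{Any}: one heap box per element
make_inline(n) = [Float64(i) for i in 1:n]      # Vector{Float64}: one flat buffer

@btime make_boxed(1_000);    # roughly a thousand allocations: the buffer plus a box per element
@btime make_inline(1_000);   # a single allocation for the buffer
```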
I think the elements are already stored in …
Memory profile result: https://paste.debian.net/1237788/
Thanks, @KDr2 - can you list the hotspots and try to fix them?
I tried to inspect `TypedFunction` (the hand-written replacement for `FunctionWrapper`):

```julia
# const TypedFunction = FunctionWrapper
struct TypedFunction{OT, IT<:Tuple}
    func::Function
    retval::Base.RefValue{OT}
    TypedFunction{OT, IT}(f::Function) where {OT, IT<:Tuple} = new{OT, IT}(f, Ref{OT}())
end

function (f::TypedFunction{OT, IT})(args...) where {OT, IT<:Tuple}
    output = f.func(args...)
    # 1. conversion/type-assertion -> allocation
    retv = f.retval
    retv[] = OT === Nothing ? nothing : output
    return retv[]
end

getter(bindings::Dict, v::Symbol) = bindings[v]
setter(bindings::Dict, v::Symbol, c) = bindings[v] = c

bs = Dict(:a => 1, :b => "x")

gint = TypedFunction{Int, Tuple{Dict, Symbol}}(getter)
gstr = TypedFunction{String, Tuple{Dict, Symbol}}(getter)
sint = TypedFunction{Int, Tuple{Dict, Symbol, Int}}(setter)
sstr = TypedFunction{String, Tuple{Dict, Symbol, String}}(setter)

using BenchmarkTools
@btime gint(bs, :a)
@btime gstr(bs, :b)
@btime sint(bs, :a, 1)
@btime sstr(bs, :b, "y")
```

Both …
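As a hedged aside (not the PR's code): the `conversion/type-assertion -> allocation` comment above points at two instability sources, the abstractly typed `func::Function` field and the untyped assignment into the `Ref`. A sketch of one way to remove both, using a hypothetical `TypedFunction2` that stores the callable concretely (the `Nothing` special case is omitted for brevity):

```julia
struct TypedFunction2{F, OT, IT<:Tuple}
    func::F                       # concrete callable type: the call can be inferred
    retval::Base.RefValue{OT}
end

# Hypothetical helper; infers F from the callable itself.
typed_function(f::F, ::Type{OT}, ::Type{IT}) where {F, OT, IT<:Tuple} =
    TypedFunction2{F, OT, IT}(f, Ref{OT}())

function (f::TypedFunction2{F, OT, IT})(args...) where {F, OT, IT<:Tuple}
    f.retval[] = f.func(args...)::OT   # assertion keeps the Ref assignment concrete
    return f.retval[]
end

bs2 = Dict(:a => 1, :b => "x")
g2 = typed_function((d, v) -> d[v], Int, Tuple{Dict, Symbol})
g2(bs2, :a)   # 1; no dynamic dispatch on the stored callable
```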
The latest commit eliminates allocations from …
Thanks @KDr2 - nice improvements. I included the results below. Allocations for … Results on master and with e1529f1 (below):

```
benchmarking rosenbrock...
Run Original Function: 667.309 μs (16 allocations: 6.10 MiB)
Run TapedFunction: 845.212 μs (114 allocations: 6.11 MiB)
Run TapedFunction (compiled): 780.911 μs (143 allocations: 6.11 MiB)
Run TapedTask: #produce=1; 1.150 ms (234 allocations: 6.12 MiB)
benchmarking ackley...
Run Original Function: 2.042 ms (0 allocations: 0 bytes)
Run TapedFunction: 915.595 ms (6798568 allocations: 205.97 MiB)
Run TapedFunction (compiled): 856.984 ms (8798595 allocations: 367.72 MiB)
Run TapedTask: #produce=100000; 2.121 s (7298712 allocations: 216.66 MiB)
benchmarking matrix_test...
Run Original Function: 152.102 μs (16 allocations: 469.22 KiB)
Run TapedFunction: 187.503 μs (136 allocations: 473.12 KiB)
Run TapedFunction (compiled): 193.502 μs (168 allocations: 476.05 KiB)
Run TapedTask: #produce=1; 470.607 μs (268 allocations: 483.55 KiB)
benchmarking neural_net...
Run Original Function: 706.436 ns (4 allocations: 576 bytes)
Run TapedFunction: 5.117 μs (21 allocations: 1.19 KiB)
Run TapedFunction (compiled): 4.414 μs (31 allocations: 2.02 KiB)
Run TapedTask: #produce=1; 185.902 μs (113 allocations: 6.56 KiB)
```
The latest results: …
Time consumption and memory allocation per instruction: https://paste.debian.net/1241862/.
It seems that this change (which impacts vector bindings) brings a small performance decrease in the Turing.jl unit tests, even though the work is done at compile time; maybe a large number of TapedFunctions are generated during the tests. BTW, docs are added.
Many thanks, @KDr2, for the hard work! Also thanks to @devmotion and @rikhuijzer for the comments!
The first attempt uses a generated function to run instructions; a sketch of the idea follows.
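A hedged sketch of that idea (`Instruction` and `run_instruction` are illustrative stand-ins, not the package's actual types): a generated function can unroll the input lookups at compile time, so no intermediate `inputs` vector is allocated.

```julia
struct Instruction{F, N}
    func::F
    input::NTuple{N, Int}   # slot indices into the tape's bindings
    output::Int
end

@generated function run_instruction(bindings::Vector{Any}, instr::Instruction{F, N}) where {F, N}
    # Unroll `bindings[instr.input[1]], ..., bindings[instr.input[N]]` at compile time.
    args = [:(bindings[instr.input[$i]]) for i in 1:N]
    return quote
        bindings[instr.output] = instr.func($(args...))
        nothing
    end
end

b   = Any[nothing, 1.0, 2.0, nothing]
ins = Instruction(+, (2, 3), 4)
run_instruction(b, ins)   # b[4] == 3.0, with no temporary inputs vector
```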