# Document Loop Optimisation Opportunities #156
I still need to add more concrete info about what loop optimisations are possible, but here's a summary of the current state of affairs:
A great advantage of this approach is that we just rely on everything boiling down to the same kind of looping structure in the CFG: basically everything CPU-based that's performant gets reduced to a loop in the CFG (specifically, a thing called a "natural loop" in compiler-optimisation terminology). There are well-established optimisation strategies for loops, so we don't need to implement separate rules for all the different higher-order functions to get good performance, nor do we need to tell people to steer clear of writing `for` or `while` loops. (The situation in which this strategy breaks down is if people use […])
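As a minimal sketch of this point, using only standard Julia tooling (nothing Tapir.jl-specific; `f_loop` and `f_hof` are made-up examples): an explicit loop and a higher-order function should both lower to the same natural-loop structure in the optimised IR, which you can inspect yourself.

```julia
# Both of these should reduce to a single natural loop in the CFG once
# Julia's optimiser has run; compare the typed IR of each.
function f_loop(x)
    s = 0.0
    for v in x
        s += v
    end
    return s
end
f_hof(x) = foldl(+, x; init=0.0)

@code_typed f_loop(randn(16))  # a single back-edge: a natural loop
@code_typed f_hof(randn(16))   # foldl should inline down to the same shape
```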
Tapir.jl does not perform as well as it could on functions like the following:

```julia
function foo!(y::Vector{Float64}, x::Vector{Float64})
    @inbounds @simd for n in eachindex(x)
        y[n] = y[n] + x[n]
    end
    return y
end
```

For example, on my computer:

```julia
using BenchmarkTools  # provides @benchmark

y = randn(4096)
x = randn(4096)
```

```
julia> @benchmark foo!($y, $x)
BenchmarkTools.Trial: 10000 samples with 173 evaluations.
 Range (min … max):  547.150 ns …   3.138 μs  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     646.633 ns               ┊ GC (median):    0.00%
 Time  (mean ± σ):   682.488 ns ± 116.548 ns  ┊ GC (mean ± σ):  0.00% ± 0.00%

        ▄██▂
  ▁▁▂▄▇████▇▇▇▆▆▅▅▅▅▄▃▃▃▂▂▂▂▂▂▂▂▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁ ▂
  547 ns          Histogram: frequency by time          1.18 μs <

 Memory estimate: 0 bytes, allocs estimate: 0.
```
The AD setup:

```julia
rule = Tapir.build_rrule(foo!, y, x);
foo!_d = zero_fcodual(foo!)
y_d = zero_fcodual(y)
x_d = zero_fcodual(x)
out, pb!! = rule(foo!_d, y_d, x_d);
```

And the benchmark for running the forwards- and reverse-passes:
```
julia> @benchmark ($rule)($foo!_d, $y_d, $x_d)[2](NoRData())
BenchmarkTools.Trial: 10000 samples with 1 evaluation.
 Range (min … max):  64.042 μs … 202.237 μs  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     78.675 μs               ┊ GC (median):    0.00%
 Time  (mean ± σ):   75.763 μs ±  10.175 μs  ┊ GC (mean ± σ):  0.00% ± 0.00%

  ▇ ▇ ▇ ▄▂ ▅ ▃ ▆  ▅▆ █▄ ▄   ▁▂ ▂ ▁      ▂                    ▃
  █▃█▃█▄██▁▄▁▆▄▁▁█▄█▆██████████▇▄██▃▅█▃▃█▆▆▇█▆▆▆▇▆▄▁█▃▃▃▅█▅▃▁█ █
  64 μs        Histogram: log(frequency) by time       108 μs <

 Memory estimate: 0 bytes, allocs estimate: 0.
```

So the performance ratio is roughly 100x (a mean of roughly 76 μs for AD vs roughly 0.68 μs for the primal). Note that this is not due to type-instabilities. One way to convince yourself of this is that no allocations are required to run AD, which would most certainly not be the case were there type instabilities. To see what's going on, take a look at the optimised IR for `foo!`:

```
2 1 ── %1 = Base.arraysize(_3, 1)::Int64 │╻╷╷╷╷ macro expansion
│ %2 = Base.slt_int(%1, 0)::Bool ││╻╷╷╷╷ eachindex
│ %3 = Core.ifelse(%2, 0, %1)::Int64 │││╻ axes1
│ %4 = %new(Base.OneTo{Int64}, %3)::Base.OneTo{Int64} ││││┃││││ axes
└─── goto #14 if not true │╻ macro expansion
2 ── %6 = Base.slt_int(0, %3)::Bool ││╻ <
└─── goto #12 if not %6 ││
3 ── nothing::Nothing │
4 ┄─ %9 = φ (#3 => 0, #11 => %27)::Int64 ││
│ %10 = Base.slt_int(%9, %3)::Bool ││╻ <
└─── goto #12 if not %10 ││
5 ── %12 = Base.add_int(%9, 1)::Int64 ││╻╷ simd_index
└─── goto #9 if not false │││╻ getindex
6 ── %14 = Base.slt_int(0, %12)::Bool ││││╻ >
│ %15 = Base.sle_int(%12, %3)::Bool ││││╻ <=
│ %16 = Base.and_int(%14, %15)::Bool ││││╻ &
└─── goto #8 if not %16 ││││
7 ── goto #9 │
8 ── invoke Base.throw_boundserror(%4::Base.OneTo{Int64}, %12::Int64)::Union{}
└─── unreachable ││││
9 ┄─ goto #10 │
10 ─ goto #11 │
11 ─ %23 = Base.arrayref(false, _2, %12)::Float64 ││╻╷ macro expansion
│ %24 = Base.arrayref(false, _3, %12)::Float64 │││┃ getindex
│ %25 = Base.add_float(%23, %24)::Float64 │││╻ +
│ Base.arrayset(false, _2, %25, %12)::Vector{Float64} │││╻ setindex!
│ %27 = Base.add_int(%9, 1)::Int64 ││╻ +
│ $(Expr(:loopinfo, Symbol("julia.simdloop"), nothing))::Nothing │╻ macro expansion
└─── goto #4 ││
12 ┄ goto #14 if not false ││
13 ─ nothing::Nothing │
5 14 ┄ return _2 │
```

The performance-critical chunk of the loop happens between `%23` and `%27`. Under AD, each of these statements gets replaced by a call to the corresponding rule; for example, the `arrayref` call at `%23` becomes something like:

```
%23_ = rrule!!(zero_fcodual(Base.arrayref), zero_fcodual(false), _2, %12)
%23 = %23_[1]
push!(%23_pb_stack, %23_[2])
```

In short, we run the rule, pull out the first element of the result, and push the pullback to the stack for use on the reverse-pass. So there is at least one really large and obvious source of overhead here: pushing to / popping from the stacks. If you take a look at the pullbacks for the `arrayref` and `arrayset` calls, you'll see that each of them closes over the tangent vector and the index at which it was applied.
This information is necessary for AD, but storing it afresh at every iteration is wasteful: much of what gets pushed onto the stacks is identical from one iteration to the next.
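To make the store-and-replay pattern concrete, here is a minimal, self-contained sketch of the stack discipline described above. This is not Tapir.jl's actual implementation; `sketch_ad!` and all names inside it are invented for illustration.

```julia
# Forwards-pass: do the primal work and push one pullback per iteration.
# Reverse-pass: pop the pullbacks in LIFO order and apply them.
function sketch_ad!(dx::Vector{Float64}, x::Vector{Float64}, dy::Vector{Float64})
    pb_stack = Vector{Function}()           # grows by push! at every iteration
    y = similar(x)
    for n in eachindex(x)
        y[n] = x[n]                         # stand-in for the primal computation
        push!(pb_stack, d -> (dx[n] += d))  # closure captures dx and n
    end
    for n in reverse(eachindex(x))
        pb = pop!(pb_stack)                 # every push/pop is per-iteration work
        pb(dy[n])
    end
    return y
end
```

Note that each closure here captures both `dx` and `n`, which is exactly the kind of per-iteration storage discussed in the rest of this issue.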
What's not obvious here, but is also important, is that the call to […]

Now, Tapir.jl is implemented in such a way that, if the pullback for a particular function is a singleton / doesn't carry around any information, the associated pullback stack is eliminated entirely. Moreover, just reducing the amount of memory stored at each iteration should reduce memory pressure. Consequently, a good strategy for making progress is to figure out how to reduce the amount of stuff which gets stored in the pullback stacks. The two points noted above provide obvious starting points.

### Making use of loop invariants

In short: amend the rule interface such that the arguments to the forwards-pass are also made available on the reverse-pass. For example, the current rule for `arrayref` looks something like the following:

```julia
function rrule!!(::CoDual{typeof(arrayref)}, inbounds::CoDual{Bool}, x::CoDual{Vector{Float64}}, ind::CoDual{Int})
    _ind = primal(ind)
    dx = tangent(x)
    function arrayref_pullback(dy)
        dx[_ind] += dy
        return NoRData(), NoRData(), NoRData(), NoRData()
    end
    return CoDual(primal(x)[_ind], tangent(x)[_ind]), arrayref_pullback
end
```

This skips some details, but the important point is that `arrayref_pullback` closes over both `_ind` and `dx`, so both must be stored on the pullback stack at every iteration.
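The singleton point above can be illustrated in plain Julia. This is only an illustration of the language-level property being relied on, not Tapir.jl's API; `capturing_pb` and `StatelessPullback` are made up.

```julia
# A closure that captures state has fields, so each instance must be stored;
# a pullback that captures nothing is a singleton type, and a stack of
# singletons carries no information, so it can be elided entirely.
dx, _ind = randn(8), 3
capturing_pb = dy -> (dx[_ind] += dy)   # captures dx and _ind

struct StatelessPullback end            # hypothetical argument-taking pullback
(::StatelessPullback)(dy, x_d, ind) = nothing

sizeof(typeof(capturing_pb))            # > 0: a pointer to dx plus the index
Base.issingletontype(StatelessPullback) # true: nothing to push or pop
```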
Under the new interface, this would look something like:

```julia
function rrule!!(::CoDual{typeof(arrayref)}, inbounds::CoDual{Bool}, x::CoDual{Vector{Float64}}, ind::CoDual{Int})
    function arrayref_pullback(dy, ::CoDual{typeof(arrayref)}, ::CoDual{Bool}, x::CoDual{Vector{Float64}}, ind::CoDual{Int})
        _ind = primal(ind)
        dx = tangent(x)
        dx[_ind] += dy
        return NoRData(), NoRData(), NoRData(), NoRData()
    end
    return CoDual(primal(x)[primal(ind)], tangent(x)[primal(ind)]), arrayref_pullback
end
```

In this version of the rule, `arrayref_pullback` captures nothing at all: everything it needs is passed back in as arguments on the reverse-pass, so the pullback is a singleton and its stack can be eliminated.

So this interface change frees up Tapir.jl to provide the arguments on the reverse-pass in whichever way it pleases. In this particular example, both `x` and `ind` should be cheap to provide: `x` is a loop invariant, so it only needs to be stored once for the entire loop, and `ind` is an induction variable which could in principle be recomputed rather than stored. It's impossible to know for sure how much of an effect this would have, but doing this alone would more than halve the memory requirement for this example.

### Induction Variable Analysis

I won't address how we could make use of induction variable analysis here, because I'm still trying to get my head around how best to go about it.
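That said, the basic flavour of the idea can be sketched without any analysis machinery. The sketch below (with the invented name `reverse_pass_indices`) shows the payoff being aimed at: an induction variable like the loop index never needs to be stored, because the reverse-pass can re-run the induction backwards.

```julia
# Instead of pushing the index n onto a stack at every iteration, the
# reverse-pass can recompute it by running the induction in reverse.
function reverse_pass_indices(N::Int)
    stored = Int[]                        # what a naive implementation stores
    for n in 1:N
        push!(stored, n)                  # O(N) memory
    end
    recomputed = [n for n in N:-1:1]      # same indices, no per-iteration storage
    return reverse(stored) == recomputed  # true
end
```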
Another obvious optimisation is to analyse the trip count and pre-allocate the (necessary) pullback stacks, in order to avoid branching during execution (i.e. checking that they're long enough to store the next pullback, and allocating more memory if not). This is related to induction variable analysis, so we'd probably want to do that first. Doing this kind of optimisation would enable vectorisation to happen more effectively in AD, as we could completely eliminate branching from a number of tight loops; see the sketch below.
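As a hedged sketch of what trip-count-based pre-allocation buys (hypothetical code, not Tapir.jl's implementation; `prealloc_forwards` is invented): once the trip count is known before the loop runs, the stack becomes a plain pre-sized array, and the hot loop contains no growth branches.

```julia
# With a known trip count, per-iteration pullback state can be written with a
# plain store instead of push!, removing capacity checks from the hot loop.
function prealloc_forwards(x::Vector{Float64})
    trip_count = length(x)                        # known up front
    pb_state = Vector{Float64}(undef, trip_count) # one slot per iteration
    s = 0.0
    @inbounds for n in 1:trip_count
        s += x[n]
        pb_state[n] = x[n]   # branch-free store; amenable to vectorisation
    end
    return s, pb_state
end
```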
Good investigations; it's probably okay to keep this issue open instead of transferring discussions here into docs.