Per our discussions, here's a simple hand-derived example in which handling loop-invariants + induction variables correctly should be enough to get us good performance.

Consider the following simple implementation of sum:

function _sum(f::F, x::AbstractArray{<:Real}) where {F}
    y = 0.0
    n = 0
    while n < length(x)
        n += 1
        y += f(x[n])
    end
    return y
end

On Julia LTS, it has the following IRCode:

julia> Base.code_ircode_by_type(Tuple{typeof(_sum), typeof(identity), Vector{Float64}})
1-element Vector{Any}:
   1 ─      nothing::Nothing                                      │ 
7  2 ┄ %2 = φ (#1 => 0, #3 => %7)::Int64                          │ 
   │   %3 = φ (#1 => 0.0, #3 => %9)::Float64                      │ 
   │   %4 = Base.arraylen(_3)::Int64                              │╻ length
   │   %5 = Base.slt_int(%2, %4)::Bool                            │╻ <
   └──      goto #4 if not %5                                     │ 
8  3 ─ %7 = Base.add_int(%2, 1)::Int64                            │╻ +
9  │   %8 = Base.arrayref(true, _3, %7)::Float64                  │╻ getindex
   │   %9 = Base.add_float(%3, %8)::Float64                       │╻ +
10 └──      goto #2                                               │ 
11 4 ─      return %3                                             │ 
    => Float64

It cycles between blocks 2 and 3 to perform the loop, exiting via block 4. In block 2, SSA values %4 and %5, and the goto if not that follows them determine whether to continue looping. In block 3, %7 increments the induction variable, %8 pulls a value out of the array, %9 adds it to the current running total, and then we return to block 2.

To understand what book-keeping is happening, it is enough to understand the forwards-pass IR generated by Mooncake. Our goal will be to eliminate the need for various book-keeping data structures in the forwards-pass -- once they're gone from here, they'll disappear from the reverse-pass also.

The fowards-pass IR generated by Mooncake is

5 1 ─ %1  = (Mooncake.get_shared_data_field)(_1, 1)::Mooncake.Stack{Int32}                                                               │
  │   %2  = (Mooncake.get_shared_data_field)(_1, 2)::Base.RefValue{Tuple{Mooncake.LazyZeroRData{typeof(_sum), Nothing}, Mooncake.LazyZeroRData{typeof(identity), Nothing}, Mooncake.LazyZeroRData{Vector{Float64}, Nothing}}}
  │   %3  = (Mooncake.get_shared_data_field)(_1, 3)::Mooncake.Stack{Tuple{Mooncake.var"#arrayref_pullback!!#635"{1, Vector{Float64}, Int64}}}
  └──       (Mooncake.__assemble_lazy_zero_rdata)(%2, _2, _3, _4)::Core.Const((Mooncake.LazyZeroRData{typeof(_sum), Nothing}(nothing), Mooncake.LazyZeroRData{typeof(identity), Nothing}(nothing), Mooncake.LazyZeroRData{Vector{Float64}, Nothing}(nothing)))
  2 ─       (Mooncake.__push_blk_stack!)(%1, 11)::Core.Const(nothing)                                                                    │
  3 ┄ %6  = φ (#2 => Mooncake.CoDual{Int64, NoFData}(0, NoFData()), #4 => %19)::Mooncake.CoDual{Int64, NoFData}                          │
  │   %7  = φ (#2 => Mooncake.CoDual{Float64, NoFData}(0.0, NoFData()), #4 => %28)::Mooncake.CoDual{Float64, NoFData}                    │
  │   %8  = (identity)(_4)::Mooncake.CoDual{Vector{Float64}, Vector{Float64}}                                                            │
  │   %9  = (Mooncake.rrule!!)($(QuoteNode(Mooncake.CoDual{typeof(Mooncake.IntrinsicsWrappers.arraylen), NoFData}(Mooncake.IntrinsicsWrappers.arraylen, NoFData()))), %8)::Tuple{Mooncake.CoDual{Int64, NoFData}, Mooncake.NoPullback{Tuple{Mooncake.LazyZeroRData{typeof(Mooncake.IntrinsicsWrappers.arraylen), Nothing}, Mooncake.LazyZeroRData{Vector{Float64}, Nothing}}}}
  │   %10 = (getfield)(%9, 1)::Mooncake.CoDual{Int64, NoFData}                                                                           │
  │   %11 = (identity)(%6)::Mooncake.CoDual{Int64, NoFData}                                                                              │
  │   %12 = (identity)(%10)::Mooncake.CoDual{Int64, NoFData}                                                                             │
  │   %13 = (Mooncake.rrule!!)($(QuoteNode(Mooncake.CoDual{typeof(Mooncake.IntrinsicsWrappers.slt_int), NoFData}(Mooncake.IntrinsicsWrappers.slt_int, NoFData()))), %11, %12)::Tuple{Mooncake.CoDual{Bool, NoFData}, Mooncake.NoPullback{Tuple{Mooncake.LazyZeroRData{typeof(Mooncake.IntrinsicsWrappers.slt_int), Nothing}, Mooncake.LazyZeroRData{Int64, Nothing}, Mooncake.LazyZeroRData{Int64, Nothing}}}}
  │   %14 = (getfield)(%13, 1)::Mooncake.CoDual{Bool, NoFData}                                                                           │
  │   %15 = (Mooncake.primal)(%14)::Bool                                                                                                 │
  └──       goto #5 if not %15                                                                                                           │
  4 ─ %17 = (identity)(%6)::Mooncake.CoDual{Int64, NoFData}                                                                              │
  │   %18 = (Mooncake.rrule!!)($(QuoteNode(Mooncake.CoDual{typeof(Mooncake.IntrinsicsWrappers.add_int), NoFData}(Mooncake.IntrinsicsWrappers.add_int, NoFData()))), %17, $(QuoteNode(Mooncake.CoDual{Int64, NoFData}(1, NoFData()))))::Tuple{Mooncake.CoDual{Int64, NoFData}, Mooncake.NoPullback{Tuple{Mooncake.LazyZeroRData{typeof(Mooncake.IntrinsicsWrappers.add_int), Nothing}, Mooncake.LazyZeroRData{Int64, Nothing}, Mooncake.LazyZeroRData{Int64, Nothing}}}}
  │   %19 = (getfield)(%18, 1)::Mooncake.CoDual{Int64, NoFData}                                                                          │
  │   %20 = (identity)(_4)::Mooncake.CoDual{Vector{Float64}, Vector{Float64}}                                                            │
  │   %21 = (identity)(%19)::Mooncake.CoDual{Int64, NoFData}                                                                             │
  │   %22 = (Mooncake.rrule!!)($(QuoteNode(Mooncake.CoDual{typeof(Core.arrayref), NoFData}(Core.arrayref, NoFData()))), $(QuoteNode(Mooncake.CoDual{Bool, NoFData}(true, NoFData()))), %20, %21)::Tuple{Mooncake.CoDual{Float64, NoFData}, Mooncake.var"#arrayref_pullback!!#635"{1, Vector{Float64}, Int64}}
  │   %23 = (getfield)(%22, 1)::Mooncake.CoDual{Float64, NoFData}                                                                        │
  │   %24 = (getfield)(%22, 2)::Mooncake.var"#arrayref_pullback!!#635"{1, Vector{Float64}, Int64}                                        │
  │   %25 = (identity)(%7)::Mooncake.CoDual{Float64, NoFData}                                                                            │
  │   %26 = (identity)(%23)::Mooncake.CoDual{Float64, NoFData}                                                                           │
  │   %27 = (Mooncake.rrule!!)($(QuoteNode(Mooncake.CoDual{typeof(Mooncake.IntrinsicsWrappers.add_float), NoFData}(Mooncake.IntrinsicsWrappers.add_float, NoFData()))), %25, %26)::Tuple{Mooncake.CoDual{Float64, NoFData}, Mooncake.IntrinsicsWrappers.var"#add_float_pb!!#2"}
  │   %28 = (getfield)(%27, 1)::Mooncake.CoDual{Float64, NoFData}                                                                        │
  │   %29 = (tuple)(%24)::Tuple{Mooncake.var"#arrayref_pullback!!#635"{1, Vector{Float64}, Int64}}                                       │
  │         (push!)(%3, %29)::Core.Const(nothing)                                                                                        │
  │         (Mooncake.__push_blk_stack!)(%1, 13)::Core.Const(nothing)                                                                    │
  └──       goto #3                                                                                                                      │
  5 ─       return %7

(before inlining). To orient yourself, observe that

for each basic block in the original code, there is one basic block in the forwards-pass, with a single additional block inserted at the top to handle some book-keeping. i.e. the counterpart of block 1 in the primal IR is block 2 in the forwards-pass IR.
each phi node in the primal has a corresponding phi node in the forwads-pass. e.g. %2 -> %6 and %3 - %7.
each goto node and goto-if-not node in the primal has a corresponding node in the forwards-pass.
each call / invoke expression in the primal corresponds to a collection of lines in the forwards-pass. There should be one call to rrule!! for each call to a "primitive" function (add_int, slt_int, etc).

The thing that we're interested in optimising away is the call to push! at the end of block #4. If you follow the chain of SSA values back up, you'll see that the the thing we push onto it is a Tuple containing only the pullback returned by the call to rrule!! for Core.arrayref (i.e. the low-level implementation of getindexforArray`s).

To understand what needs to happen here, observe that the type of the pullback for arrayref is Mooncake.var"#arrayref_pullback!!#635"{1, Vector{Float64}, Int64} -- it contains a Vector{Float64} and an Int64. If you inspect the implementation you will see that the Vector{Float64} is the tangent vector and the Int64 is element of the primal vector that we grab on the forwards-pass. This means that in order to make things more efficient, we need to find a different way to provide these quantities on the reverse-pass.

Happily, this is quite straightforward in principle. We can see in the primal code that the array being passed in to arrayref is_3 i.e. the third positional argument to the function. This array is definitely a loop invariant, so we need not store it at each iteration. Similarly, %7 is just the result of adding 1 to an induction variable -- this is also straightforward to handle once we've performed induction variable analysis (any loop-invariant affine transformation of an induction variable is straightforward to handle).

Once we've made it so that arrayref_pullback!! doesn't have to store these quantities for the reverse-pass itself, it can become a singleton. This will cause the complete removal of the stack associated to line %29, and deal with roughly half of all of the overhead in this function.

The remainder of the overhead is a discussion for later, but I believe that the majority of it just requires induction variable analysis.

Document Loop Optimisation Opportunities #156

Description

Metadata

Assignees

Labels

Type

Projects

Milestone

Relationships

Development

Issue actions