Description
I want to separate out the high-level discussion from that PR into this issue. The PR comment chain should henceforth mostly be for discussion of the implementation. My reasoning is that seeing high-level discussion mixed in with commits makes things hard to follow.
I want to summarise my thoughts below, in response to @markshannon's comment at python/cpython#114059 (comment).
Addressing the current concerns
Performance
For some reason that I have not yet determined, the run in that PR has the optimizer failing more often than expected. The latest run suggests a 1% speedup, with numeric code benefiting the most. My own local benchmarks suggest a 0-4% speedup on the default pyperformance suite (the benchmark runner there is a little different because it also pulls in the pyston benchmarks).
A lack of generality / Abstract Interpreter DSL plans
There are concerns about whether the abstract interpreter generated in that PR is too specific to a single use case. For comparison, Mark wrote a proof-of-concept (but not yet working) abstract interpreter DSL on his branch.
The future plan is to have a DSL similar to Mark's for expressing rewrite rules and peephole optimizations over uop sequences. It is elegant and, IMO, a very clean way of expressing such rules.
My current plan, however, is to keep the current abstract interpreter in the PR as-is, and then augment its cases with the DSL in the future. I have a very simple reason: the current abstract interpreter represents what can be inferred automatically from the bytecode definitions (apart from the type annotations). It reduces the amount of DSL we have to maintain. It does not make sense to write DSL rules for constant evaluation when something like that can be inferred automatically (by definition of a pure op). The vast majority of bytecodes follow simple rules that do not need an abstract interpreter DSL to define.
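To make this concrete, below is a rough sketch of the constant evaluation that falls out of an op being pure. All names are hypothetical and for illustration only; this is not the PR's generated code:

```python
from dataclasses import dataclass

# Hypothetical sketch: if a uop is declared pure and all of its inputs
# are known constants, the abstract interpreter can simply evaluate it
# at optimization time. No hand-written DSL rule is needed; this
# follows from purity alone.

@dataclass
class AbstractValue:
    typ: type | None = None    # inferred type, if known
    const: object = None       # constant value, if known
    is_const: bool = False

def abstract_add(stack: list[AbstractValue]) -> None:
    rhs, lhs = stack.pop(), stack.pop()
    if lhs.is_const and rhs.is_const:
        stack.append(AbstractValue(int, lhs.const + rhs.const, True))
    else:
        stack.append(AbstractValue(typ=int))  # type known, value unknown

stack = [AbstractValue(int, 2, True), AbstractValue(int, 3, True)]
abstract_add(stack)
assert stack[-1].const == 5
```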
That said, I definitely want to adopt Mark's approach when it comes to optimizing the bytecode. In short, I want to combine both approaches: maximize what can be inferred automatically, and leave everything else to DSL rules for optimization.
This also addresses the concerns about extensibility.
What does the PR actually offer?
The following assumes we don't merge the current DSL's abstract interpreter generator.
Data structures
The core thing is that it sets up the data structures required for an abstract interpreter. I cannot overstate the importance of this. It is subtle, but the abstract interpreter data structure is itself a reduced version of the full optimizer I originally wrote. With the benefit of hindsight (and three rewrites :), I made various design decisions that make optimization simpler. For example, its state already models inlined calls: it uses a single contiguous stack shared by the separate frames' localsplus regions, so that computing the location of inlined locals is simple. A lot of experimentation has gone into this infrastructure (and admittedly less into the DSL, which, as mentioned above, has much room for improvement).
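As a toy illustration of that layout (all names here are mine, not the PR's):

```python
from dataclasses import dataclass, field

# Toy model of the single contiguous stack: each abstract frame is just
# an offset/length window into one shared array, so an inlined callee's
# localsplus is reached by plain offset arithmetic rather than through
# a separately allocated frame object.

@dataclass
class AbstractFrame:
    locals_start: int    # offset of this frame's localsplus in the array
    n_localsplus: int

@dataclass
class AbstractInterpState:
    values: list = field(default_factory=list)   # one contiguous region
    frames: list[AbstractFrame] = field(default_factory=list)

    def push_frame(self, n_localsplus: int) -> AbstractFrame:
        frame = AbstractFrame(len(self.values), n_localsplus)
        self.values.extend([None] * n_localsplus)
        self.frames.append(frame)
        return frame

    def local(self, frame: AbstractFrame, i: int):
        return self.values[frame.locals_start + i]

state = AbstractInterpState()
caller = state.push_frame(3)   # caller's localsplus
callee = state.push_frame(2)   # inlined callee lives right after it
state.values[callee.locals_start] = "arg0"
assert state.local(callee, 0) == "arg0"
```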
Type propagation
The next thing it offers is an effective type analyzer. On the latest branch, the number of _GUARD_BOTH_INT uops executed is cut by over 70%. Mark has said it's not an instruction we should focus optimization efforts on, and I fully agree; I am purely using its statistics as a proxy for how effective our type propagation is.
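Schematically, guard elimination looks something like the sketch below. It is heavily simplified and the names are hypothetical; real uops carry much more state than this:

```python
# Hypothetical sketch: track which locals have been proven int, and
# drop any _GUARD_BOTH_INT whose operands are already proven.

def propagate_types(trace):
    int_locals = set()   # indices of locals proven to be int
    stack = []           # per-slot provenance: a local index, or None
    out = []
    for uop, oparg in trace:
        if uop == "_LOAD_FAST":
            stack.append(oparg)
        elif uop == "_GUARD_BOTH_INT":
            a, b = stack[-2], stack[-1]
            if a in int_locals and b in int_locals:
                continue                    # already proven: guard is dead
            int_locals.update(x for x in (a, b) if x is not None)
        elif uop == "_BINARY_OP_ADD_INT":
            stack[-2:] = [None]             # result is an int, not a local
        out.append((uop, oparg))
    return out

segment = [("_LOAD_FAST", 0), ("_LOAD_FAST", 1),
           ("_GUARD_BOTH_INT", 0), ("_BINARY_OP_ADD_INT", 0)]
# The second occurrence of the guard is dropped:
print(propagate_types(segment + segment))
```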
Poor man's loop invariant code motion for guards
I originally called this loop peeling, but that gave the wrong idea. The main reason I repeat the body of a loop a second time is that during the first pass, the types are not yet known to the type analyzer. Consider this trace:
```
_ITER_NEXT_RANGE
_LOAD_FAST
_LOAD_FAST
_GUARD_BOTH_INT
_BINARY_OP_ADD_INT
_JUMP_TO_TOP
```
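(For reference, a trace of this shape comes from something like the loop below. This is purely illustrative; the exact uops depend on how the bytecode is specialized.)

```python
# Illustrative only: a counting loop whose body specializes to the
# int-add uops in the trace above.
total = 0
for i in range(10):   # _ITER_NEXT_RANGE advances the range iterator
    total += i        # _LOAD_FAST x2, _GUARD_BOTH_INT, _BINARY_OP_ADD_INT
```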
Running the type propagator one more time and duplicating the body gives this code:
```
_ITER_NEXT_RANGE
_LOAD_FAST
_LOAD_FAST
_GUARD_BOTH_INT
_BINARY_OP_ADD_INT
_JUMP_ABSOLUTE_HEADER

# New loop body
_ITER_NEXT_RANGE
_LOAD_FAST
_LOAD_FAST
_BINARY_OP_ADD_INT
_JUMP_ABSOLUTE (to after _JUMP_ABSOLUTE_HEADER)
```
Note: the main loop is now completely guard-free! This is the benefit of "peeling" the loop; it's a poor man's loop-invariant code motion for guards.
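In sketch form, the transformation looks like the snippet below. Which guards become redundant is what the type propagator's second pass over the body determines; in this hypothetical, simplified version they are passed in by hand:

```python
# Hypothetical sketch of the peeling transformation: emit the guarded
# body once as a header, then a copy with the now-redundant guards
# removed as the actual loop body.

def peel(body, redundant_guards):
    header = body[:-1]                        # drop the _JUMP_TO_TOP
    peeled = [u for u in header if u not in redundant_guards]
    return (header + ["_JUMP_ABSOLUTE_HEADER"]
            + peeled + ["_JUMP_ABSOLUTE"])    # jumps back past the header

body = ["_ITER_NEXT_RANGE", "_LOAD_FAST", "_LOAD_FAST",
        "_GUARD_BOTH_INT", "_BINARY_OP_ADD_INT", "_JUMP_TO_TOP"]
for uop in peel(body, {"_GUARD_BOTH_INT"}):
    print(uop)
```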
This doubled the proportion of float and int guards we eliminate at runtime: from roughly 15% to roughly 30% for floats, and from roughly 35% to roughly 70% for ints.