
Commit 7587e6e

Document design of tier 2 engines (#640)

Authored by markshannon
Co-authored-by: Michael Droettboom <mdboom@gmail.com>
Co-authored-by: Jelle Zijlstra <jelle.zijlstra@gmail.com>

1 parent 0905dcc commit 7587e6e

2 files changed: +274 −1 lines changed

3.13/README.md

Lines changed: 4 additions & 1 deletion

@@ -39,9 +39,12 @@ The workplan is roughly as follows:
 Our goal for 3.13 is to reduce the time spent in the interpreter by at least 50%.
-[Detailed plan](https://github.com/faster-cpython/ideas/issues/587).
+[Issue](https://github.com/faster-cpython/ideas/issues/587).
+[Execution engine](./engine.md).
 [Detailed plan for copy-and-patch](https://github.com/faster-cpython/ideas/issues/588).

 ### Enabling subinterpreters from Python

 Unlike the other tasks, which are mainly focused on single-threaded performance, this work builds on the per-interpreter GIL work that shipped in Python 3.12 to allow Python programmers to take advantage of better parallelism in subinterpreters from Python code (without the need to write a C extension).

3.13/engine.md

Lines changed: 270 additions & 0 deletions
# Tier 2 execution engine

Author: Mark Shannon

The [plan for 3.13](./README.md) describes the main components that we will produce for 3.13. This document explains how the bits fit together at runtime.

## Overview

The design of the tier 2 optimizer is based on "superblocks".
These superblocks are a specification of execution,
rather than something that is executed.
We expect to have two different execution engines for superblocks:
* The tier 2 interpreter
* The copy-and-patch compiler

We need to manage entry to and exits from superblocks, so that
as programs execute, a graph of superblocks will be built up.

While it is critical for performance that execution within
a superblock is fast, the performance of jumping from one
superblock to another is also important.

Memory consumption is also important.

### Either interpreter or compiler. Not both.

We assume that there will only be one execution engine.
If we have a JIT, we will use it to execute *all* superblocks.
This simplifies things, as we do not need to concern ourselves with
transfers from the tier 2 interpreter to JIT-compiled code, or vice versa.
### Superblocks and executors

A superblock is a sequence of micro-ops with minimal control flow.
It is the input and output of tier 2 optimization phases.
An executor is the runtime object that is the executable
representation of a superblock.
#### Creating executors

Creating executors from superblocks is the first job of the execution engine.
This transformation is currently trivial for the tier 2 interpreter, as we
use the same format for optimization and execution. This is inefficient, so
we will probably change the executable format.
For the JIT, we will use copy-and-patch to convert the micro-ops to machine code.
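As a rough illustration, here is a toy Python model of the superblock/executor split (all names here are hypothetical, not CPython APIs): the superblock is passive data produced by the optimizer, and the executor is the runtime object built from it.

```python
# Toy model (hypothetical names): a superblock is a specification of
# execution; an executor is the runtime object built from it.
from dataclasses import dataclass

@dataclass(frozen=True)
class MicroOp:
    opcode: str
    operand: int = 0

@dataclass(frozen=True)
class Superblock:
    """A micro-op sequence with minimal control flow."""
    ops: tuple

class Executor:
    """The executable runtime representation of a superblock."""
    def __init__(self, superblock):
        # The tier 2 interpreter currently reuses the optimizer's
        # format, so "creation" is little more than a copy.
        self.code = list(superblock.ops)

def make_executor(sb):
    return Executor(sb)

sb = Superblock((MicroOp("_LOAD_FAST", 0), MicroOp("_POP_TOP")))
ex = make_executor(sb)
assert len(ex.code) == 2
```

For the JIT, `make_executor` would instead emit machine code via copy-and-patch; the separation between specification and executable object stays the same.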
### Exits from executors

On creation of an executor, there will be a number of potential exits
from that executor. For each entry to an executor, exactly one exit
will occur. This means that a few of the exits will be hot, but most will be cold.

We want hot exits to be implemented such that transfer to the next executor is fast.
However, we want cold exits to consume as little memory as possible.
Unfortunately, we cannot know in advance which are which, although there are
some exits we expect to be very rare. We can handle these very rare exits
differently, to save space.
### Linking executors

When an exit from an executor gets hot, we need to create a new executor
to attach to that exit. Once we have done that, we need to link the exit to
the new executor in a way that allows fast transfer of execution.
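The linking step can be sketched with a toy Python model (hypothetical names, not the real runtime structures): once an exit gets hot, we build an executor for its target and patch the exit so later transfers bypass tier 1.

```python
# Hypothetical sketch of linking: patch a hot exit to point directly at
# the newly created executor for its target.
class Executor:
    def __init__(self, name):
        self.name = name
        self.exits = {}  # exit index -> linked Executor, or None if cold

def link_exit(source, exit_index, new_executor):
    # After this, taking the exit transfers straight to new_executor
    # instead of falling back to tier 1.
    source.exits[exit_index] = new_executor

a = Executor("A")
a.exits[3] = None          # a cold exit: nothing attached yet
b = Executor("B")          # built because exit 3 got hot
link_exit(a, 3, b)
assert a.exits[3] is b
```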
### Making progress

If an executor exits before it does any work *and* it exits to another
executor that can also exit before it does any work, or it exits to itself,
we can find ourselves in an infinite loop.

There are two possible approaches to avoiding this:
1. We track which executors might not make
progress and are careful about which executors are linked together, or
2. We require all executors to make progress.

Approach 1 has many edge cases, which makes it hard to reason about and get correct.

Approach 2 is simpler, but may be a bit slower, as we may need to [pessimize the first instruction](#guaranteeing-progress-within-an-executor).

We will use approach 2. It is much easier to reason about and should be almost as fast.

If all executors are guaranteed to make progress, then transfers from one executor
to another can be implemented by a single jump/tail-call.
### Inter-superblock optimization

Some superblocks can be quite short but form a larger region of hot code that
we want to optimize. In order to do that, we want to propagate type information
and representation changes across edges, which means storing that information
on exits while they are cold. Since most exits will remain cold, we need to
store this information in a compact form.
### Making "hot" exits fast and "cold" exits small

In order to make hot exits fast, we need them to be as simple as possible,
passing as little information as possible.

We also want to make cold exits as small as possible, but cold exits
may become hot exits, so we need them both to use the same interface.
## The implementation

This section describes one possible implementation. The initial implementation
will not support inter-superblock optimization, but we should plan to support it
in the future.
### Making "hot" exits fast

In order to make hot exits fast, we want to implement them as a single, unconditional jump,
with the minimum bookkeeping to maintain refcounts.
In the tier 2 interpreter, the jump will be implemented by setting the IP to the first
instruction of the target executor.
In the JIT, the jump will be implemented as a tail call to the function pointer of the target executor.
### Making "cold" exits small

All side exits from an executor will start cold, and most of them will remain cold.

We need to track various pieces of information for cold exits, none of which will be
required once the exit becomes hot, so we want to store that information in a way
that minimizes the cost to hot exits and minimizes the memory used for cold exits.
That information is:
* The offset/location of the pointer to the exit in the executor, so it can be updated.
* The target (the offset into the code object of the tier 1 instruction).
* Any relevant known type information (optional, but it will improve optimization).
* Any representation changes that have been made.
* The "hotness" counter.
### Minimizing memory use

There are a number of ways we can reduce memory use, for example:
* Putting things in (ideally const) arrays, so we can refer to them by a one- or two-byte index
instead of an eight-byte pointer.
* For complex data like type information and representation changes, using trees or deltas,
so that information like `(A, B, C)` can be stored as `(A, B) <- C`, allowing `(A, B)` to
be shared.
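The delta idea can be illustrated with a small Python sketch (a hypothetical helper, not part of the design): interning `(parent, item)` pairs lets `(A, B, C)` and `(A, B, D)` share their `(A, B)` prefix.

```python
# Sketch of delta/tree sharing: store (A, B, C) as ((A, B), C) so the
# (A, B) prefix node is a single shared object.
def intern_chain(items, cache):
    """Build a chain of (parent, item) nodes, reusing cached nodes."""
    node = None
    for item in items:
        key = (id(node), item)
        if key not in cache:
            cache[key] = (node, item)
        node = cache[key]
    return node

cache = {}
abc = intern_chain(("A", "B", "C"), cache)
abd = intern_chain(("A", "B", "D"), cache)
# Both chains share one (A, B) prefix node:
assert abc[0] is abd[0]
```

The same structure works for representation changes: each exit stores only a pointer to (or index of) its leaf node, and common prefixes cost nothing extra.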
### Each executor gets a table of exit data

Each executor will get a table of exit data. We can compute the size of this table
when creating the executor. The entries in this table are described below.
### Fixed number of exit objects

Since the vast majority of exits will not need to be modified (only those that get hot),
we do not want to pass the offset of the exit, so we need to store the offset in the executor.

Since there is a fixed number of micro-ops allowed in a superblock (currently 512), we have an upper
bound on the offset. We will preallocate one exit object per possible offset.
### Exit data

* The offset/location of the exit pointer: not needed if the exit pointer is stored in the exit data.
* The target: has a maximum value of about 10**9, so store it as a `uint32_t`.
* Any relevant known type information: format TBD.
We don't need to decide on the format of this data for now.
* Any representation changes that have been made: likewise, format TBD.
* The "hotness" counter: see below.
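A minimal Python sketch of such a record (the field names are illustrative; the real layout would be a C struct):

```python
# Hypothetical per-exit record matching the list above.
from dataclasses import dataclass

@dataclass
class ExitData:
    target: int                   # tier 1 offset; fits in a uint32_t
    counter: int = 0              # the "hotness" counter
    type_info: object = None      # known type information; format TBD
    repr_changes: object = None   # representation changes to undo; format TBD

    def __post_init__(self):
        # The target must fit in 32 bits.
        assert 0 <= self.target < 2**32

e = ExitData(target=1234)
assert e.target == 1234 and e.counter == 0
```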
#### Representation changes

Representation changes are things like storing the top value(s) of the stack in registers, or performing scalar replacement on objects or frames.
In order to drop back to tier 1, we need to be able to undo those changes, so we need to record them.
#### Hotness counters

We need to track how hot an exit is. Ideally, we want to be able to distinguish
between an exit that is hot, and one that is cold but has exited a large number of
times over prolonged execution.

There are three ways we can store counters:
1. Store a counter for each exit in the superblock.
2. Have one exit object per possible value of the counter, and change the exit object to change the counter.
3. Store the counters in a global (per-interpreter) table. LuaJIT does something like this (but with a very small table).

Option 1 is simple and deterministic.
Option 2 either prevents mapping the exits to offsets, or will require many thousands
of exit objects (number_of_offsets * max_counter_value).
Option 3 will require hashing the executor and offset, and will thus be non-deterministic due to collisions.
It allows us to decay the counters to distinguish between hot counters and long-lived "cool" counters,
by halving all the counters every tick, where a tick might be 1ms, or might vary depending on memory consumption.

We should probably start with option 1, with the intention of
implementing option 3 later.
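A toy Python model of option 1, combined with the halving decay described under option 3 (the threshold value and function names are assumptions, not the real runtime API):

```python
# Toy model: a counter per exit (option 1), plus the halving "decay"
# from option 3 that keeps long-lived cool exits below the threshold.
HOT_THRESHOLD = 16  # hypothetical value

def record_exit(counters, exit_index):
    """Bump the counter; return True when the exit has become hot."""
    counters[exit_index] += 1
    return counters[exit_index] >= HOT_THRESHOLD

def tick(counters):
    # Halve every counter each tick (e.g. every 1ms) so that only exits
    # taken frequently *right now* stay above the threshold.
    for i in range(len(counters)):
        counters[i] //= 2

counters = [0] * 8
for _ in range(20):
    hot = record_exit(counters, 3)
assert hot and counters[3] == 20
tick(counters)
assert counters[3] == 10
```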
### `EXIT_IF` and `UNLIKELY_EXIT_IF`

We will replace the misnamed `DEOPT_IF` (it doesn't de-optimize, it exits)
with `EXIT_IF` and `UNLIKELY_EXIT_IF`.
`EXIT_IF` will be used for most exits.
`UNLIKELY_EXIT_IF` will be used for unlikely exits, like eval-breaker checks or stack overflow.

`EXIT_IF` will exit through the pointer, as described above.
`UNLIKELY_EXIT_IF` will exit to tier 1.

For efficiency reasons, all exits in the same micro-op will use the same exit object and data.
For simplicity, we will also require that no micro-op contains both
an `EXIT_IF` and an `UNLIKELY_EXIT_IF`.

Since `UNLIKELY_EXIT_IF` has no attached exit/executor, it is compact;
it only needs two bytes to store the target.
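The intended difference between the two macros can be modeled in a few lines of Python (hypothetical function names standing in for the C macros):

```python
# Toy model: EXIT_IF transfers through an attached exit object (which may
# be linked to another executor); UNLIKELY_EXIT_IF always drops to tier 1
# and only needs a two-byte target.
def exit_if(condition, exit_object):
    if condition:
        return ("goto_exit", exit_object)  # may jump to a linked executor
    return None  # fall through: keep executing this executor

def unlikely_exit_if(condition, target):
    assert 0 <= target < 2**16             # target fits in two bytes
    if condition:
        return ("deopt_to_tier1", target)  # always back to tier 1
    return None

assert exit_if(True, "exit#3") == ("goto_exit", "exit#3")
assert unlikely_exit_if(True, 100) == ("deopt_to_tier1", 100)
assert unlikely_exit_if(False, 100) is None
```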
### Guaranteeing progress

We need to guarantee progress within an executor and when exiting executors.

#### Guaranteeing progress within an executor

In order to guarantee progress, the superblock cannot `EXIT_IF` without having made progress,
but it can do an `UNLIKELY_EXIT_IF`, since `UNLIKELY_EXIT_IF` always drops to tier 1.

This means that the micro-ops for the first tier 1 instruction in an executor
cannot contain an `EXIT_IF`. In practice, this means that before translating
to micro-ops, the first instruction must be converted to its unspecialized form.
### Exiting to invalid executors

We want `EXIT_IF` to be efficient, so it needs to enter the next executor unconditionally.
This means that when we invalidate an executor, we need to modify that executor so that
it immediately drops into tier 1 upon being executed. Since tier 1 will never enter an invalid
executor, we are thus guaranteed not to execute an invalid executor.
In the tier 2 interpreter, we change the first instruction to `EXIT_TRACE`.
In the JIT, we will need to change the function pointer to point to a function that returns
to tier 1.
### The mechanics of transferring execution between executors

When transferring control, we need to:
1. Incref the new executor.
2. Set `current_executor` to the new executor.
3. Decref the old executor.

In the future, when we have deferred references, we can make the current executor a deferred reference
and skip the incref/decref.
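The three steps above can be sketched in Python (a toy refcount model, not the real `Py_INCREF`/`Py_DECREF` machinery):

```python
# Toy model of the transfer steps. Ordering matters: the old executor
# must stay alive until we have switched current_executor away from it.
class Executor:
    def __init__(self):
        self.refcount = 1

state = {"current_executor": None}

def transfer(old, new):
    new.refcount += 1                  # 1. incref the new executor
    state["current_executor"] = new    # 2. set current_executor
    old.refcount -= 1                  # 3. decref the old executor

a, b = Executor(), Executor()
state["current_executor"] = a
a.refcount += 1                        # current_executor holds a reference
transfer(a, b)
assert state["current_executor"] is b
assert (a.refcount, b.refcount) == (1, 2)
```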
#### JIT compiler

To transfer control between executors, we make a jump, implemented as an indirect
tail-call in the generated stencil.

The generated code must decref the old executor. Since we cannot decref the old executor while still
executing its code, we must pass the old executor as an argument in the tail call.
#### Interpreter

We can do the three steps (incref, update current executor, decref) before entering the new executor, since then we don't need to worry about freeing code that we are running.
Entering the new executor is simple: set the
instruction pointer to the first instruction of the new executor.
## Future optimizations

We plan to leave inter-executor optimizations for the future, in order to get a working
implementation with a JIT compiler ready in good time for 3.13.

### Specialization across executors

For this we will need to record known type information at exits, to avoid redundant checks
and to allow us to create multiple specialized executors for the same tier 1 instructions.

### Representation changes across executors

By tracking representation changes across executors, we can avoid the overhead of restoring
the canonical representation on exits.
For example, if a value is represented by an unboxed float, it is expensive to box it and then unbox it again
across an exit.
This optimization has the potential to be a significant performance win
*and* to consume a lot of memory, so we need to design our data structures carefully.
