
Commit 7587e6e

Document design of tier 2 engines (#640)

Authored by markshannon
Co-authored-by: Michael Droettboom <mdboom@gmail.com>
Co-authored-by: Jelle Zijlstra <jelle.zijlstra@gmail.com>

1 parent 0905dcc commit 7587e6e

2 files changed: +274 −1 lines changed

3.13/README.md

Lines changed: 4 additions & 1 deletion

@@ -39,9 +39,12 @@ The workplan is roughly as follows:
 Our goal for 3.13 is to reduce the time spent in the interpreter by at least 50%.
-[Detailed plan](https://github.com/faster-cpython/ideas/issues/587).
+[Issue](https://github.com/faster-cpython/ideas/issues/587).
+[Execution engine](./engine.md).
 [Detailed plan for copy-and-patch](https://github.com/faster-cpython/ideas/issues/588).

 ### Enabling subinterpreters from Python

 Unlike the other tasks, which are mainly focused on single-threaded performance, this work builds on the per-interpreter GIL work that shipped in Python 3.12 to allow Python programmers to take advantage of better parallelism in subinterpreters from Python code (without the need to write a C extension).

3.13/engine.md

Lines changed: 270 additions & 0 deletions
# Tier 2 execution engine

Author: Mark Shannon

The [plan for 3.13](./README.md) describes the main components that we will produce for 3.13. This document explains how the bits fit together at runtime.

## Overview

The design of the tier 2 optimizer is based on "superblocks".
These superblocks are a specification of execution,
rather than something that is executed.
We expect to have two different execution engines for superblocks:
* The tier 2 interpreter
* The copy-and-patch compiler

We need to manage entry to and exits from superblocks, so that
as programs execute, a graph of superblocks will be built up.

While it is critical for performance that execution within
a superblock is fast, the performance of jumping from one
superblock to another is also important.

Memory consumption is also important.

### Either interpreter or compiler. Not both.

We assume that there will only be one execution engine.
If we have a JIT, we will use it to execute *all* superblocks.
This simplifies things, as we do not need to concern ourselves with
transfers from the tier 2 interpreter to JIT-compiled code, or vice versa.
### Superblocks and executors

A superblock is a sequence of micro-ops with minimal control flow.
It is the input and output of tier 2 optimization phases.
An executor is the runtime object that is the executable
representation of a superblock.
#### Creating executors

Creating executors from superblocks is the first job of the execution engine.
This transformation is currently trivial for the tier 2 interpreter, as we
use the same format for optimization and execution. This is inefficient, so
we will probably change the executable format.
For the JIT, we will use copy-and-patch to convert the micro-ops to machine code.
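As a rough illustration, here is a toy Python model of the superblock/executor split (all names here are hypothetical, not CPython APIs): the superblock is passive data produced by the optimizer, and the executor is the runtime object built from it.

```python
# Toy model (hypothetical names): a superblock is a specification of
# execution; an executor is the runtime object built from it.
from dataclasses import dataclass

@dataclass(frozen=True)
class MicroOp:
    opcode: str
    operand: int = 0

@dataclass(frozen=True)
class Superblock:
    """A micro-op sequence with minimal control flow."""
    ops: tuple

class Executor:
    """The executable runtime representation of a superblock."""
    def __init__(self, superblock):
        # The tier 2 interpreter currently reuses the optimizer's
        # format, so "creation" is little more than a copy.
        self.code = list(superblock.ops)

def make_executor(sb):
    return Executor(sb)

sb = Superblock((MicroOp("_LOAD_FAST", 0), MicroOp("_POP_TOP")))
ex = make_executor(sb)
assert len(ex.code) == 2
```

For the JIT, `make_executor` would instead emit machine code via copy-and-patch; the separation between specification and executable object stays the same.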
### Exits from executors

On creation of an executor, there will be a number of potential exits
from that executor. For each entry to an executor, exactly one exit
will occur. This means that a few of the exits will be hot, but most will be cold.

We want hot exits to be implemented such that transfer to the next executor is fast.
However, we want cold exits to consume as little memory as possible.
Unfortunately, we cannot know in advance which are which, although there are
some exits we expect to be very rare. We can handle these very rare exits
differently, to save space.
### Linking executors

When an exit from an executor gets hot, we need to create a new executor
to attach to that exit. Once we have done that, we need to link the exit to
the new executor in a way that allows fast transfer of execution.
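The linking step can be sketched with a toy Python model (hypothetical names, not the real runtime structures): once an exit gets hot, we build an executor for its target and patch the exit so later transfers bypass tier 1.

```python
# Hypothetical sketch of linking: patch a hot exit to point directly at
# the newly created executor for its target.
class Executor:
    def __init__(self, name):
        self.name = name
        self.exits = {}  # exit index -> linked Executor, or None if cold

def link_exit(source, exit_index, new_executor):
    # After this, taking the exit transfers straight to new_executor
    # instead of falling back to tier 1.
    source.exits[exit_index] = new_executor

a = Executor("A")
a.exits[3] = None          # a cold exit: nothing attached yet
b = Executor("B")          # built because exit 3 got hot
link_exit(a, 3, b)
assert a.exits[3] is b
```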
### Making progress

If an executor exits before it does any work *and* it exits to another
executor that can also exit before it does any work, or it exits to itself,
we can find ourselves in an infinite loop.

There are two possible approaches to avoiding this:
1. We track which executors might not make
progress and are careful about which executors are linked together, or
2. We require all executors to make progress.

Approach 1 has many edge cases, which makes it hard to reason about and get correct.

Approach 2 is simpler, but may be a bit slower, as we may need to [pessimize the first instruction](#guaranteeing-progress-within-an-executor).

We will use approach 2. It is much easier to reason about and should be almost as fast.

If all executors are guaranteed to make progress, then transfers from one executor
to another can be implemented by a single jump/tail-call.
### Inter-superblock optimization

Some superblocks can be quite short but form a larger region of hot code that
we want to optimize. In order to do that, we want to propagate type information
and representation changes across edges, which means storing that information
on exits while they are cold. Since most exits will remain cold, we need to
store this information in a compact form.
### Making "hot" exits fast and "cold" exits small

In order to make hot exits fast, we need them to be as simple as possible,
passing as little information as possible.

We also want to make cold exits as small as possible, but cold exits
may become hot exits, so we need them both to use the same interface.
## The implementation

This section describes one possible implementation. The initial implementation
will not support inter-superblock optimization, but we should plan to support it
in the future.
### Making "hot" exits fast

In order to make hot exits fast, we want to implement them as a single, unconditional jump,
with the minimum bookkeeping to maintain refcounts.
In the tier 2 interpreter, the jump will be implemented by setting the IP to the first
instruction of the target executor.
In the JIT, the jump will be implemented as a tail call to the function pointer of the target executor.
### Making "cold" exits small

All side exits from an executor will start cold, and most of them will remain cold.

We need to track various pieces of information for cold exits, none of which will be
required once the exit becomes hot, so we want to store that information in a way
that minimizes the cost to hot exits and minimizes the memory used for cold exits.
That information is:
* The offset/location of the pointer to the exit in the executor, so it can be updated.
* The target (the offset into the code object of the tier 1 instruction).
* Any relevant known type information (optional, but it will improve optimization).
* Any representation changes that have been made.
* The "hotness" counter.
### Minimizing memory use

There are a number of ways we can reduce memory use, for example:
* Putting things in (ideally const) arrays, so we can refer to them by a one- or two-byte index
instead of an eight-byte pointer.
* For complex data like type information and representation changes, using trees or deltas,
so that information like `(A, B, C)` can be stored as `(A, B) <- C`, allowing `(A, B)` to
be shared.
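The delta idea can be illustrated with a small Python sketch (a hypothetical helper, not part of the design): interning `(parent, item)` pairs lets `(A, B, C)` and `(A, B, D)` share their `(A, B)` prefix.

```python
# Sketch of delta/tree sharing: store (A, B, C) as ((A, B), C) so the
# (A, B) prefix node is a single shared object.
def intern_chain(items, cache):
    """Build a chain of (parent, item) nodes, reusing cached nodes."""
    node = None
    for item in items:
        key = (id(node), item)
        if key not in cache:
            cache[key] = (node, item)
        node = cache[key]
    return node

cache = {}
abc = intern_chain(("A", "B", "C"), cache)
abd = intern_chain(("A", "B", "D"), cache)
# Both chains share one (A, B) prefix node:
assert abc[0] is abd[0]
```

The same structure works for representation changes: each exit stores only a pointer to (or index of) its leaf node, and common prefixes cost nothing extra.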
### Each executor gets a table of exit data

Each executor will get a table of exit data. We can compute the size of this table
when creating the executor. The entries in this table are described below.
### Fixed number of exit objects

Since the vast majority of exits will not need to be modified (only those that get hot),
we do not want to pass the offset of the exit, so we need to store the offset in the executor.

Since there is a fixed number of micro-ops allowed in a superblock (currently 512), we have an upper
bound on the offset. We will preallocate one exit object per possible offset.
### Exit data

* The offset/location of the exit pointer: not needed if the exit pointer is stored in the exit data.
* The target: has a maximum value of about 10**9, so store it as a `uint32_t`.
* Any relevant known type information: format TBD.
We don't need to decide on the format of this data for now.
* Any representation changes that have been made: likewise, format TBD.
* The "hotness" counter: see below.
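A minimal Python sketch of such a record (the field names are illustrative; the real layout would be a C struct):

```python
# Hypothetical per-exit record matching the list above.
from dataclasses import dataclass

@dataclass
class ExitData:
    target: int                   # tier 1 offset; fits in a uint32_t
    counter: int = 0              # the "hotness" counter
    type_info: object = None      # known type information; format TBD
    repr_changes: object = None   # representation changes to undo; format TBD

    def __post_init__(self):
        # The target must fit in 32 bits.
        assert 0 <= self.target < 2**32

e = ExitData(target=1234)
assert e.target == 1234 and e.counter == 0
```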
#### Representation changes

Representation changes are things like storing the top value(s) of the stack in registers, or performing scalar replacement on objects or frames.
In order to drop back to tier 1, we need to be able to undo those changes, so we need to record them.
#### Hotness counters

We need to track how hot an exit is. Ideally, we want to be able to distinguish
between an exit that is hot, and one that is cold but has exited a large number of
times over prolonged execution.

There are three ways we can store counters:
1. Store a counter for each exit in the superblock.
2. Have one exit object per possible value of the counter, and change the exit object to change the counter.
3. Store the counters in a global (per-interpreter) table. LuaJIT does something like this (but with a very small table).

Option 1 is simple and deterministic.
Option 2 either prevents mapping the exits to offsets, or will require many thousands
of exit objects (number_of_offsets * max_counter_value).
Option 3 will require hashing the executor and offset, and will thus be non-deterministic due to collisions.
It allows us to decay the counters to distinguish between hot counters and long-lived "cool" counters,
by halving all the counters every tick, where a tick might be 1ms, or might vary depending on memory consumption.

We should probably start with option 1, with the intention of
implementing option 3 later.
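A toy Python model of option 1, combined with the halving decay described under option 3 (the threshold value and function names are assumptions, not the real runtime API):

```python
# Toy model: a counter per exit (option 1), plus the halving "decay"
# from option 3 that keeps long-lived cool exits below the threshold.
HOT_THRESHOLD = 16  # hypothetical value

def record_exit(counters, exit_index):
    """Bump the counter; return True when the exit has become hot."""
    counters[exit_index] += 1
    return counters[exit_index] >= HOT_THRESHOLD

def tick(counters):
    # Halve every counter each tick (e.g. every 1ms) so that only exits
    # taken frequently *right now* stay above the threshold.
    for i in range(len(counters)):
        counters[i] //= 2

counters = [0] * 8
for _ in range(20):
    hot = record_exit(counters, 3)
assert hot and counters[3] == 20
tick(counters)
assert counters[3] == 10
```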
### `EXIT_IF` and `UNLIKELY_EXIT_IF`

We will replace the misnamed `DEOPT_IF` (it doesn't de-optimize, it exits)
with `EXIT_IF` and `UNLIKELY_EXIT_IF`.
`EXIT_IF` will be used for most exits.
`UNLIKELY_EXIT_IF` will be used for unlikely exits, like eval-breaker checks or stack overflow.

`EXIT_IF` will exit through the pointer, as described above.
`UNLIKELY_EXIT_IF` will exit to tier 1.

For efficiency reasons, all exits in the same micro-op will use the same exit object and data.
For simplicity, we will also require that no micro-op contains both
an `EXIT_IF` and an `UNLIKELY_EXIT_IF`.

Since `UNLIKELY_EXIT_IF` has no attached exit/executor, it is compact;
it only needs two bytes to store the target.
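The intended difference between the two macros can be modeled in a few lines of Python (hypothetical function names standing in for the C macros):

```python
# Toy model: EXIT_IF transfers through an attached exit object (which may
# be linked to another executor); UNLIKELY_EXIT_IF always drops to tier 1
# and only needs a two-byte target.
def exit_if(condition, exit_object):
    if condition:
        return ("goto_exit", exit_object)  # may jump to a linked executor
    return None  # fall through: keep executing this executor

def unlikely_exit_if(condition, target):
    assert 0 <= target < 2**16             # target fits in two bytes
    if condition:
        return ("deopt_to_tier1", target)  # always back to tier 1
    return None

assert exit_if(True, "exit#3") == ("goto_exit", "exit#3")
assert unlikely_exit_if(True, 100) == ("deopt_to_tier1", 100)
assert unlikely_exit_if(False, 100) is None
```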
### Guaranteeing progress

We need to guarantee progress within an executor and when exiting executors.

#### Guaranteeing progress within an executor

In order to guarantee progress, the superblock cannot `EXIT_IF` without having made progress,
but it can do an `UNLIKELY_EXIT_IF`, since `UNLIKELY_EXIT_IF` always drops to tier 1.

This means that the micro-ops for the first tier 1 instruction in an executor
cannot contain an `EXIT_IF`. In practice, this means that before translating
to micro-ops, the first instruction must be converted to its unspecialized form.
### Exiting to invalid executors

We want `EXIT_IF` to be efficient, so it needs to enter the next executor unconditionally.
This means that when we invalidate an executor, we need to modify that executor so that
it immediately drops into tier 1 upon being executed. Since tier 1 will never enter an invalid
executor, we are thus guaranteed not to execute an invalid executor.
In the tier 2 interpreter, we change the first instruction to `EXIT_TRACE`.
In the JIT, we will need to change the function pointer to point to a function that returns
to tier 1.
### The mechanics of transferring execution between executors

When transferring control, we need to:
1. Incref the new executor.
2. Set `current_executor` to the new executor.
3. Decref the old executor.

In the future, when we have deferred references, we can make the current executor a deferred reference
and skip the incref/decref.
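The three steps above can be sketched in Python (a toy refcount model, not the real `Py_INCREF`/`Py_DECREF` machinery):

```python
# Toy model of the transfer steps. Ordering matters: the old executor
# must stay alive until we have switched current_executor away from it.
class Executor:
    def __init__(self):
        self.refcount = 1

state = {"current_executor": None}

def transfer(old, new):
    new.refcount += 1                  # 1. incref the new executor
    state["current_executor"] = new    # 2. set current_executor
    old.refcount -= 1                  # 3. decref the old executor

a, b = Executor(), Executor()
state["current_executor"] = a
a.refcount += 1                        # current_executor holds a reference
transfer(a, b)
assert state["current_executor"] is b
assert (a.refcount, b.refcount) == (1, 2)
```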
#### JIT compiler

To transfer control between executors, we make a jump, implemented as an indirect
tail-call in the generated stencil.

The generated code must decref the old executor. Since we cannot decref the old executor while still
executing its code, we must pass the old executor as an argument in the tail call.
#### Interpreter

We can do the three steps (incref, update current executor, decref) before entering the new executor, since then we don't need to worry about freeing code that we are running.
Entering the new executor is simple: set the
instruction pointer to the first instruction of the new executor.
## Future optimizations

We plan to leave inter-executor optimizations for the future, in order to get a working
implementation with a JIT compiler ready in good time for 3.13.

### Specialization across executors

For this we will need to record known type information at exits, to avoid redundant checks
and to allow us to create multiple specialized executors for the same tier 1 instructions.

### Representation changes across executors

By tracking representation changes across executors, we can avoid the overhead of restoring
the canonical representation on exits.
For example, if a value is represented by an unboxed float, it is expensive to box it and then unbox it again
across an exit.
This optimization has the potential to be a significant performance win
*and* to consume a lot of memory, so we need to design our data structures carefully.
