|
| 1 | +# Tier 2 execution engine |
| 2 | + |
| 3 | +Author: Mark Shannon |
| 4 | + |
| 5 | +The [plan for 3.13](./README.md) describes the main components that we will produce for 3.13. This document explains how the bits fit together at runtime. |
| 6 | + |
| 7 | +## Overview |
| 8 | + |
| 9 | +The design of the tier 2 optimizer is based around "superblocks". |
| 10 | +These superblocks are a specification of execution, |
| 11 | +rather than something that is executed. |
| 12 | + |
| 13 | +We expect to have two different execution engines for superblocks: |
| 14 | +* The tier 2 interpreter |
| 15 | +* The copy-and-patch compiler |
| 16 | + |
| 17 | +We need to manage entry to and exits from superblocks, so that |
| 18 | +as programs execute, a graph of superblocks will be built up. |
| 19 | + |
| 20 | +While it is critical for performance that execution within |
| 21 | +a superblock is fast, the performance of jumping from one |
| 22 | +superblock to another is also important. |
| 23 | + |
| 24 | +Memory consumption is also important. |
| 25 | + |
| 26 | +### Either interpreter or compiler. Not both. |
| 27 | + |
| 28 | +We assume that there will only be one execution engine. |
| 29 | +If we have a JIT, we will use it to execute *all* superblocks. |
| 30 | +This simplifies things as we do not need to concern ourselves with |
| 31 | +transfers from the tier 2 interpreter to JIT-compiled code, or vice versa. |
| 32 | + |
| 33 | +### Superblocks and executors |
| 34 | + |
| 35 | +A superblock is a sequence of micro-ops with minimal control flow. |
| 36 | +It is the input and output of tier 2 optimization phases. |
| 37 | +An executor is the runtime object that is the executable |
| 38 | +representation of a superblock. |
| 39 | + |
| 40 | +#### Creating executors |
| 41 | + |
| 42 | +Creating executors from superblocks is the first job of the execution engine. |
| 43 | +This transformation is currently trivial for the tier 2 interpreter as we |
| 44 | +use the same format for optimization and execution. This is inefficient, so |
| 45 | +we will probably change the executable format. |
| 46 | +For the JIT, we will use copy-and-patch to convert the micro-ops to machine code. |
| 47 | + |
| 48 | +### Exits from executors |
| 49 | + |
| 50 | +On creation of an executor, there will be a number of potential exits |
| 51 | +from that executor. For each entry to an executor, exactly one exit |
| 52 | +will occur. This means that a few of the exits will be hot, but most will be cold. |
| 53 | + |
| 54 | +We want hot exits to be implemented such that transfer to the next executor is fast. |
| 55 | +However, we want cold exits to consume as little memory as possible. |
| 56 | +Unfortunately we cannot know in advance which are which, although there are |
| 57 | +some exits we expect to be very rare. We can handle these very rare exits |
| 58 | +differently, to save space. |
| 59 | + |
| 60 | +### Linking executors |
| 61 | + |
| 62 | +When an exit from an executor gets hot we need to create a new executor |
| 63 | +to attach to that exit. Once we have done that we need to link the exit to |
| 64 | +the new executor in a way that allows fast transfer of execution. |
| 65 | + |
| 66 | +### Making progress |
| 67 | + |
| 68 | +If an executor exits before it does any work *and* it exits to another |
| 69 | +executor that can also exit before it does any work, or it exits to itself, |
| 70 | +we can find ourselves in an infinite loop. |
| 71 | + |
| 72 | +There are two possible approaches to avoiding this: |
| 73 | +1. We track which executors might not make |
| 74 | +progress and are careful about which executors are linked together, or |
| 75 | +2. We require all executors to make progress. |
| 76 | + |
| 77 | +Approach 1 has many edge cases, which makes it hard to reason about and get correct. |
| 78 | + |
| 79 | +Approach 2 is simpler, but may be a bit slower as we may need to [pessimize the first instruction](#guaranteeing-progress-within-an-executor). |
| 80 | + |
| 81 | +We will use approach 2. It is much easier to reason about and should be almost as fast. |
| 82 | + |
| 83 | +If all executors are guaranteed to make progress, then transfers from one executor |
| 84 | +to another can be implemented by a single jump/tail-call. |
| 85 | + |
| 86 | +### Inter-superblock optimization |
| 87 | + |
| 88 | +Some superblocks can be quite short, but form a larger region of hot code that |
| 89 | +we want to optimize. In order to do that we want to propagate type information |
| 90 | +and representation changes across edges, which means storing that information |
| 91 | +on exits when they are cold. Since most exits will remain cold, we need to |
| 92 | +store this information in a compact form. |
| 93 | + |
| 94 | +### Making "hot" exits fast and "cold" exits small |
| 95 | + |
| 96 | +In order to make hot exits fast, we need them to be as simple as possible, |
| 97 | +passing as little information as possible. |
| 98 | + |
| 99 | +We also want to make cold exits as small as possible, but cold exits |
| 100 | +may become hot exits, so we need them both to use the same interface. |
| 101 | + |
| 102 | +## The implementation |
| 103 | + |
| 104 | +This section describes one possible implementation. The initial implementation |
| 105 | +will not support inter-superblock optimization, but we should plan to support it |
| 106 | +in the future. |
| 107 | + |
| 108 | +### Making "hot" exits fast |
| 109 | + |
| 110 | +In order to make hot exits fast, we want to implement them as a single, unconditional jump, |
| 111 | +with the minimum bookkeeping to maintain refcounts. |
| 112 | +In the tier 2 interpreter, the jump will be implemented by setting the IP to the first |
| 113 | +instruction of the target executor. |
| 114 | +In the JIT, the jump will be implemented as a tail call to the function pointer of the target executor. |
| 115 | + |
| 116 | +### Making "cold" exits small |
| 117 | + |
| 118 | +All side exits from an executor will start cold, and most of them will remain cold. |
| 119 | + |
| 120 | +We need to track various pieces of information for cold exits, none of which will be |
| 121 | +required once the exit becomes hot, so we want to store that information in a way |
| 122 | +that minimizes the cost to hot exits and minimizes the memory used for cold exits. |
| 123 | +That information is: |
| 124 | +* The offset/location of the pointer to the exit in the executor, so it can be updated. |
| 125 | +* The target (offset into the code object of the tier 1 instruction) |
| 126 | +* Any relevant known type information (this is optional but will improve optimization) |
| 127 | +* Any representation changes that have been made. |
| 128 | +* The "hotness" counter |
| 129 | + |
| 130 | +### Minimizing memory use |
| 131 | + |
| 132 | +There are a number of ways we can reduce memory use, for example: |
| 133 | +* Putting things in (ideally const) arrays, so we can refer to them by a one- or two-byte index, |
| 134 | + instead of an eight-byte pointer. |
| 135 | +* For complex data like type information and representation changes, use trees or deltas, |
| 136 | + so that information like `(A, B, C)` can be stored as `(A, B) <- C`, allowing `(A, B)` to |
| 137 | + be shared. |
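The delta idea can be illustrated with a toy model in Python. This is only a sketch under assumptions: `DeltaTable` and its methods are hypothetical names, and the real format for type information is still undecided. Each tuple is stored as a (parent index, last item) node, so tuples with a common prefix share the nodes for that prefix and can be referred to by a small integer index.

```python
class DeltaTable:
    """Hypothetical table storing tuples as (parent_index, last_item) nodes."""

    EMPTY = 0  # index that stands for the empty tuple

    def __init__(self):
        self._nodes = [None]  # slot 0 is reserved for the empty tuple
        self._index = {}      # (parent_index, item) -> node index

    def extend(self, parent, item):
        """Return the index of the tuple at `parent` with `item` appended."""
        key = (parent, item)
        if key not in self._index:
            self._index[key] = len(self._nodes)
            self._nodes.append(key)
        return self._index[key]

    def intern(self, items):
        """Return the index for `items`, reusing nodes for shared prefixes."""
        index = self.EMPTY
        for item in items:
            index = self.extend(index, item)
        return index

    def expand(self, index):
        """Rebuild the full tuple by walking parent links back to the root."""
        items = []
        while index != self.EMPTY:
            index, item = self._nodes[index]
            items.append(item)
        return tuple(reversed(items))

    def __len__(self):
        return len(self._nodes) - 1  # do not count the reserved empty slot


table = DeltaTable()
abc = table.intern(("A", "B", "C"))
abd = table.intern(("A", "B", "D"))  # reuses the nodes for ("A", "B")
```

In this model the two three-element tuples need four nodes in total rather than six, because the `(A, B)` prefix is stored once.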
| 138 | + |
| 139 | +### Each executor gets a table of exit data |
| 140 | + |
| 141 | +Each executor will get a table of exit data. We can compute the size of this |
| 142 | +when creating the executor. The entries in this table are described below. |
| 143 | + |
| 144 | +### Fixed number of exit objects |
| 145 | + |
| 146 | +Since the vast majority of exits will not need to be modified (only those that get hot), |
| 147 | +we do not want to pass the offset of the exit, so we need to store the offset in the executor. |
| 148 | + |
| 149 | +Since the number of micro-ops allowed in a superblock is capped (currently at 512), we have an upper |
| 150 | +bound on the offset. We will preallocate one exit object per possible offset. |
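One way to read this, sketched as a toy Python model: a single shared set of cold exit objects is allocated up front, one per offset, and every executor's exit slots initially point at the shared object for their offset. The names `ColdExit`, `COLD_EXITS`, `Executor.link`, and `slot_to_patch` are hypothetical and only illustrate the idea.

```python
MAX_UOPS = 512  # the current superblock limit stated above


class ColdExit:
    """Shared, preallocated exit object that knows only its own offset."""

    def __init__(self, offset):
        self.offset = offset


# One cold exit per possible offset, allocated once and shared by all executors.
COLD_EXITS = tuple(ColdExit(i) for i in range(MAX_UOPS))


class Executor:
    def __init__(self, exit_offsets):
        # Every exit slot starts out pointing at the shared cold exit for its offset.
        self.exits = {offset: COLD_EXITS[offset] for offset in exit_offsets}

    def link(self, offset, target):
        """Replace the cold exit at `offset` with a target executor."""
        self.exits[offset] = target


def slot_to_patch(executor, cold_exit):
    """The exiting executor plus the cold exit's own offset identify the slot,
    so the offset never has to be passed along on the (hot) exit path."""
    assert executor.exits[cold_exit.offset] is cold_exit
    return cold_exit.offset


a = Executor([3, 17])
b = Executor([5])
```

Because the cold exit objects are shared, their cost is fixed at 512 objects no matter how many executors exist.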
| 151 | + |
| 152 | +### Exit data |
| 153 | + |
| 154 | +* The offset/location of the exit pointer: Not needed if the exit pointer is stored in the exit data. |
| 155 | +* The target: Has a maximum value of about 10**9, so store as a `uint32_t` |
| 156 | +* Any relevant known type information: Format TBD. |
| 157 | + We don't need to decide on the format of this data for now. |
| 158 | +* Any representation changes that have been made: Likewise, format TBD. |
| 159 | +* The "hotness" counter. See below. |
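To get a feel for the sizes involved, the fixed-width fields can be modelled with Python's `struct` module. The layout here is an assumption made for illustration: a `uint32_t` target as described above, plus a 16-bit hotness counter whose width this document does not specify. Type information and representation changes are left out because their format is undecided.

```python
import struct

# Hypothetical layout: uint32 target, uint16 counter, little-endian, no padding.
EXIT_DATA = struct.Struct("<IH")


def pack_exit(target, counter):
    """Pack one exit-data entry into its fixed-size byte representation."""
    return EXIT_DATA.pack(target, counter)


def unpack_exit(raw):
    """Return the (target, counter) pair stored in `raw`."""
    return EXIT_DATA.unpack(raw)


raw = pack_exit(1_000_000_000, 0)  # a target near the stated maximum still fits
```

Under these assumptions each entry takes six bytes, and a target of about 10**9 fits in 32 bits but not in 16.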
| 160 | + |
| 161 | +#### Representation changes |
| 162 | + |
| 163 | +Representation changes are things like storing the top value(s) of the stack in registers, or performing scalar replacement on objects or frames. |
| 164 | +In order to drop back to tier 1, we need to be able to undo those changes, so we need to record them. |
| 165 | + |
| 166 | +#### Hotness counters |
| 167 | + |
| 168 | +We need to track how hot an exit is. Ideally we want to be able to distinguish |
| 169 | +between an exit that is hot, and one that is cold but has exited a large number of |
| 170 | +times over prolonged execution. |
| 171 | + |
| 172 | +There are three ways we can store counters: |
| 173 | +1. Store a counter for each exit in the superblock. |
| 174 | +2. Have one exit per possible value of the counter, and change the exit object to change the counter. |
| 175 | +3. Store the counters in a global (per-interpreter) table. LuaJIT does something like this (but with a very small table). |
| 176 | +<!-- Prevent 1 below being numbered as 4 --> |
| 177 | +1. is simple and deterministic. |
| 178 | +2. either prevents mapping the exits to offsets, or will require many thousands |
| 179 | + of exit objects (number_of_offsets * max_counter_value) |
| 180 | +3. will require hashing the executor and offset, and thus be non-deterministic due to collisions. |
| 181 | + It allows us to decay the counters to distinguish between hot counters and long-lived "cool" counters, |
| 182 | + by halving all the counters every tick, where a tick might be 1ms or might vary depending on memory consumption. |
| 183 | + |
| 184 | + |
| 185 | +We should probably start with option 1, with the intention of |
| 186 | +implementing option 3 later. |
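The difference between options 1 and 3 can be sketched with a toy Python model. The class names, table size, and threshold below are all hypothetical, and the hashing scheme is only a stand-in for whatever a real implementation would use.

```python
class PerExitCounters:
    """Option 1: one counter per exit, stored alongside the executor."""

    def __init__(self, n_exits, threshold):
        self.counts = [0] * n_exits
        self.threshold = threshold

    def record(self, exit_index):
        """Count one exit; return True once the exit reaches the threshold."""
        self.counts[exit_index] += 1
        return self.counts[exit_index] >= self.threshold


class GlobalCounterTable:
    """Option 3: fixed-size shared table indexed by a hash of (executor, offset).
    Distinct exits may collide and share a counter."""

    def __init__(self, size, threshold):
        self.counts = [0] * size
        self.threshold = threshold

    def _slot(self, executor_id, offset):
        return hash((executor_id, offset)) % len(self.counts)

    def record(self, executor_id, offset):
        slot = self._slot(executor_id, offset)
        self.counts[slot] += 1
        return self.counts[slot] >= self.threshold

    def tick(self):
        """Decay step: halve every counter."""
        self.counts = [count // 2 for count in self.counts]


per_exit = PerExitCounters(n_exits=2, threshold=4)
shared = GlobalCounterTable(size=64, threshold=4)
```

With decay, an exit taken once per tick stays below the threshold indefinitely, while four exits within a single tick reach it. Option 1 cannot make that distinction, since its counters only ever grow.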
| 187 | + |
| 188 | +### `EXIT_IF` and `UNLIKELY_EXIT_IF` |
| 189 | + |
| 190 | +We will replace the misnamed `DEOPT_IF` (it doesn't de-optimize, it exits) |
| 191 | +with `EXIT_IF` and `UNLIKELY_EXIT_IF`. |
| 192 | +`EXIT_IF` will be used for most exits. |
| 193 | +`UNLIKELY_EXIT_IF` will be used for unlikely exits, like eval breaker checks or stack overflow. |
| 194 | + |
| 195 | +`EXIT_IF` will exit through the pointer, as described above. |
| 196 | +`UNLIKELY_EXIT_IF` will exit to tier 1. |
| 197 | + |
| 198 | +For efficiency reasons, all exits in the same micro-op will use the same exit object and data. |
| 199 | +For simplicity, we will also require that no micro-op contains both |
| 200 | +an `EXIT_IF` and an `UNLIKELY_EXIT_IF`. |
| 201 | + |
| 202 | +Since `UNLIKELY_EXIT_IF` has no attached exit/executor it is compact; |
| 203 | +it only needs two bytes to store the target. |
| 204 | + |
| 205 | +### Guaranteeing progress |
| 206 | + |
| 207 | +We need to guarantee progress within an executor and when exiting executors. |
| 208 | + |
| 209 | +#### Guaranteeing progress within an executor |
| 210 | + |
| 211 | +In order to guarantee progress, the superblock cannot `EXIT_IF` without having made progress, |
| 212 | +but it can do an `UNLIKELY_EXIT_IF` since `UNLIKELY_EXIT_IF` always drops to tier 1. |
| 213 | + |
| 214 | +This means that the micro-ops for the first tier 1 instruction in an executor |
| 215 | +cannot contain an `EXIT_IF`. In practice this means that before translating |
| 216 | +to micro-ops, the first instruction must be converted to its unspecialized form. |
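As a rough sketch of that conversion step in Python: the mapping below is a small illustrative subset written by hand, and `pessimize_first` is a hypothetical helper, not an existing CPython function. The real set of specialized instructions is defined by CPython itself.

```python
# Illustrative subset only; the full table of specializations lives in CPython.
UNSPECIALIZED_FORM = {
    "BINARY_OP_ADD_INT": "BINARY_OP",
    "LOAD_ATTR_INSTANCE_VALUE": "LOAD_ATTR",
}


def pessimize_first(instructions):
    """Return a copy whose first instruction is replaced by its unspecialized
    form; later instructions keep their specialized forms."""
    if not instructions:
        return []
    first = UNSPECIALIZED_FORM.get(instructions[0], instructions[0])
    return [first] + list(instructions[1:])


trace = pessimize_first(["BINARY_OP_ADD_INT", "LOAD_ATTR_INSTANCE_VALUE"])
```

Only the first instruction loses its specialization, which is why the expected cost of this approach is small.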
| 217 | + |
| 218 | +### Exiting to invalid executors |
| 219 | + |
| 220 | +We want `EXIT_IF` to be efficient, so it needs to enter the next executor unconditionally. |
| 221 | +This means that when we invalidate an executor, we need to modify that executor so that |
| 222 | +it immediately drops into tier 1 upon being executed. Since tier 1 will never enter an invalid |
| 223 | +executor, we are thus guaranteed not to execute an invalid executor. |
| 224 | +In the tier 2 interpreter, we change the first instruction to `EXIT_TRACE`. |
| 225 | +In the JIT, we will need to change the function pointer to point to a function that returns |
| 226 | +to tier 1. |
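For the tier 2 interpreter case, the idea can be modelled in a few lines of Python. This is a toy sketch: `Executor`, `run`, and the `UOP_A`/`UOP_B` names are hypothetical, and only `EXIT_TRACE` comes from the text above. The point is that entry performs no validity check; invalidation rewrites the executor so that entering it does nothing.

```python
EXIT_TRACE = "EXIT_TRACE"


class Executor:
    def __init__(self, uops):
        self.uops = list(uops)

    def invalidate(self):
        # Overwrite the first instruction so that any later entry leaves at once.
        self.uops[0] = EXIT_TRACE


def run(executor):
    """Toy tier 2 loop: return the uops executed before dropping to tier 1.
    Note that there is no 'is this executor valid?' check on entry."""
    executed = []
    for uop in executor.uops:
        if uop == EXIT_TRACE:
            break
        executed.append(uop)
    return executed


executor = Executor(["UOP_A", "UOP_B", EXIT_TRACE])
before = run(executor)
executor.invalidate()
after = run(executor)
```

Executors that still hold a link to the invalidated one do not need to be found and patched; they jump to it as before and immediately fall through to tier 1.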
| 227 | + |
| 228 | +### The mechanics of transferring execution between executors |
| 229 | + |
| 230 | +When transferring control we need to: |
| 231 | +1. Incref the new executor |
| 232 | +2. Set `current_executor` to the new executor |
| 233 | +3. Decref the old executor. |
| 234 | + |
| 235 | +In the future, when we have deferred references, we can make the current executor a deferred reference |
| 236 | +and skip the incref/decref. |
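The ordering of the three steps can be checked with a toy reference-counting model in Python. Everything except the name `current_executor` is hypothetical here, including `ThreadState`, `incref`, `decref`, and the `freed` flag that stands in for deallocation.

```python
class Executor:
    def __init__(self, name):
        self.name = name
        self.refcount = 1   # reference held by whoever created the executor
        self.freed = False


def incref(executor):
    executor.refcount += 1


def decref(executor):
    executor.refcount -= 1
    if executor.refcount == 0:
        executor.freed = True  # stands in for deallocation


class ThreadState:
    def __init__(self, executor):
        incref(executor)
        self.current_executor = executor


def transfer(tstate, new):
    old = tstate.current_executor
    incref(new)                     # 1. incref the new executor
    tstate.current_executor = new   # 2. set current_executor to the new executor
    decref(old)                     # 3. decref the old executor


a = Executor("a")
b = Executor("b")
tstate = ThreadState(a)
decref(a)            # the creator drops its reference; only tstate keeps `a` alive
transfer(tstate, b)  # `a` loses its last reference here
```

Doing the incref before the decref matters when an executor exits to itself: in this model the count never touches zero during such a transfer, so the executor is not freed while it is still current.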
| 237 | + |
| 238 | +#### JIT compiler |
| 239 | + |
| 240 | +To transfer control between executors, we make a jump, implemented as an indirect |
| 241 | +tail-call in the generated stencil. |
| 242 | + |
| 243 | +The generated code must decref the old executor after the transfer, as we cannot decref an executor while still |
| 244 | +executing it. This means we must pass the old executor as an argument in the tail call. |
| 245 | + |
| 246 | + |
| 247 | +#### Interpreter |
| 248 | + |
| 249 | +We can do the three steps (incref, update current executor, decref) before entering the new executor, since then we don't need to worry about freeing code that we are running. |
| 250 | +Entering the new executor is simple: set the |
| 251 | +instruction pointer to the first instruction of the new executor. |
| 252 | + |
| 253 | +## Future optimizations |
| 254 | + |
| 255 | +We plan to leave inter-executor optimizations for the future, in order to get a working |
| 256 | +implementation with a JIT compiler ready in good time for 3.13. |
| 257 | + |
| 258 | +### Specialization across executors |
| 259 | + |
| 260 | +For this we will need to record known type information at exits, to avoid redundant checks, |
| 261 | +and to allow us to create multiple specialized executors for the same tier 1 instructions. |
| 262 | + |
| 263 | +### Representation changes across executors |
| 264 | + |
| 265 | +By tracking representation changes across executors, we can avoid the overhead of restoring |
| 266 | +the canonical representation on exits. |
| 267 | +For example, if a value is represented by an unboxed float, it is expensive to box, then unbox it |
| 268 | +across an exit. |
| 269 | +This optimization has the potential to be a significant performance win |
| 270 | +*and* to consume a lot of memory, so we need to design our data structures carefully. |