Description
Unlike #131498, which was a wash for performance, top-of-stack (TOS) caching in the JIT promises substantial performance improvements. This is because we can create several stencils for each uop, each tailored to a specific number of cached registers, and dynamically vary the number of values cached.
For example, in this code:

```
LOAD_FAST_BORROW
LOAD_FAST_BORROW
BINARY_OP_ADD_INT
STORE_FAST
```

we can tailor each version to the number of registers cached:
```
LOAD_FAST_BORROW_0_1   (0 -> 1 registers)
LOAD_FAST_BORROW_1_2   (1 -> 2 registers)
BINARY_OP_ADD_INT_2_1  (2 -> 1 registers)
STORE_FAST_1_0         (1 -> 0 registers)
```

thus avoiding any memory traffic to and from the stack at all.
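One way to picture the specialization above is as a table lookup keyed by the uop and its cached-register counts before and after. This is a hypothetical sketch, not the JIT's actual stencil machinery; the names `STENCILS`, `STACK_EFFECT`, and `specialize` are invented for illustration:

```python
# Hypothetical sketch of selecting stencil variants for a trace.
# Keys are (uop, registers cached before, registers cached after).
STENCILS = {
    ("LOAD_FAST_BORROW", 0, 1): "LOAD_FAST_BORROW_0_1",
    ("LOAD_FAST_BORROW", 1, 2): "LOAD_FAST_BORROW_1_2",
    ("BINARY_OP_ADD_INT", 2, 1): "BINARY_OP_ADD_INT_2_1",
    ("STORE_FAST", 1, 0): "STORE_FAST_1_0",
}

# Net stack effect of each uop (values pushed minus values popped).
STACK_EFFECT = {
    "LOAD_FAST_BORROW": 1,
    "BINARY_OP_ADD_INT": -1,
    "STORE_FAST": -1,
}

def specialize(trace):
    """Map each uop to the variant matching the current cached depth."""
    cached = 0
    out = []
    for uop in trace:
        after = cached + STACK_EFFECT[uop]
        out.append(STENCILS[(uop, cached, after)])
        cached = after
    return out

print(specialize(["LOAD_FAST_BORROW", "LOAD_FAST_BORROW",
                  "BINARY_OP_ADD_INT", "STORE_FAST"]))
# → ['LOAD_FAST_BORROW_0_1', 'LOAD_FAST_BORROW_1_2',
#    'BINARY_OP_ADD_INT_2_1', 'STORE_FAST_1_0']
```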
The exact number of variants per uop will need to be determined empirically.
Having more stencils allows more freedom when generating code, but an excessive number of stencils would cause bloat both at runtime and in any repository containing them.
Spilling and reloading
There will be an upper bound on the number of values cached, and some uops may need a minimum number of values in the cache.
To handle these cases we will need to insert spill and reload uops. Spills reduce the number of cached values, saving them to the in-memory stack; reloads do the opposite, moving values from the in-memory stack into the cache.
E.g.

```
LOAD_FAST_BORROW
BINARY_OP_ADD_INT
```

`BINARY_OP_ADD_INT` expects two inputs, but we only have one cached (from the `LOAD_FAST_BORROW`), so we need to insert a `RELOAD`:

```
LOAD_FAST_BORROW   (0 -> 1 registers)
RELOAD_1_2         (1 -> 2 registers)
BINARY_OP_ADD_INT  (2 -> 1 registers)
```

`SPILL` and `RELOAD` are semantic no-ops, and will be generated automatically.
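The automatic insertion can be sketched as a single pass over the trace that tracks the cached depth, reloading before a uop that needs more inputs than are cached and spilling when the bound is exceeded. The per-uop input/output counts, the cache limit, and the function name are all assumptions made for illustration:

```python
# Hypothetical sketch of automatic SPILL/RELOAD insertion.
UOP_INPUTS = {"LOAD_FAST_BORROW": 0, "BINARY_OP_ADD_INT": 2}   # values popped
UOP_OUTPUTS = {"LOAD_FAST_BORROW": 1, "BINARY_OP_ADD_INT": 1}  # values pushed
MAX_CACHED = 2  # assumed upper bound on values held in registers

def insert_spills_reloads(trace):
    """Return the trace with SPILL/RELOAD uops inserted as needed."""
    cached = 0
    out = []
    for uop in trace:
        # Reload until the uop's required inputs are all cached.
        while cached < UOP_INPUTS[uop]:
            out.append(f"RELOAD_{cached}_{cached + 1}")
            cached += 1
        cached += UOP_OUTPUTS[uop] - UOP_INPUTS[uop]
        out.append(uop)
        # Spill if the uop's outputs exceed the cache bound.
        while cached > MAX_CACHED:
            out.append(f"SPILL_{cached}_{cached - 1}")
            cached -= 1
    return out

print(insert_spills_reloads(["LOAD_FAST_BORROW", "BINARY_OP_ADD_INT"]))
# → ['LOAD_FAST_BORROW', 'RELOAD_1_2', 'BINARY_OP_ADD_INT']
```

This reproduces the example above: the single cached value from `LOAD_FAST_BORROW` is topped up with `RELOAD_1_2` before `BINARY_OP_ADD_INT` runs.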
Deferred references
For this to work, the code generator must spill any cached values to the in-memory stack at any point where GC could occur. Fortunately, the code generator already does this (as part of the work for #131498).
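In terms of the `SPILL_n_m` naming used above, spilling the whole cache before a GC-capable point just means emitting spills down to depth zero. A minimal sketch (the helper `flush_cache` is invented for illustration):

```python
def flush_cache(cached_depth):
    """Emit the SPILL uops that flush all cached values to the
    in-memory stack before a point where GC could occur."""
    return [f"SPILL_{d}_{d - 1}" for d in range(cached_depth, 0, -1)]

print(flush_cache(2))  # → ['SPILL_2_1', 'SPILL_1_0']
```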