-
Notifications
You must be signed in to change notification settings - Fork 52
Description
True function inlining
This is my proposal for function inlining in 3.13. Feedback and alternative proposals welcome.
CPython 3.11 obtained frame inlining. This eliminated most of the overhead of function calls. However, "_PyInterpreterFrame" were still needed. These take the form of the instructions _INIT_CALL_PY_EXACT_ARGS
and _PUSH_FRAME
.
In CPython 3.13, I plan for true function inlining -- that is, no frame creation at all in the happy case, not even _PyInterpreterFrame
frames.
The uops tracer already projects traces through Python function boundaries. The goal is to eliminate _INIT_CALL_PY_EXACT_ARGS
, _PUSH_FRAME
and _POP_FRAME
. Doing this is easy, but doing this while maintaining CPython semantics is very hard.
Before
ADD_TO_TRACE(_CHECK_PEP_523, 1, 0)
ADD_TO_TRACE(_CHECK_FUNCTION_EXACT_ARGS, 1, 4828)
ADD_TO_TRACE(_CHECK_STACK_SPACE, 1, 0)
# Frame creation
ADD_TO_TRACE(_INIT_CALL_PY_EXACT_ARGS, 1, 0)
ADD_TO_TRACE(_SAVE_RETURN_OFFSET, 4, 0)
ADD_TO_TRACE(_PUSH_FRAME, 1, 0)
ADD_TO_TRACE(_SET_IP, 0, 0)
ADD_TO_TRACE(_CHECK_VALIDITY, 0, 0)
ADD_TO_TRACE(_RESUME_CHECK, 0, 0)
...
# Frame removal
ADD_TO_TRACE(_POP_FRAME, 0, 0)
After
_CHECK_PEP_523(1, 12, 0)
_CHECK_FUNCTION_EXACT_ARGS(1, 12, 4828)
_CHECK_STACK_SPACE(1, 12, 0)
_SETUP_TIER2_FRAME(128, 0, 0)
# Minimal bookkeeping, no frame creation
_PRE_INLINE(7, 0, 39)
_SET_FRAME_NAMES(0, 0, 93831592813424)
...
_POST_INLINE(4, 0, 18446744073709551615)
_SET_FRAME_NAMES(0, 0, 93831592813424)
The cost of a call went from:
Before:
- Bump frame chunk pointer (or use new chunk, if out of bumps).
- Copy over args to new frame locals.
- Frame bookkeeping.
After:
- Update a few pointers.
- Frame bookkeping (only in the case of
sys._getframe
, see below).
Locals interleaving
By interleaving the locals of the new call with the stack of the old function frame, we can achieve zero argument copying.
Before a function call, the stack looks like this:
callable | self_or_null | arg1 | arg2 |
---|---|---|---|
^ (stack pointer) |
Notice that the arguments are already laid out as how the locals of the new "frame" will be. We just need to expand the stack to accommodate for more locals. Finally, since we are now pointing into the current stack as our locals, we offset all LOAD_FAST
and STORE_FAST
into the current stack.
callable | self_or_null | arg1 | arg2 | |
---|---|---|---|---|
^ LOAD_FAST 0 + offset of inlined frame |
^ (stack pointer) |
Finally, we must update the attribute names' tuple that the current frame points to, so that the inlined call's attribute loads load the correct attribute names. This is just a simple assignment.
Frame reconstruction
We need to reconstruct frames on every deopt, traceback, sys._getframe
, and possibly locals()
too. To reconstruct frames, we need to store the following information in the trace for each inlined frame in the current call stack:
- return offset
- instruction pointer
- function object seen at trace creation time.
- original code object seen at trace creation time (important! because this may differ from the code object seen in the current function object!)
- current stack entries after the call
For the following call stack:
(Not inlined) Root frame [Inlined Frame 1, Inlined Frame 2, Inlined Frame 3]
We reconstruct from the root frame upwards -- so in the order Inlined Frame 1 -> Inlined Frame 2 -> Inlined Frame 3
. Note that we can statically determine the current stack entries of all inlined frames except the topmost frame -- inlined frame 3
. For that, we simply need to calculate at runtime from the current stack pointer.
If you want to see how complex this all is, I implemented the reconstruction in roughly 120 lines of code (with lots of debugging code inside), in my previous iteration of my optimizer. This already works with deopts and tracebacks.
sys._getframe
and locals()
There is one minor inelegance with how we handle sys._getframe
, but it's also a strength. In the case of deopts and tracebacks, after reconstruction, the recreated frames are now the ones being executed. In the case of sys._getframe
, the frame being reconstructed is purely for sys._getframe
. That means we do not actually execute the reconstructed frames. The reason behind this is that we have no clue where stack_pointer
is for the topmost frame, as it is local to ceval.c
, so we cannot accurately create the topmost frame's stack. For this reason, we link up the frames as per usual, mark them as inlined virtual
, and clear them on _POST_INLINE
as per normal. This means even in the case of sys._getframe
, the only cost we incur is the cost of the original function overhead plus a little extra. sys._getframe should have no noticeable slowdown.
How will this play with the JIT?
I assume the JIT will get some form of register allocation. If Brandt implements some sort of stack operands cached in register scheme. The locals interleaving will mean Python calls pass their arguments by register automatically, with no additional infrastructure changes required to make the JIT aware of inlining. The only change the JIT would need to be aware of is that it needs to change the stack operands that it caches to the new virtual inlined stack base rather than the original stack base.