Skip to content

True function inlining: design discussion #653

@Fidget-Spinner

Description

@Fidget-Spinner

True function inlining

This is my proposal for function inlining in 3.13. Feedback and alternative proposals welcome.

CPython 3.11 obtained frame inlining. This eliminated most of the overhead of function calls. However, "_PyInterpreterFrame" were still needed. These take the form of the instructions _INIT_CALL_PY_EXACT_ARGS and _PUSH_FRAME.

In CPython 3.13, I plan for true function inlining -- that is, no frame creation at all in the happy case, not even _PyInterpreterFrame frames.

The uops tracer already projects traces through Python function boundaries. The goal is to eliminate _INIT_CALL_PY_EXACT_ARGS, _PUSH_FRAME and _POP_FRAME. Doing this is easy, but doing this while maintaining CPython semantics is very hard.

Before

ADD_TO_TRACE(_CHECK_PEP_523, 1, 0)
ADD_TO_TRACE(_CHECK_FUNCTION_EXACT_ARGS, 1, 4828)
ADD_TO_TRACE(_CHECK_STACK_SPACE, 1, 0)
# Frame creation
ADD_TO_TRACE(_INIT_CALL_PY_EXACT_ARGS, 1, 0)
ADD_TO_TRACE(_SAVE_RETURN_OFFSET, 4, 0)
ADD_TO_TRACE(_PUSH_FRAME, 1, 0)
ADD_TO_TRACE(_SET_IP, 0, 0)
ADD_TO_TRACE(_CHECK_VALIDITY, 0, 0)
ADD_TO_TRACE(_RESUME_CHECK, 0, 0)
...
# Frame removal
ADD_TO_TRACE(_POP_FRAME, 0, 0)

After

_CHECK_PEP_523(1, 12, 0)
_CHECK_FUNCTION_EXACT_ARGS(1, 12, 4828)
_CHECK_STACK_SPACE(1, 12, 0)
_SETUP_TIER2_FRAME(128, 0, 0)
 # Minimal bookkeeping, no frame creation
_PRE_INLINE(7, 0, 39)
_SET_FRAME_NAMES(0, 0, 93831592813424)
...
_POST_INLINE(4, 0, 18446744073709551615)
_SET_FRAME_NAMES(0, 0, 93831592813424)

The cost of a call went from:

Before:

  1. Bump frame chunk pointer (or use new chunk, if out of bumps).
  2. Copy over args to new frame locals.
  3. Frame bookkeeping.

After:

  1. Update a few pointers.
  2. Frame bookkeping (only in the case of sys._getframe, see below).

Locals interleaving

By interleaving the locals of the new call with the stack of the old function frame, we can achieve zero argument copying.

Before a function call, the stack looks like this:

callable self_or_null arg1 arg2
^ (stack pointer)

Notice that the arguments are already laid out as how the locals of the new "frame" will be. We just need to expand the stack to accommodate for more locals. Finally, since we are now pointing into the current stack as our locals, we offset all LOAD_FAST and STORE_FAST into the current stack.

callable self_or_null arg1 arg2
^ LOAD_FAST 0 + offset of inlined frame ^ (stack pointer)

Finally, we must update the attribute names' tuple that the current frame points to, so that the inlined call's attribute loads load the correct attribute names. This is just a simple assignment.

Frame reconstruction

We need to reconstruct frames on every deopt, traceback, sys._getframe, and possibly locals() too. To reconstruct frames, we need to store the following information in the trace for each inlined frame in the current call stack:

  • return offset
  • instruction pointer
  • function object seen at trace creation time.
  • original code object seen at trace creation time (important! because this may differ from the code object seen in the current function object!)
  • current stack entries after the call

For the following call stack:

(Not inlined) Root frame [Inlined Frame 1, Inlined Frame 2, Inlined Frame 3]

We reconstruct from the root frame upwards -- so in the order Inlined Frame 1 -> Inlined Frame 2 -> Inlined Frame 3. Note that we can statically determine the current stack entries of all inlined frames except the topmost frame -- inlined frame 3. For that, we simply need to calculate at runtime from the current stack pointer.

If you want to see how complex this all is, I implemented the reconstruction in roughly 120 lines of code (with lots of debugging code inside), in my previous iteration of my optimizer. This already works with deopts and tracebacks.

sys._getframe and locals()

There is one minor inelegance with how we handle sys._getframe, but it's also a strength. In the case of deopts and tracebacks, after reconstruction, the recreated frames are now the ones being executed. In the case of sys._getframe, the frame being reconstructed is purely for sys._getframe. That means we do not actually execute the reconstructed frames. The reason behind this is that we have no clue where stack_pointer is for the topmost frame, as it is local to ceval.c, so we cannot accurately create the topmost frame's stack. For this reason, we link up the frames as per usual, mark them as inlined virtual, and clear them on _POST_INLINE as per normal. This means even in the case of sys._getframe, the only cost we incur is the cost of the original function overhead plus a little extra. sys._getframe should have no noticeable slowdown.

How will this play with the JIT?

I assume the JIT will get some form of register allocation. If Brandt implements some sort of stack operands cached in register scheme. The locals interleaving will mean Python calls pass their arguments by register automatically, with no additional infrastructure changes required to make the JIT aware of inlining. The only change the JIT would need to be aware of is that it needs to change the stack operands that it caches to the new virtual inlined stack base rather than the original stack base.

Metadata

Metadata

Assignees

No one assigned

    Labels

    No labels
    No labels

    Type

    No type

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions