
Call design for Tier 2 (uops) interpreter #106581

@gvanrossum

Description

(Maybe this is tentative enough that it still belongs in the faster-cpython/ideas tracker, but I hope we're close enough that we can hash it out here. CC @markshannon, @brandtbucher)

(This is a WIP until I have looked a bit deeper into this.)

First order of business is splitting some of the CALL specializations into multiple ops that satisfy the uop requirement: either use oparg and no cache entries, or use no oparg and at most one cache entry. For example, one of the more important ones, CALL_PY_EXACT_ARGS, uses both oparg (the number of arguments) and a cache entry (func_version). Splitting it into a guard op and an action op is problematic: even discounting the possibility of encountering a bound method (i.e., assuming method is NULL), it contains the following DEOPT_IF checks:

            // PyObject *callable = stack_pointer[-1-oparg];
            DEOPT_IF(tstate->interp->eval_frame, CALL);
            int argcount = oparg;
            DEOPT_IF(!PyFunction_Check(callable), CALL);
            PyFunctionObject *func = (PyFunctionObject *)callable;
            DEOPT_IF(func->func_version != func_version, CALL);
            PyCodeObject *code = (PyCodeObject *)func->func_code;
            DEOPT_IF(code->co_argcount != argcount, CALL);
            DEOPT_IF(!_PyThreadState_HasStackSpace(tstate, code->co_framesize), CALL);

If we wanted to combine all this into a single guard op, that guard would require access to both oparg (to dig out the callable) and the func_version cache entry -- exactly what the uop requirement forbids. The fundamental problem is that the callable, which needs to be prodded and poked for the guard to pass, is buried under the arguments, and we need oparg to know how deep it is buried.
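
To make that concrete, here is a sketch (not real code) of what such a combined guard would have to do under the current layout:

    // Sketch only: a combined guard under the current stack layout.
    // Merely locating the callable already consumes oparg...
    PyObject *callable = stack_pointer[-1 - oparg];
    DEOPT_IF(!PyFunction_Check(callable), CALL);
    PyFunctionObject *func = (PyFunctionObject *)callable;
    // ...and this check consumes a cache entry, so the op needs both,
    // which a uop may not do.
    DEOPT_IF(func->func_version != func_version, CALL);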

What if we somehow reversed this so that the callable is on top of the stack, after the arguments? We could arrange for this by adding a COPY n+1 opcode just before the CALL opcode (or its specializations). In fact, this could even be a blessing in disguise, since we would no longer need to push a NULL before the callable to reserve space for self -- instead, if the callable is found to be a bound method, its self can overwrite the original callable (below the arguments) and the function extracted from the bound method can overwrite the copy of the callable above the arguments. This has the advantage of no longer needing a "push NULL" bit in several other opcodes (the LOAD_GLOBAL and LOAD_ATTR families -- we'll have to review the logic in LOAD_ATTR a bit more to make sure this can work).

(Note that the key reason why the callable is buried below the arguments is a requirement about evaluation order in expressions: the language reference requires that in the expression F(X), where F and X are themselves possibly complex expressions, F is evaluated before X.)

Comparing before and after: currently we have the following arrangement on the stack when CALL n (or any of its specializations) is reached:

    NULL
    callable
    arg[0]
    arg[1]
    ...
    arg[n-1]

This is obtained by, e.g.,

    PUSH_NULL
    LOAD_FAST callable
    <load n args>
    CALL n

or

    LOAD_GLOBAL (NULL + callable)
    <load n args>
    CALL n

or

    LOAD_ATTR (NULL|self + callable)
    <load n args>
    CALL n

Under my proposal the arrangement would change to

    callable
    arg[0]
    arg[1]
    ...
    arg[n-1]
    callable

and it would be obtained by

    LOAD_FAST callable  /  LOAD_GLOBAL callable  /  LOAD_ATTR callable
    <load n args>
    COPY n+1
    CALL n
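
With the callable duplicated on top of the stack, the guard no longer needs oparg to find it. A minimal sketch of a possible split (hypothetical, not actual CPython code):

    // Guard A (no oparg, one cache entry): everything that only needs
    // the callable, which now sits on top of the stack.
    PyObject *callable = stack_pointer[-1];
    DEOPT_IF(tstate->interp->eval_frame, CALL);
    DEOPT_IF(!PyFunction_Check(callable), CALL);
    PyFunctionObject *func = (PyFunctionObject *)callable;
    DEOPT_IF(func->func_version != func_version, CALL);
    PyCodeObject *code = (PyCodeObject *)func->func_code;
    DEOPT_IF(!_PyThreadState_HasStackSpace(tstate, code->co_framesize), CALL);

    // Guard B (oparg, no cache entries; a separate uop, hence its own
    // scope): the argument-count check, re-deriving the code object from
    // the callable on top of the stack.
    PyCodeObject *code = (PyCodeObject *)
        ((PyFunctionObject *)stack_pointer[-1])->func_code;
    DEOPT_IF(code->co_argcount != oparg, CALL);

Each half would respect the uop constraint: guard A uses the func_version cache entry but not oparg; guard B uses oparg but no cache entries.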

It would (perhaps) even be permissible for the guard to overwrite both copies of the callable if a method is detected; the stack would then change from

    self.func
    <n args>
    self.func

to

    self
    <n args>
    func

where we would be assured that func has type PyFunctionObject *. (However, I think we ought to have separate specializations for the two cases, since the transformation would also require bumping oparg.)
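
Concretely, the guard/transform for the method case could look something like this (sketch only; refcounting shown with Py_SETREF, and the action op that follows would run with oparg + 1 arguments):

    // Sketch only: rewrite both copies of the callable in place when the
    // call site has specialized for bound methods.
    PyObject *bound = stack_pointer[-1];            // top copy of the callable
    DEOPT_IF(!PyMethod_Check(bound), CALL);
    PyObject *func = Py_NewRef(PyMethod_GET_FUNCTION(bound));
    PyObject *self = Py_NewRef(PyMethod_GET_SELF(bound));
    Py_SETREF(stack_pointer[-2 - oparg], self);     // bottom copy -> self
    Py_SETREF(stack_pointer[-1], func);             // top copy -> func
    // self now sits directly below arg[0], so the subsequent action op can
    // behave as if it had been invoked with oparg + 1 arguments.

Note that this guard uses oparg (to find the bottom copy) but no cache entries, so it too fits the uop constraint.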

The runtime cost would be an extra COPY instruction before each CALL; however, I think this might actually be simpler than the dynamic check for bound methods, at least when using copy-and-patch.
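
For reference, COPY n+1 is cheap: the Tier 1 COPY instruction is essentially one stack read, one incref, and one stack write (roughly, from bytecodes.c):

    inst(COPY, (bottom, unused[oparg-1] -- bottom, unused[oparg-1], top)) {
        assert(oparg > 0);
        top = Py_NewRef(bottom);
    }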

Another cost would be requiring extra specializations for some cases that currently decide dynamically between function and method; but again, I think that with copy-and-patch this is probably worth it, given that we expect the dynamic check to always go the same way for a specific location.


Labels: interpreter-core, performance
