Faster builtin functions by performing more work at specialization time. #296
Replies: 5 comments
-
Isn't there a near-infinite number of builtin functions and methods that we'd have to convert this way before it pays off? Even if we could collect the set of most-used builtins from PyPerformance (or the Bloomberg demo for that matter), that wouldn't necessarily translate to other apps.
-
It's up to the extension authors. If they want their extensions to be fast, they can use this. I don't expect any immediate results.
-
One possible idea in this general space would be to make cython-generated extensions use the new mechanism automatically, which would bring the benefits to a whole bunch of extensions at once.
-
It looks like we might get results for this sooner rather than later. The latest stats show large slowdowns for the regex benchmarks relative to 3.10, which is (from a fairly superficial inspection) due to non-specialization (and thus repeatedly cycling through the specializer) of Replacing all these special cases with
-
@cfbolz Would this be useful for PyPy?
-
A lot of the work done by builtin functions, especially simpler ones, is argument parsing, type checking, unboxing and boxing.
If you take a look at `PyArg_ParseTupleAndKeywords` you will see the amount of work that needs to be done to handle the general case. Many builtin functions have five phases:

1. Parse the arguments into parameter slots.
2. Check the argument types.
3. Unbox the values.
4. Do the actual work.
5. Box the result.
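These phases can be modelled in Python. The sketch below is purely illustrative, a toy `my_round` builtin whose names do not exist in CPython; it only shows where the per-call overhead lives:

```python
def my_round(args, kwargs):
    # Toy model of a builtin with signature my_round(number, ndigits=0).
    # 1. Parse: match positional and keyword arguments to parameter slots.
    params = {"number": None, "ndigits": 0}
    for name, value in zip(params, args):
        params[name] = value
    params.update(kwargs)
    # 2. Type check.
    if not isinstance(params["number"], float):
        raise TypeError("expected a float")
    # 3. Unbox: in C this would extract the raw double from the PyObject.
    raw = params["number"]
    # 4. Do the actual work.
    result = round(raw, params["ndigits"])
    # 5. Box: in C the raw result would be wrapped back into a PyObject.
    return result
```

Only phase 4 is the useful work; phases 1, 2, 3 and 5 are overhead repeated on every call.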
Argument clinic already generates wrappers that break the above down into 3 phases, with the parsing handled by the `PyArg_Parse...` family of functions. We can change this so that the argument clinic generated code skips the parse phase:
The interpreter can do the "0th" phase of parsing the arguments.
This would work as follows:
- Add a new `METH_N` calling convention that takes exactly `N` arguments, as defined in the `MethodDef` struct. So, if `N` were 1, the function pointer would have the signature `f(PyObject *callable, PyObject *args[1])`.
- There would need to be an upper limit on `N`, probably about 6.
- The vectorcall implementation of this would need to do the parsing, but that's no less efficient than what we do at the moment.
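The parsing the vectorcall wrapper would have to do can be sketched in Python. Everything here (names, error messages) is an illustrative assumption, not CPython API:

```python
def make_methn_wrapper(func, param_names, defaults):
    """Model of a vectorcall wrapper for a hypothetical METH_N function:
    it parses any positional/keyword call into exactly N slots."""
    n = len(param_names)
    index = {name: i for i, name in enumerate(param_names)}

    def wrapper(*args, **kwargs):
        if len(args) > n:
            raise TypeError("too many positional arguments")
        slots = list(defaults)          # start from the declared defaults
        slots[:len(args)] = args        # fill positional arguments in order
        for name, value in kwargs.items():
            i = index.get(name)
            if i is None:
                raise TypeError(f"unexpected keyword argument {name!r}")
            if i < len(args):
                raise TypeError(f"duplicate value for {name!r}")
            slots[i] = value
        return func(*slots)             # the METH_N function sees exactly n args

    return wrapper
```

The returned wrapper always calls `func` with exactly `N` arguments, which is what the `METH_N` convention requires.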
For example, consider the builtin method `str.encode` with the signature `encode(self, /, encoding='utf-8', errors='strict')`. If we call `"hi".encode(errors="ignore")` there is a lot of parsing of arguments that has to be done for every call. With `METH_N` we can have argument clinic define this function, which is reasonably slick.
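As a hedged sketch (the real code would be C generated by argument clinic; this Python model and its names are assumptions), the `METH_N` form of `str.encode` might look like:

```python
def str_encode(args):
    # The caller guarantees exactly 3 correctly ordered slots; None stands
    # in for a C-level NULL meaning "argument not supplied".
    self, encoding, errors = args
    if encoding is None:
        encoding = "utf-8"
    if errors is None:
        errors = "strict"
    return self.encode(encoding, errors)
```

Note that the function body contains no general-purpose parsing; it only fills in defaults and does the work.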
Note that we must call `str_encode` with exactly 3 arguments, correctly parsed. This is fiddly in the vectorcall wrapper, but can be specialized nicely. Going back to our example of `"hi".encode(errors="ignore")`, the three arguments we should be passing are `"hi", NULL, "ignore"`.
We can parse the arguments at specializing time, creating a permutation array that can be evaluated quickly for each call.
One way that could work is:
Most calls don't do anything fancy, so we would probably special-case the "already in the right order" case.
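Here is a minimal Python sketch of the permutation-array idea; the function names and the use of `-1` as a "fill with NULL" marker are assumptions:

```python
def specialize_call(param_names, nargs, kwnames):
    # Specialization time: for each parameter slot, record where in the
    # caller's argument array its value comes from (-1 means "not supplied").
    index = {name: i for i, name in enumerate(param_names)}
    perm = [-1] * len(param_names)
    for i in range(nargs):              # positional arguments keep their slot
        perm[i] = i
    for j, name in enumerate(kwnames):  # keyword arguments go to their slot
        perm[index[name]] = nargs + j
    return perm

def run_call(perm, stack_args):
    # Call time: a cheap gather through the precomputed permutation.
    return [stack_args[i] if i >= 0 else None for i in perm]
```

For `"hi".encode(errors="ignore")` specialization sees one positional argument plus the keyword name `"errors"`, producing the permutation `[0, -1, 1]`; the identity permutation `[0, 1, ..., n-1]` is the "already in the right order" case that could be special-cased.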
Implementing this
This only really works if we can move most existing builtin functions to the new form.
To do that we need to make it:
Some open questions
There are a few details I've glossed over in the above discussion:

- The `_PyArg_Parser` struct.
- We could `NULL` terminate the array, at least for debug builds, but would that be sufficient?