gh-91924: Optimize unicode_check_encoding_errors() #93200
vstinner merged 1 commit into python:main from vstinner:unicode_check_encoding_errors
Conversation
Avoid _PyCodec_Lookup() and PyCodec_LookupError() for the most common built-in encodings and error handlers, to avoid creating a temporary Unicode string object, since these encodings and error handlers are known to be valid.
serhiy-storchaka left a comment:
It is slower than the corresponding checks in PyUnicode_Decode() and PyUnicode_AsEncodedString(). And it is always called in those functions, so you do double work. And it is called before _Py_normalize_encoding(), so you do triple work.
It seems like there is a misunderstanding here. My change is about the unicode_check_encoding_errors() function, which is always called by PyUnicode_AsEncodedString() when Python is built in debug mode. The purpose of this PR is to make a Python debug build "less slow". Microbenchmark on the following patch:

diff --git a/Modules/_testcapimodule.c b/Modules/_testcapimodule.c
index 3bc776140a..264e419d82 100644
--- a/Modules/_testcapimodule.c
+++ b/Modules/_testcapimodule.c
@@ -5832,6 +5832,33 @@ settrace_to_record(PyObject *self, PyObject *list)
Py_RETURN_NONE;
}
+static PyObject *
+bench_encode(PyObject *self, PyObject *loops_obj)
+{
+ Py_ssize_t loops = PyLong_AsSsize_t(loops_obj);
+ if (loops == -1 && PyErr_Occurred()) {
+ return NULL;
+ }
+
+ PyObject *str = PyUnicode_FromString("");
+ if (str == NULL) {
+ return NULL;
+ }
+
+ _PyTime_t t1 = _PyTime_GetPerfCounter();
+ for (Py_ssize_t i=0; i < loops; i++) {
+ PyObject *obj = PyUnicode_AsEncodedString(str, "utf-8", "strict");
+ Py_DECREF(obj);
+ }
+ _PyTime_t t2 = _PyTime_GetPerfCounter();
+
+ Py_DECREF(str);
+
+ double dt = _PyTime_AsSecondsDouble(t2 - t1);
+ return PyFloat_FromDouble(dt);
+
+}
+
static PyObject *negative_dictoffset(PyObject *, PyObject *);
static PyObject *test_buildvalue_issue38913(PyObject *, PyObject *);
static PyObject *getargs_s_hash_int(PyObject *, PyObject *, PyObject*);
@@ -6122,6 +6149,7 @@ static PyMethodDef TestMethods[] = {
{"get_feature_macros", get_feature_macros, METH_NOARGS, NULL},
{"test_code_api", test_code_api, METH_NOARGS, NULL},
{"settrace_to_record", settrace_to_record, METH_O, NULL},
+ {"bench_encode", bench_encode, METH_O, NULL},
{NULL, NULL} /* sentinel */
};
Script:

import pyperf
import _testcapi

runner = pyperf.Runner()
runner.bench_time_func('bench', _testcapi.bench_encode)

Result:
Extract of the PR: _PyCodec_Lookup() calls normalizestring() + PyUnicode_InternInPlace() + PyDict_GetItemWithError(). normalizestring() calls PyUnicode_FromString(): it decodes the encoding name from UTF-8 and allocates a memory block on the heap. That is cheap, but it has a significant impact on performance (see my benchmark) when we know in advance that the encoding name is valid.
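The fast path described above can be sketched roughly as follows. This is a simplified illustration, not the actual CPython source: the hypothetical is_common_encoding() helper stands in for the check added to unicode_check_encoding_errors(), which covers more encoding spellings and also validates the error handler name the same way before falling back to the full codec lookup.

```c
#include <assert.h>
#include <stdbool.h>
#include <string.h>

/* Hedged sketch of the PR's idea: accept the most common encoding
 * names with plain strcmp() so that _PyCodec_Lookup() -- which would
 * build a temporary Unicode string via normalizestring() -- is only
 * reached for uncommon names. */
static bool
is_common_encoding(const char *encoding)
{
    /* Only these exact lowercase spellings hit the fast path; any
     * other name would go through the full codec registry lookup. */
    return (strcmp(encoding, "utf-8") == 0
            || strcmp(encoding, "utf8") == 0
            || strcmp(encoding, "ascii") == 0
            || strcmp(encoding, "latin1") == 0
            || strcmp(encoding, "iso-8859-1") == 0);
}
```

Note that the fast path deliberately does not normalize the name: a caller passing "UTF-8" simply falls through to the slow path, so correctness is preserved.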