gh-91924: Optimize unicode_check_encoding_errors() #93200
vstinner merged 1 commit into python:main from vstinner:unicode_check_encoding_errors
Conversation
Avoid _PyCodec_Lookup() and PyCodec_LookupError() for the most common built-in encodings and error handlers, to avoid creating a temporary Unicode string object, since these encodings and error handlers are known to be valid.
serhiy-storchaka left a comment:
It is slower than the corresponding checks in PyUnicode_Decode() and PyUnicode_AsEncodedString(). And it is always called in those functions, so you do double work. And it is called before _Py_normalize_encoding(), so you do triple work.
It seems like there is a misunderstanding here. My change is about the unicode_check_encoding_errors() function, which is always called by PyUnicode_AsEncodedString() when Python is built in debug mode. The purpose of this PR is to make a Python debug build "less slow". Microbenchmark on the following patch:

diff --git a/Modules/_testcapimodule.c b/Modules/_testcapimodule.c
index 3bc776140a..264e419d82 100644
--- a/Modules/_testcapimodule.c
+++ b/Modules/_testcapimodule.c
@@ -5832,6 +5832,33 @@ settrace_to_record(PyObject *self, PyObject *list)
Py_RETURN_NONE;
}
+static PyObject *
+bench_encode(PyObject *self, PyObject *loops_obj)
+{
+ Py_ssize_t loops = PyLong_AsSsize_t(loops_obj);
+ if (loops == -1 && PyErr_Occurred()) {
+ return NULL;
+ }
+
+ PyObject *str = PyUnicode_FromString("");
+ if (str == NULL) {
+ return NULL;
+ }
+
+ _PyTime_t t1 = _PyTime_GetPerfCounter();
+ for (Py_ssize_t i=0; i < loops; i++) {
+ PyObject *obj = PyUnicode_AsEncodedString(str, "utf-8", "strict");
+ Py_DECREF(obj);
+ }
+ _PyTime_t t2 = _PyTime_GetPerfCounter();
+
+ Py_DECREF(str);
+
+ double dt = _PyTime_AsSecondsDouble(t2 - t1);
+ return PyFloat_FromDouble(dt);
+
+}
+
static PyObject *negative_dictoffset(PyObject *, PyObject *);
static PyObject *test_buildvalue_issue38913(PyObject *, PyObject *);
static PyObject *getargs_s_hash_int(PyObject *, PyObject *, PyObject*);
@@ -6122,6 +6149,7 @@ static PyMethodDef TestMethods[] = {
{"get_feature_macros", get_feature_macros, METH_NOARGS, NULL},
{"test_code_api", test_code_api, METH_NOARGS, NULL},
{"settrace_to_record", settrace_to_record, METH_O, NULL},
+ {"bench_encode", bench_encode, METH_O, NULL},
{NULL, NULL} /* sentinel */
};
Script:

import pyperf
import _testcapi

runner = pyperf.Runner()
runner.bench_time_func('bench', _testcapi.bench_encode)

Result:
Extract of the PR: _PyCodec_Lookup() calls normalizestring() + PyUnicode_InternInPlace() + PyDict_GetItemWithError(). normalizestring() calls PyUnicode_FromString(): it decodes the encoding name from UTF-8 and allocates a memory block on the heap. That is cheap, but it has a significant impact on performance (see my benchmark) when we know in advance that the encoding name is valid.
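The fast path described above can be sketched roughly as follows. This is a simplified illustration, not the actual CPython source: the hypothetical is_common_encoding() helper stands in for the check added to unicode_check_encoding_errors(), which covers more encoding spellings and also validates the error handler name the same way before falling back to the full codec lookup.

```c
#include <assert.h>
#include <stdbool.h>
#include <string.h>

/* Hedged sketch of the PR's idea: accept the most common encoding
 * names with plain strcmp() so that _PyCodec_Lookup() -- which would
 * build a temporary Unicode string via normalizestring() -- is only
 * reached for uncommon names. */
static bool
is_common_encoding(const char *encoding)
{
    /* Only these exact lowercase spellings hit the fast path; any
     * other name would go through the full codec registry lookup. */
    return (strcmp(encoding, "utf-8") == 0
            || strcmp(encoding, "utf8") == 0
            || strcmp(encoding, "ascii") == 0
            || strcmp(encoding, "latin1") == 0
            || strcmp(encoding, "iso-8859-1") == 0);
}
```

Note that the fast path deliberately does not normalize the name: a caller passing "UTF-8" simply falls through to the slow path, so correctness is preserved.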