gh-126004: fix positions handling in codecs.xmlcharrefreplace_errors #127675


Merged

Conversation

Member

@picnixz picnixz commented Dec 6, 2024

@picnixz picnixz requested a review from encukou January 3, 2025 13:42
/* object is guaranteed to be "ready" */
Py_UCS4 ch = PyUnicode_READ_CHAR(obj, i);
if (ch < 10) {
ressize += 2 + 1 + 1;
Member


An alternative to end = start + PY_SSIZE_T_MAX / (2 + 7 + 1); would be to set digits = 1; and, after the series of ifs, check that ressize += (2 + digits + 1); doesn't overflow. If it does overflow, return a MemoryError. Something like:

if (ressize > PY_SSIZE_T_MAX - (2 + digits + 1)) {
    return PyErr_NoMemory();
}

Member Author

@picnixz picnixz Jan 22, 2025


I think that's what PyCodec_NameReplaceErrors does, but I don't know how performance would be affected. Hitting PY_SSIZE_T_MAX / (2 + 7 + 1) means that we're handling something that is quite large, so doing the check on ressize at each loop iteration might slow down the handler a bit.

Now, it took me a while to convince myself that it wouldn't slow down the handlers by much. Namely, I don't think it would dramatically slow them down: if a character is > 10^3, we already perform the < 10, < 100, < 1000 and < 10000 checks (this last one is needed to know that it's > 10^3 but not > 10^4). So instead of 4 checks we have 5, which is not that costly.

Why can we assume that we will have at least 2 checks and not less? Well... everything < 100 is actually ASCII, and unless someone is using a special codec for which those characters are not supported, or artificially creates exceptions that report their start/end positions incorrectly, we're likely to have at least 2 checks in the loop (namely < 10 and < 100) because the bad characters are likely to be something outside the ASCII range.

Member Author


This would look like:

    Py_ssize_t ressize = 0;
    for (Py_ssize_t i = start; i < end; ++i) {
        // The number of characters that each character 'ch' contributes
        // in the result is 2 + k + 1, where k = min{t >= 1 | 10^t > ch}
        // and will be formatted as "&#" + DIGITS + ";". Since the Unicode
        // range is below 10^7, each "block" requires at most 2 + 7 + 1
        // characters.
        int replsize;
        /* object is guaranteed to be "ready" */
        Py_UCS4 ch = PyUnicode_READ_CHAR(obj, i);
        if (ch < 10) {
            replsize = 2 + 1 + 1;
        }
        else if (ch < 100) {
            replsize = 2 + 2 + 1;
        }
        else if (ch < 1000) {
            replsize = 2 + 3 + 1;
        }
        else if (ch < 10000) {
            replsize = 2 + 4 + 1;
        }
        else if (ch < 100000) {
            replsize = 2 + 5 + 1;
        }
        else if (ch < 1000000) {
            replsize = 2 + 6 + 1;
        }
        else {
            assert(ch < 10000000);
            replsize = 2 + 7 + 1;
        }
        if (ressize > PY_SSIZE_T_MAX - replsize) {
            Py_DECREF(obj);
            return PyErr_NoMemory();
        }
        ressize += replsize;
    }

But I'm not very fond of this. I think it's still nicer to have the check outside the loop.

Member Author

@picnixz picnixz Jan 22, 2025


Can we decide in a follow-up PR whether those special checks need to be kept? (We have similar checks in PyCodec_NameReplaceErrors and PyCodec_BackslashReplaceErrors.) For the 'namereplace' handler, we actually just break:

        for (i = start, ressize = 0; i < end; ++i) {
            /* object is guaranteed to be "ready" */
            c = PyUnicode_READ_CHAR(object, i);
            if (ucnhash_capi->getname(c, buffer, sizeof(buffer), 1)) {
                replsize = 1+1+1+(int)strlen(buffer)+1;
            }
            else if (c >= 0x10000) {
                replsize = 1+1+8;
            }
            else if (c >= 0x100) {
                replsize = 1+1+4;
            }
            else
                replsize = 1+1+2;
            if (ressize > PY_SSIZE_T_MAX - replsize)
                break;
            ressize += replsize;
        }

and just don't care anymore :') (the reason is that we cannot determine in advance how much space it would take unless we call getname first...)

Member

@vstinner vstinner left a comment


LGTM.

I'm fine with keeping the code as it is, or changing how the PY_SSIZE_T_MAX limit is handled in a separate PR.

@vstinner
Member

Strange, the Lint job reports an unrelated issue:

diff --git a/Lib/test/test_asyncio/test_subprocess.py b/Lib/test/test_asyncio/test_subprocess.py
index a053db8..341e3e9 100644
--- a/Lib/test/test_asyncio/test_subprocess.py
+++ b/Lib/test/test_asyncio/test_subprocess.py
@@ -905,7 +905,7 @@ def setUp(self):
             # Force the use of the threaded child watcher
             unix_events.can_use_pidfd = mock.Mock(return_value=False)
             super().setUp()
-        
+
         def tearDown(self):
             unix_events.can_use_pidfd = self._original_can_use_pidfd
             return super().tearDown()

@picnixz
Member Author

picnixz commented Jan 22, 2025

Strange, the Lint job reports an unrelated issue:

It might be because I merged main and main itself is failing?

@picnixz
Member Author

picnixz commented Jan 22, 2025

I'll merge this one and the other PR you approved tomorrow (I've reverted the bad commit).

@picnixz picnixz merged commit 70dcc84 into python:main Jan 23, 2025
41 checks passed
@picnixz picnixz deleted the fix/codecs/xmlcharrefreplace-errors-126004 branch January 23, 2025 10:42