gh-126004: fix positions handling in codecs.xmlcharrefreplace_errors
#127675
Conversation
    /* object is guaranteed to be "ready" */
    Py_UCS4 ch = PyUnicode_READ_CHAR(obj, i);
    if (ch < 10) {
        ressize += 2 + 1 + 1;
An alternative to `end = start + PY_SSIZE_T_MAX / (2 + 7 + 1);` would be to set `digits = 1;` and, after the series of ifs, check that `ressize += (2 + digits + 1);` doesn't overflow. If it does overflow, return a MemoryError. Something like:

    if (ressize > PY_SSIZE_T_MAX - (2 + digits + 1)) {
        return PyErr_NoMemory();
    }
I think that's what `PyCodec_NameReplaceErrors` does, but I don't know how performance would be affected. Hitting `PY_SSIZE_T_MAX / (2 + 7 + 1)` means that we're handling something quite large, so doing the check on `ressize` at each loop iteration might slow down the handler a bit.

Now, it took me a while to convince myself that it won't slow down the handler by much. If our characters are > 10^3, we already do the `< 10`, `< 100`, `< 1000` and `< 10000` checks (this last one is needed to know that the value is > 10^3 but not > 10^4). So instead of 4 checks we have 5, which is not that annoying.

Why can we assume that we will have at least 2 checks, and not fewer? Everything < 100 is ASCII, so unless someone is using a special codec for which those characters are not supported, or artificially creates exceptions that report their start/end positions incorrectly, the bad characters are likely to be outside the ASCII range, and we get at least 2 checks in the loop (namely `< 10` and `< 100`).
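For illustration, here is a small standalone sketch (a hypothetical helper, not code from this PR) that mirrors the if-chain above and counts how many digit comparisons are evaluated per character:

    #include <stdint.h>
    #include <stdio.h>

    /* Count how many 'ch < 10^k' comparisons the if-chain evaluates for
       a given code point: the chain tests ch < 10, ch < 100, ... in
       order and stops at the first success; the 7-digit case reaches the
       final 'else' after six failed tests. */
    static int
    digit_checks(uint32_t ch)
    {
        int checks = 1;  /* 'ch < 10' is always evaluated */
        for (uint32_t bound = 10; checks < 6 && ch >= bound; bound *= 10) {
            checks++;    /* each failed test falls through to the next one */
        }
        return checks;
    }

    int
    main(void)
    {
        printf("U+00E9:  %d checks\n", digit_checks(0xE9));    /* 3, or 4 with the overflow test */
        printf("U+1F600: %d checks\n", digit_checks(0x1F600)); /* 6, or 7 with the overflow test */
        return 0;
    }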
This would look like:

    Py_ssize_t ressize = 0;
    for (Py_ssize_t i = start; i < end; ++i) {
        // The number of characters that a character 'ch' contributes to
        // the result is 2 + k + 1, where k = min{t >= 1 | 10^t > ch},
        // since it is formatted as "&#" + DIGITS + ";".  For example,
        // U+2603 becomes "&#9731;", i.e. 2 + 4 + 1 = 7 characters.
        // The Unicode range is below 10^7, so each replacement takes at
        // most 2 + 7 + 1 characters.
        int replsize;
        /* object is guaranteed to be "ready" */
        Py_UCS4 ch = PyUnicode_READ_CHAR(obj, i);
        if (ch < 10) {
            replsize = 2 + 1 + 1;
        }
        else if (ch < 100) {
            replsize = 2 + 2 + 1;
        }
        else if (ch < 1000) {
            replsize = 2 + 3 + 1;
        }
        else if (ch < 10000) {
            replsize = 2 + 4 + 1;
        }
        else if (ch < 100000) {
            replsize = 2 + 5 + 1;
        }
        else if (ch < 1000000) {
            replsize = 2 + 6 + 1;
        }
        else {
            assert(ch < 10000000);
            replsize = 2 + 7 + 1;
        }
        if (ressize > PY_SSIZE_T_MAX - replsize) {
            Py_DECREF(obj);
            return PyErr_NoMemory();
        }
        ressize += replsize;
    }
But I'm not very fond of this. I think it's still nicer to have the check outside the loop.
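For comparison, the check-outside-the-loop variant clamps the range before entering the loop. A minimal sketch, assuming the same surrounding variables (the guard condition is an assumption; only the clamp expression itself comes from the discussion above):

    /* Each replacement "&#" + DIGITS + ";" takes at most 2 + 7 + 1
       characters, so clamping 'end' up front guarantees that 'ressize'
       cannot overflow inside the loop. */
    if (end - start > PY_SSIZE_T_MAX / (2 + 7 + 1)) {
        end = start + PY_SSIZE_T_MAX / (2 + 7 + 1);
    }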
Can we decide in a follow-up PR whether those special checks need to be kept? (For instance, we have similar checks in `PyCodec_NameReplaceErrors` and `PyCodec_BackslashReplaceErrors`.) For the 'namereplace' handler, we actually just break out of the loop:

    for (i = start, ressize = 0; i < end; ++i) {
        /* object is guaranteed to be "ready" */
        c = PyUnicode_READ_CHAR(object, i);
        if (ucnhash_capi->getname(c, buffer, sizeof(buffer), 1)) {
            replsize = 1+1+1+(int)strlen(buffer)+1;
        }
        else if (c >= 0x10000) {
            replsize = 1+1+8;
        }
        else if (c >= 0x100) {
            replsize = 1+1+4;
        }
        else
            replsize = 1+1+2;
        if (ressize > PY_SSIZE_T_MAX - replsize)
            break;
        ressize += replsize;
    }

and just don't care anymore :') (the reason is that we cannot determine in advance how much space the replacement takes unless we call `getname` first...)
LGTM.
I'm fine with keeping the code as it is, or changing how the PY_SSIZE_T_MAX limit is handled in a separate PR.
Strange, the Lint job reports an unrelated issue:

    diff --git a/Lib/test/test_asyncio/test_subprocess.py b/Lib/test/test_asyncio/test_subprocess.py
    index a053db8..341e3e9 100644
    --- a/Lib/test/test_asyncio/test_subprocess.py
    +++ b/Lib/test/test_asyncio/test_subprocess.py
    @@ -905,7 +905,7 @@ def setUp(self):
             # Force the use of the threaded child watcher
             unix_events.can_use_pidfd = mock.Mock(return_value=False)
             super().setUp()
    -
    +
         def tearDown(self):
             unix_events.can_use_pidfd = self._original_can_use_pidfd
             return super().tearDown()

It might be because I merged main and main itself is failing?
I'll merge this one and the other PR you approved tomorrow (I've reverted the bad commit).
`start` and `end` values in `codecs` error handlers #126004