gh-126004: fix positions handling in codecs.xmlcharrefreplace_errors
#127675
Merged
picnixz
merged 13 commits into
python:main
from
picnixz:fix/codecs/xmlcharrefreplace-errors-126004
Jan 23, 2025
8c4d75f
fix `codecs.xmlcharrefreplace_errors` handler
picnixz 4bee635
blurb
picnixz 3820eca
cosmetic changes
picnixz e178ddf
address Petr's review
picnixz 442a938
Merge remote-tracking branch 'upstream/main' into fix/codecs/xmlcharr…
picnixz e3cec4f
use internal `_PyUnicodeError_GetParams` helper
picnixz 5261772
Merge branch 'main' into fix/codecs/xmlcharrefreplace-errors-126004
picnixz e12f17f
Merge branch 'main' into fix/codecs/xmlcharrefreplace-errors-126004
picnixz b1ba109
update usages of `_PyUnicodeError_GetParams`
picnixz debf20f
amend some cosmetic changes to be consistent
picnixz 2ac9464
fix bounds
picnixz 5046358
add assertion per Victor's suggestion
picnixz 9df878e
Merge branch 'main' into fix/codecs/xmlcharrefreplace-errors-126004
picnixz
3 changes: 3 additions & 0 deletions
Misc/NEWS.d/next/Core_and_Builtins/2024-12-06-11-17-46.gh-issue-126004.-p8MAS.rst
@@ -0,0 +1,3 @@
Fix handling of :attr:`UnicodeError.start` and :attr:`UnicodeError.end`
values in the :func:`codecs.xmlcharrefreplace_errors` error handler.
Patch by Bénédikt Tran.
An alternative to `end = start + PY_SSIZE_T_MAX / (2 + 7 + 1);` would be to set `digits = 1;` and, after the series of ifs, check that `ressize += (2 + digits + 1);` doesn't overflow. If it does overflow, return MemoryError.
I think that's what `PyCodec_NameReplaceErrors` does, but I don't know how performance would be affected. Hitting `PY_SSIZE_T_MAX / (2 + 7 + 1)` means that we're handling something quite large, so doing the check on `ressize` at each loop iteration might slow down the handler a bit.

Now, it took me a while to convince myself that it won't slow down the handler by much. If our characters are > 10^3, we already do the `< 10`, `< 100`, `< 1000` and `< 10000` checks (this last one is needed to know that it's > 10^3 but not > 10^4). So instead of 4 checks we have 5, which is not that annoying.

Why can we assume that we will have at least 2 checks and not fewer? Well... everything < 100 is actually ASCII, so unless someone is using a special codec for which those characters are not supported, or an artificially created exception indicates its start/end positions incorrectly, we're likely to have at least 2 checks in the loop (namely `< 10` and `< 100`) because the bad characters are likely to be something outside the ASCII range.
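To make the comparison chain concrete, here is a minimal standalone sketch of the digit counting and of the "clamp outside the loop" approach. It is not the code from `Python/codecs.c`: the names `decimal_digits` and `xmlcharref_size` are hypothetical, and `size_t`/`SIZE_MAX` stand in for `Py_ssize_t`/`PY_SSIZE_T_MAX` so that it compiles without `Python.h`.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: number of decimal digits needed to print a code
 * point. The largest code point, 0x10FFFF, is 1114111 (7 digits). */
int
decimal_digits(uint32_t ch)
{
    assert(ch <= 0x10FFFF);
    if (ch < 10) return 1;
    else if (ch < 100) return 2;
    else if (ch < 1000) return 3;
    else if (ch < 10000) return 4;
    else if (ch < 100000) return 5;
    else if (ch < 1000000) return 6;
    else return 7;
}

/* Hypothetical helper: total size of the "&#NNN;" replacements for
 * chars[start:end], assuming start <= end. The range is clamped once,
 * before the loop, so the sum cannot overflow and no per-iteration
 * check is needed. SIZE_MAX is a stand-in for PY_SSIZE_T_MAX. */
size_t
xmlcharref_size(const uint32_t *chars, size_t start, size_t end)
{
    const size_t max_per_char = 2 + 7 + 1;  /* "&#" + 7 digits + ";" */
    if (end - start > SIZE_MAX / max_per_char) {
        end = start + SIZE_MAX / max_per_char;
    }
    size_t ressize = 0;
    for (size_t i = start; i < end; i++) {
        ressize += 2 + (size_t)decimal_digits(chars[i]) + 1;
    }
    return ressize;
}
```

For a non-ASCII character such as U+00E9 (233), the chain falls through `< 10` and `< 100` before matching `< 1000`, which is the "at least 2 checks" case described above.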
This would look like:
But I'm not very fond of this. I think it's still nicer to have the check outside the loop.
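For reference, the in-loop variant discussed in this thread could be sketched roughly as follows. This is an illustration under stated assumptions, not the snippet from the review: `add_replacement_size` and its `limit` parameter are hypothetical, with `limit` playing the role of `PY_SSIZE_T_MAX`, and the `-1` return stands in for raising `MemoryError`.

```c
#include <stddef.h>

/* Hypothetical helper: add the size of one "&#NNN;" replacement to
 * *ressize, refusing to go past `limit`. Returns 0 on success and -1
 * if the addition would exceed the limit (where the real handler
 * would report a MemoryError). Assumes limit >= 2 + digits + 1. */
int
add_replacement_size(size_t *ressize, int digits, size_t limit)
{
    size_t needed = 2 + (size_t)digits + 1;  /* "&#" + digits + ";" */
    if (*ressize > limit - needed) {
        return -1;  /* would overflow: leave *ressize untouched */
    }
    *ressize += needed;
    return 0;
}
```

Calling this once per character adds one comparison to each iteration, which is the extra check weighed against the single clamp before the loop.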
Can we decide in a follow-up PR whether those special checks need to be kept? (For instance, we have similar checks for `PyCodec_NameReplaceErrors` and `PyCodec_BackslashReplaceErrors`.) For the 'namereplace' handler, we actually simply break and just don't care anymore :') (the reason is that we cannot determine in advance how much space it would take unless we call `getname` beforehand...)