gh-140875: Fix handling of unclosed charrefs before EOF in HTMLParser#140904
Conversation
serhiy-storchaka
left a comment
There was a problem hiding this comment.
I have added many tests to check consistency between unclosed character references followed by EOF and other character, with both convert_charrefs modes.
| # Maybe HTMLParser should use self.unescape for these | ||
| data = [ | ||
| ('a&', [('data', 'a&')]), | ||
| ('a&b', [('data', 'ab')]), |
There was a problem hiding this comment.
This is the reported bug. It was in the tests!
| self._run_check('>', [('entityref', 'gt')], convert_charrefs=False) | ||
| self._run_check('>', [('data', '>')], convert_charrefs=True) | ||
|
|
||
| self._run_check('&g', [('entityref', 'g')], convert_charrefs=False) |
There was a problem hiding this comment.
Ampersand was only swallowed in the case of 1-character name before EOF.
| self._run_check('& z', [('data', '& z')], convert_charrefs=True) | ||
|
|
||
| def test_eof_in_entityref(self): | ||
| self._run_check('>', [('entityref', 'gt')], convert_charrefs=False) |
There was a problem hiding this comment.
It was data before this change.
c5ce0aa to
fe8bcb2
Compare
| elif i + 3 < n: # larger than "&#x" | ||
| # not the end of the buffer, and can't be confused | ||
| # with some other construct | ||
| self.handle_data("&#") |
There was a problem hiding this comment.
What's the reason for emitting &# as data now, instead of emitting it later with the rest of the data? IOW, in the case of '&x y', this will emit handle_data('&#') + handle_data(' y') instead of a single handle_data('&# y').
Having multiple handle_data is not wrong per se, but if we remove the elif block and let the else break, we can simplify the code and emit a single handle_data.
The same might apply below, where a single & is emitted. Also note that for ' z &x y', the first handle_data gets called with only 'z ', even if followed by one or more additional handle_data.
There was a problem hiding this comment.
Try &#x <. With this PR it emits handle_data('&#'), handle_data('x '), handle_entityref('lt') which is not optimal, but correct. If remove this elif, it will emit handle_data('&#x <'), which is incorrect.
|
Thanks @serhiy-storchaka for the PR 🌮🎉.. I'm working now to backport this PR to: 3.13, 3.14. |
…Parser (pythonGH-140904) (cherry picked from commit 95296a9) Co-authored-by: Serhiy Storchaka <storchaka@gmail.com>
|
GH-141745 is a backport of this pull request to the 3.14 branch. |
…Parser (pythonGH-140904) (cherry picked from commit 95296a9) Co-authored-by: Serhiy Storchaka <storchaka@gmail.com>
|
GH-141746 is a backport of this pull request to the 3.13 branch. |
html.parser(convert_charrefs=False)silently drops ampersand (&) on invalid named entities, causing data loss #140875