gh-118350: Add escapable-raw-text mode to html parser#121770
gh-118350: Add escapable-raw-text mode to html parser#121770timonviola wants to merge 10 commits intopython:mainfrom
Conversation
serhiy-storchaka
left a comment
There was a problem hiding this comment.
What is the difference between processing raw text elements and escapable raw text elements? I do not see any this code.
Lib/test/test_htmlparser.py
Outdated
| ("data", content), | ||
| ("endtag", element_lower)]) | ||
|
|
||
| def test_escapable_raw_text_with_closing_tags(self): |
There was a problem hiding this comment.
Is it right? The test name is test_escapable_raw_text_with_closing_tags, but it tests the script element. It looks very similar to test_cdata_with_closing_tags.
Lib/test/test_htmlparser.py
Outdated
| '<!-- not a comment --> ¬-an-entity-ref;', | ||
| "<not a='start tag'>", | ||
| '<a href="" /> <p> <span></span>', | ||
| 'foo = "</scr" + "ipt>";', |
There was a problem hiding this comment.
Why test this in the title and textarea elements?
There was a problem hiding this comment.
Add also examples of valid character references and an ambiguous ampersand.
|
@ezio-melotti @serhiy-storchaka can you help with the review? |
Lib/test/test_htmlparser.py
Outdated
| ('starttag', 'title', []), ('data', text), | ||
| ('endtag', 'title'), ('data', '"'), | ||
| ('starttag', 'textarea', []), ('data', text), | ||
| ('endtag', 'textarea'), ('data', '"')] |
There was a problem hiding this comment.
This is not correct. Charrefs should be resolved in escapable raw text elements. Data should be '"X"X"' instead of text. Except for an ambiguous ampersand.
Lib/test/test_htmlparser.py
Outdated
| ("endtag", element_lower)], | ||
| collector=Collector(convert_charrefs=False)) | ||
|
|
||
| def test_escapable_raw_text_content(self): |
There was a problem hiding this comment.
How does this test differ from test_cdata_content? BTW, most examples use JavaScript syntax, and only relevant for <script>.
Lib/html/parser.py
Outdated
|
|
||
| starttagopen = re.compile('<[a-zA-Z]') | ||
| piclose = re.compile('>') | ||
| escapable_raw_text_close = re.compile('</(title|textarea)>', re.I) |
Lib/html/parser.py
Outdated
| j = n | ||
| if i < j: | ||
| if self.convert_charrefs and not self.cdata_elem: | ||
| if self.convert_charrefs and not self.cdata_elem and not self.escapable_raw_text_elem: |
There was a problem hiding this comment.
This is incorrect. Charrefs should be resolved in an escapable raw text element. Except an ambiguous ampersand.
We need also tests for convert_charrefs=False in an escapable raw text element.
Lib/html/parser.py
Outdated
| """Return full source of start tag: '<...>'.""" | ||
| return self.__starttag_text | ||
|
|
||
| def set_escapable_raw_text_mode(self, elem): |
There was a problem hiding this comment.
Since the behavior for raw text elements and escapable raw text elements is so similar, and they cannot be nested, why not use set_cdata_mode() and cdata_elem for both? Just add an optional boolean parameter to specify whether it is escapable (charrefs should be unescaped) or not.
| ('entityref', 'amp'), | ||
| ('data', ' Pumba') | ||
| ], | ||
| collector=Collector(convert_charrefs=False), |
There was a problem hiding this comment.
Did you mean this test? @serhiy-storchaka
|
It seems that you accidentally removed all changes to the parser. |
@serhiy-storchaka I made tried to revert my changes so I can rebase on top of the latest changes. I created a new PR, I hope that will help you with the review: #135310. Thanks for the comments/input so far @serhiy-storchaka ! |
escapable raw text elements are not handled in the current HTMLParser implementation.
This PR extends the existing parser with an additional mode to handle this correctly.