gh-137836: Support more RAWTEXT and PLAINTEXT elements in HTMLParser #137837
…arser * the "plaintext" element * the RAWTEXT elements "xmp", "iframe", "noembed" and "noframes" * optionally RAWTEXT (if scripting=True) element "noscript"
bb7b873 to
2153a4c
Compare
Doc/library/html.parser.rst
Outdated
    Create a parser instance able to parse invalid markup.

    If *convert_charrefs* is ``True`` (the default), all character
    references (except the ones in ``script``/``style`` elements) are
This should be updated now that the list has been expanded.
It might be easier to have a short section about parsing modes, listing each mode, which elements trigger it, whether charrefs are converted or not, and when the state is terminated.
Here we could then say
    - references (except the ones in ``script``/``style`` elements) are
    + references (except the ones in RAWTEXT tags) are
with RAWTEXT linking to that section.
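To make the documented sentence concrete, here is a minimal sketch of the long-standing behaviour for ``script`` content versus ordinary text when *convert_charrefs* is left at its default. The `TextByTag` class and its `text` attribute are hypothetical helpers written for this illustration, not part of `html.parser`. Text is accumulated per enclosing tag so the result does not depend on how the parser chunks its `handle_data` calls.

```python
from html.parser import HTMLParser


class TextByTag(HTMLParser):
    """Hypothetical helper: accumulates text under the innermost open tag."""

    def __init__(self):
        super().__init__()  # convert_charrefs defaults to True
        self.stack = []     # currently open tags, innermost last
        self.text = {}      # tag name -> concatenated text seen inside it

    def handle_starttag(self, tag, attrs):
        self.stack.append(tag)

    def handle_endtag(self, tag):
        # Only pop when the end tag matches the innermost open tag.
        if self.stack and self.stack[-1] == tag:
            self.stack.pop()

    def handle_data(self, data):
        key = self.stack[-1] if self.stack else ""
        self.text[key] = self.text.get(key, "") + data


parser = TextByTag()
parser.feed("<p>a &amp; b</p><script>x &amp; y</script>")
parser.close()

# The reference in <p> is converted; the one in <script> is passed through as-is.
print(parser.text)
```

Whether the other elements listed in this PR behave like ``script`` here depends on running a Python build that includes the change, so the sketch sticks to ``script``.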
Do we need to document this here? This is a part of the HTML5 specification. What will the user get from this information?
Lib/html/parser.py
Outdated
        self.set_cdata_mode(tag)
    elif tag == "plaintext":
        self.set_cdata_mode(tag)
        self.interesting = re.compile(r'\z')
I think it would be better to move this in set_cdata_mode by adding a third branch to the if/else that sets self.interesting.
I considered this option. But then we would either have to repeat the condition tag == "plaintext" in two places, or add "plaintext" to CDATA_CONTENT_ELEMENTS or RCDATA_CONTENT_ELEMENTS. Either way, "plaintext" would have to appear twice. It could also create an asymmetry with "noscript" if special cases were handled in different places. That is how I arrived at the current code.
Another option is to use a special value, escapable=None, to switch to PLAINTEXT mode.
|
This PR seems to address 3 issues:
The difference between states is the following:
|
Co-authored-by: Ezio Melotti <ezio.melotti@gmail.com>
Lib/html/parser.py
Outdated
    if tag in self.CDATA_CONTENT_ELEMENTS:
        self.set_cdata_mode(tag)
    elif tag in self.RCDATA_CONTENT_ELEMENTS:
        self.set_cdata_mode(tag, escapable=True)
    elif self.scripting and tag == "noscript":
        self.set_cdata_mode(tag)
    elif tag == "plaintext":
        self.set_cdata_mode(tag, escapable=None)
I don't much like (ab)using escapable=None for PLAINTEXT mode.
Currently the set_cdata_mode function does two things:
- determines where the closing tag/end is, which depends on the tag value passed;
- determines whether charrefs are converted, which depends on the value passed to escapable.
Even though there is some duplication, I would prefer something like this:
    if (tag in self.CDATA_CONTENT_ELEMENTS or
        (self.scripting and tag == "noscript") or
        tag == "plaintext"):
        self.set_cdata_mode(tag, escapable=False)
    elif tag in self.RCDATA_CONTENT_ELEMENTS:
        self.set_cdata_mode(tag, escapable=True)

This makes clear that all these cases are handled by set_cdata_mode, with the former ignoring charrefs and the latter converting them.
Then in set_cdata_mode we can set self.interesting based on the values of the args passed. This will also make it clearer what is considered interesting for each tag.
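As a rough illustration of that idea, the following standalone sketch picks the "interesting" pattern from the two arguments alone. The function name `choose_interesting` and the exact regular expressions are assumptions made for this example. They are not the code merged in this PR, and the sketch uses `\Z` rather than the `\z` seen in the diff so that it also runs on older Python versions.

```python
import re


def choose_interesting(tag, escapable=False):
    """Hypothetical helper: return the pattern that ends a raw-text run.

    - "plaintext": nothing ends the run, so match only at end of input.
    - escapable (RCDATA-like): stop at "&" or at the matching end tag.
    - otherwise (RAWTEXT-like): stop only at the matching end tag.
    """
    tag = tag.lower()
    if tag == "plaintext":
        return re.compile(r"\Z")
    # The end tag name must be followed by whitespace, "/" or ">".
    end_tag = r"</%s(?=[\t\n\r\f />])" % re.escape(tag)
    if escapable:
        return re.compile(r"&|" + end_tag, re.IGNORECASE)
    return re.compile(end_tag, re.IGNORECASE)


raw = "a &amp; </b> </xmp>"
rcdata = "a &amp; b</title>"
plain = "abc</plaintext>"

# RAWTEXT-like: "&" and unrelated end tags are skipped.
raw_stop = choose_interesting("xmp").search(raw).start()
# RCDATA-like: the parser also has to stop at "&" to handle the reference.
rcdata_stop = choose_interesting("title", escapable=True).search(rcdata).start()
# PLAINTEXT: even "</plaintext>" does not end the run.
plain_stop = choose_interesting("plaintext").search(plain).start()

print(raw_stop, rcdata_stop, plain_stop)
```

Keeping the choice in one place like this means the caller only decides between escapable=True and escapable=False, which is the split proposed in the comment above.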
Co-authored-by: Ezio Melotti <ezio.melotti@gmail.com>
…hon into htmlparser-rawtext
|
Thank you for your review @ezio-melotti. |
|
Thanks @serhiy-storchaka for the PR 🌮🎉.. I'm working now to backport this PR to: 3.13, 3.14. |
…arser (pythonGH-137837) * the "plaintext" element * the RAWTEXT elements "xmp", "iframe", "noembed" and "noframes" * optionally RAWTEXT (if scripting=True) element "noscript" (cherry picked from commit a17c57e) Co-authored-by: Serhiy Storchaka <storchaka@gmail.com>
|
GH-140841 is a backport of this pull request to the 3.14 branch. |
|
GH-140842 is a backport of this pull request to the 3.13 branch. |
|
Backporting to older Python versions should be from 3.13. |
…n HTMLParser (pythonGH-137837) (pythonGH-140842) * the "plaintext" element * the RAWTEXT elements "xmp", "iframe", "noembed" and "noframes" * optionally RAWTEXT (if scripting=True) element "noscript" (cherry picked from commit a17c57e) (cherry picked from commit 0329bd1) Co-authored-by: Miss Islington (bot) <31488909+miss-islington@users.noreply.github.com> Co-authored-by: Serhiy Storchaka <storchaka@gmail.com>
📚 Documentation preview 📚: https://cpython-previews--137837.org.readthedocs.build/