@serhiy-storchaka
Member

@serhiy-storchaka serhiy-storchaka commented Aug 15, 2025

  • the "plaintext" element
  • the RAWTEXT elements "xmp", "iframe", "noembed" and "noframes"
  • optionally RAWTEXT (if scripting=True) element "noscript"

📚 Documentation preview 📚: https://cpython-previews--137837.org.readthedocs.build/

…arser

* the "plaintext" element
* the RAWTEXT elements "xmp", "iframe", "noembed" and "noframes"
* optionally RAWTEXT (if scripting=True) element "noscript"
Create a parser instance able to parse invalid markup.

If *convert_charrefs* is ``True`` (the default), all character
references (except the ones in ``script``/``style`` elements) are
Member

This should be updated now that the list has been expanded.

It might be easier to have a short section about parsing modes, listing each mode, which elements trigger it, whether charrefs are converted or not, and when the state is terminated.

Here we could then say

Suggested change
references (except the ones in ``script``/``style`` elements) are
references (except the ones in RAWTEXT tags) are

with RAWTEXT linking to that section.

Member Author

Do we need to document this here? This is a part of the HTML5 specification. What will the user get from this information?

    self.set_cdata_mode(tag)
elif tag == "plaintext":
    self.set_cdata_mode(tag)
    self.interesting = re.compile(r'\z')
Member

I think it would be better to move this into set_cdata_mode by adding a third branch to the if/else that sets self.interesting.

Member Author

I considered this option. But should we repeat the condition tag == "plaintext" in two places, or add "plaintext" to CDATA_CONTENT_ELEMENTS or RCDATA_CONTENT_ELEMENTS? In either case we would need to repeat "plaintext" twice. It would also create an asymmetry with "noscript" if the special cases were handled in different places. So I arrived at the current code.

Another option is to use the special value escapable=None to switch to the PLAINTEXT mode.

@ezio-melotti
Member

This PR seems to address 3 issues:

  1. It adds a scripting arg that determines how <noscript>...</noscript> is handled. If the browser has JS enabled, the content of <noscript> is not parsed; if it's disabled, the content is parsed and displayed as a fallback. The former is emulated by passing scripting=True (which adds noscript to the list of RAWTEXT elements), the latter by passing scripting=False.
  2. It adds 4 more elements that are (unconditionally) handled as RAWTEXT (in addition to the existing 2, script and style): xmp, iframe, noembed, and noframes.
  3. It adds support for PLAINTEXT state, triggered by the plaintext element.
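
For reference, the RAWTEXT handling that item 2 describes as already existing for script can be observed with the stock html.parser module. This is only a minimal sketch: the Collector class and its attribute names are made up for illustration, and whether the newly listed elements behave the same way depends on whether the interpreter in use includes this change.

```python
from html.parser import HTMLParser


class Collector(HTMLParser):
    """Hypothetical helper that records start tags and text as reported."""

    def __init__(self):
        super().__init__()  # convert_charrefs defaults to True
        self.start_tags = []
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        self.start_tags.append(tag)

    def handle_data(self, data):
        self.chunks.append(data)


parser = Collector()
parser.feed('<script>if (a<b) alert("&amp;");</script><p>&amp;</p>')
parser.close()

# Inside <script>, "<b" is not reported as a tag and "&amp;" is left alone;
# inside <p>, the character reference is converted.
print(parser.start_tags)
print("".join(parser.chunks))
```

Only script and p show up as start tags, and the script body comes back unchanged while the text inside p is unescaped.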

The difference between states is the following:

  • Data state (the default): both tags (<element>) and charrefs (&name;) are parsed (and possibly converted if convert_charrefs=True);
  • RCDATA state (triggered by title and textarea): tags are not parsed but charrefs are (and possibly converted if convert_charrefs=True). After a matching closing tag is met, the parser returns to data state. The list of elements is stored in the RCDATA_CONTENT_ELEMENTS constant;
  • RAWTEXT state (triggered by script, style, xmp, iframe, noembed, noframes, and possibly noscript, if scripting=True): neither tags nor charrefs are parsed/converted, and the content of those elements is returned as is. After the matching closing tag is met, the parser returns to data state. Note that the list of elements is stored in the CDATA_CONTENT_ELEMENTS constant;
  • PLAINTEXT state (triggered by plaintext): both tags and charrefs are not parsed/converted (like for RAWTEXT) until the EOF -- there is no matching closing tag and everything until the end is emitted as is.
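
The four states above can be summarised as a small lookup. The sketch below is a stand-alone model of the rules as listed, not the HTMLParser implementation; the names content_state, STATE_RULES, RCDATA_ELEMENTS and RAWTEXT_ELEMENTS are hypothetical.

```python
# Hypothetical constants mirroring the element lists given above.
RCDATA_ELEMENTS = frozenset({"title", "textarea"})
RAWTEXT_ELEMENTS = frozenset(
    {"script", "style", "xmp", "iframe", "noembed", "noframes"}
)

# state -> (tags parsed, charrefs parsed, what ends the state)
STATE_RULES = {
    "data": (True, True, None),
    "rcdata": (False, True, "matching end tag"),
    "rawtext": (False, False, "matching end tag"),
    "plaintext": (False, False, "end of input"),
}


def content_state(tag, scripting=False):
    """Return the state that the rules above assign to the content of *tag*."""
    tag = tag.lower()
    if tag == "plaintext":
        return "plaintext"
    if tag in RAWTEXT_ELEMENTS or (scripting and tag == "noscript"):
        return "rawtext"
    if tag in RCDATA_ELEMENTS:
        return "rcdata"
    return "data"


for name in ("p", "title", "xmp", "noscript", "plaintext"):
    state = content_state(name, scripting=True)
    print(name, state, STATE_RULES[state])
```

With scripting=False, the same call reports noscript as ordinary data, which corresponds to the fallback case in item 1.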

Comment on lines 457 to 464
if tag in self.CDATA_CONTENT_ELEMENTS:
    self.set_cdata_mode(tag)
elif tag in self.RCDATA_CONTENT_ELEMENTS:
    self.set_cdata_mode(tag, escapable=True)
elif self.scripting and tag == "noscript":
    self.set_cdata_mode(tag)
elif tag == "plaintext":
    self.set_cdata_mode(tag, escapable=None)
Member

I don't much like (ab)using escapable=None for PLAINTEXT mode.

Currently the set_cdata_mode function does two things:

  1. determines where the closing tag/end is, which depends on the value of tag passed;
  2. determines whether charrefs are converted, which depends on the value passed to escapable.

Even though there is some duplication, I would prefer something like this:

            if (tag in self.CDATA_CONTENT_ELEMENTS or
                (self.scripting and tag == "noscript") or
                tag == "plaintext"):
                self.set_cdata_mode(tag, escapable=False)
            elif tag in self.RCDATA_CONTENT_ELEMENTS:
                self.set_cdata_mode(tag, escapable=True)

This makes clear that all these cases are handled by set_cdata_mode, with the former ignoring charrefs and the latter converting them.

Then in set_cdata_mode we can set self.interesting based on the values of the args passed. This will also make it clearer what is considered interesting for each tag.
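
One way to picture that suggestion is a function that derives the "interesting" pattern from the two arguments alone. This is a hedged sketch, not the merged code: interesting_pattern is a hypothetical stand-in for the relevant part of set_cdata_mode, the end-tag regex is an assumption, and \Z is used instead of the \z shown in the diff so that the example also runs on Python versions whose re module lacks \z.

```python
import re


def interesting_pattern(tag, escapable):
    """Hypothetical stand-in for the part of set_cdata_mode that picks
    self.interesting from the tag name and the escapable flag."""
    tag = tag.lower()
    if tag == "plaintext":
        # PLAINTEXT never ends, so only the end of input is interesting.
        return re.compile(r"\Z")
    # Assumed shape of a matching end tag: "</name" followed by
    # whitespace, "/" or ">".
    end_tag = rf"</{re.escape(tag)}(?=[\t\n\r\f />])"
    if escapable:
        # RCDATA: character references are interesting as well.
        return re.compile(rf"&|{end_tag}", re.IGNORECASE)
    # RAWTEXT: only the matching end tag is interesting.
    return re.compile(end_tag, re.IGNORECASE)


samples = [
    ("script", False, "x&amp;</script>"),
    ("title", True, "x&amp;</title>"),
    ("plaintext", False, "a</plaintext>b"),
]
for tag, escapable, text in samples:
    match = interesting_pattern(tag, escapable).search(text)
    print(tag, match.start())
```

In this sketch the RAWTEXT pattern skips the character reference and stops at the end tag, the RCDATA pattern stops at the character reference, and the PLAINTEXT pattern matches only at the end of the input.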

@serhiy-storchaka
Member Author

Thank you for your review @ezio-melotti.

@serhiy-storchaka serhiy-storchaka merged commit a17c57e into python:main Oct 31, 2025
58 checks passed
@miss-islington-app

Thanks @serhiy-storchaka for the PR 🌮🎉.. I'm working now to backport this PR to: 3.13, 3.14.
🐍🍒⛏🤖

@serhiy-storchaka serhiy-storchaka deleted the htmlparser-rawtext branch October 31, 2025 15:44
miss-islington pushed a commit to miss-islington/cpython that referenced this pull request Oct 31, 2025
…arser (pythonGH-137837)

* the "plaintext" element
* the RAWTEXT elements "xmp", "iframe", "noembed" and "noframes"
* optionally RAWTEXT (if scripting=True) element "noscript"
(cherry picked from commit a17c57e)

Co-authored-by: Serhiy Storchaka <storchaka@gmail.com>
@bedevere-app

bedevere-app bot commented Oct 31, 2025

GH-140841 is a backport of this pull request to the 3.14 branch.

@bedevere-app bedevere-app bot removed the needs backport to 3.14 bugs and security fixes label Oct 31, 2025
@bedevere-app

bedevere-app bot commented Oct 31, 2025

GH-140842 is a backport of this pull request to the 3.13 branch.

@bedevere-app bedevere-app bot removed the needs backport to 3.13 bugs and security fixes label Oct 31, 2025
serhiy-storchaka added a commit to miss-islington/cpython that referenced this pull request Oct 31, 2025
…arser (pythonGH-137837)

* the "plaintext" element
* the RAWTEXT elements "xmp", "iframe", "noembed" and "noframes"
* optionally RAWTEXT (if scripting=True) element "noscript"
(cherry picked from commit a17c57e)

Co-authored-by: Serhiy Storchaka <storchaka@gmail.com>
@serhiy-storchaka
Member Author

Backporting to older Python versions should be from 3.13.

serhiy-storchaka added a commit that referenced this pull request Oct 31, 2025
…Parser (GH-137837) (GH-140842)

* the "plaintext" element
* the RAWTEXT elements "xmp", "iframe", "noembed" and "noframes"
* optionally RAWTEXT (if scripting=True) element "noscript"
(cherry picked from commit a17c57e)

Co-authored-by: Serhiy Storchaka <storchaka@gmail.com>
serhiy-storchaka added a commit that referenced this pull request Oct 31, 2025
…Parser (GH-137837) (GH-140841)

* the "plaintext" element
* the RAWTEXT elements "xmp", "iframe", "noembed" and "noframes"
* optionally RAWTEXT (if scripting=True) element "noscript"
(cherry picked from commit a17c57e)

Co-authored-by: Serhiy Storchaka <storchaka@gmail.com>
serhiy-storchaka added a commit to serhiy-storchaka/cpython that referenced this pull request Oct 31, 2025
…n HTMLParser (pythonGH-137837) (pythonGH-140842)

* the "plaintext" element
* the RAWTEXT elements "xmp", "iframe", "noembed" and "noframes"
* optionally RAWTEXT (if scripting=True) element "noscript"
(cherry picked from commit a17c57e)
(cherry picked from commit 0329bd1)

Co-authored-by: Miss Islington (bot) <31488909+miss-islington@users.noreply.github.com>
Co-authored-by: Serhiy Storchaka <storchaka@gmail.com>
serhiy-storchaka added a commit to serhiy-storchaka/cpython that referenced this pull request Oct 31, 2025
… HTMLParser (pythonGH-137837) (pythonGH-140842)

* the "plaintext" element
* the RAWTEXT elements "xmp", "iframe", "noembed" and "noframes"
* optionally RAWTEXT (if scripting=True) element "noscript"
(cherry picked from commit a17c57e)
(cherry picked from commit 0329bd1)

Co-authored-by: Miss Islington (bot) <31488909+miss-islington@users.noreply.github.com>
Co-authored-by: Serhiy Storchaka <storchaka@gmail.com>
ambv pushed a commit that referenced this pull request Oct 31, 2025
…Parser (GH-137837) (GH-140842) (GH-140853)

(cherry picked from commit a17c57e)
(cherry picked from commit 0329bd1)
ambv pushed a commit that referenced this pull request Oct 31, 2025
…Parser (GH-137837) (GH-140842) (GH-140850)

(cherry picked from commit a17c57e)
(cherry picked from commit 0329bd1)

Co-authored-by: Serhiy Storchaka <storchaka@gmail.com>
ambv pushed a commit that referenced this pull request Oct 31, 2025
…arser (GH-137837) (GH-140842) (GH-140857)

(cherry picked from commit a17c57e)
(cherry picked from commit 0329bd1)

Co-authored-by: Miss Islington (bot) <31488909+miss-islington@users.noreply.github.com>
ambv added a commit that referenced this pull request Oct 31, 2025
…Parser (GH-137837) (GH-140842) (GH-140852)

(cherry picked from commit a17c57e)
(cherry picked from commit 0329bd1)

Co-authored-by: Łukasz Langa <lukasz@langa.pl>

Labels

type-security A security issue

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Support other RAWTEXT and PLAINTEXT elements in HTMLParser

2 participants