gh-137836: Support more RAWTEXT and PLAINTEXT elements in HTMLParser by serhiy-storchaka · Pull Request #137837 · python/cpython

serhiy-storchaka · 2025-08-15T20:12:19Z

the "plaintext" element
the RAWTEXT elements "xmp", "iframe", "noembed" and "noframes"
optionally RAWTEXT (if scripting=True) element "noscript"

Issue: Support other RAWTEXT and PLAINTEXT elements in HTMLParser #137836

📚 Documentation preview 📚: https://cpython-previews--137837.org.readthedocs.build/

…arser * the "plaintext" element * the RAWTEXT elements "xmp", "iframe", "noembed" and "noframes" * optionally RAWTEXT (if scripting=True) element "noscript"

Doc/library/html.parser.rst

ezio-melotti · 2025-10-24T13:19:31Z

Doc/library/html.parser.rst

   Create a parser instance able to parse invalid markup.

   If *convert_charrefs* is ``True`` (the default), all character
   references (except the ones in ``script``/``style`` elements) are


This should be updated now that the list has been expanded.

It might be easier to have a short section about parsing modes, listing each mode, which elements trigger it, whether charrefs are converted or not, and when the state is terminated.

Here we could then say

Suggested change

references (except the ones in ``script``/``style`` elements) are

references (except the ones in RAWTEXT tags) are

with RAWTEXT linking to that section.

Do we need to document this here? This is a part of the HTML5 specification. What will the user get from this information?


Lib/html/parser.py

ezio-melotti · 2025-10-24T13:34:39Z

Lib/html/parser.py

+                self.set_cdata_mode(tag)
+            elif tag == "plaintext":
+                self.set_cdata_mode(tag)
+                self.interesting = re.compile(r'\z')


I think it would be better to move this into set_cdata_mode by adding a third branch to the if/else that sets self.interesting.

I considered this option. But should we repeat the condition tag == "plaintext" in two places, or add "plaintext" to CDATA_CONTENT_ELEMENTS or RCDATA_CONTENT_ELEMENTS? In either case we would need to repeat "plaintext" twice. It could also create an asymmetry with "noscript" if the special cases are handled in different places. That is how I arrived at the current code.

Another option is to use the special value escapable=None to switch to PLAINTEXT mode.

ezio-melotti · 2025-10-24T13:35:26Z

This PR seems to address 3 issues:

It adds a scripting arg, that is used to determine how <noscript>...</noscript> is handled. If the browser has JS enabled, the content of <noscript> is not parsed; if it's disabled, the content is parsed and displayed as a fallback. The former is emulated by passing scripting=True (which adds noscript to the list of RAWTEXT elements), the latter by passing scripting=False.
It adds 4 more elements that are (unconditionally) handled as RAWTEXT (in addition to the existing 2, script and style): xmp, iframe, noembed, and noframes.
It adds support for PLAINTEXT state, triggered by the plaintext element.
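To make the three points concrete, here is a small sketch of how the behaviour can be observed from the outside. EventCollector and parse are hypothetical helper names, not part of html.parser. The script case works on any Python version; the scripting keyword is assumed to exist only in builds that include this PR, so the code probes for it instead of relying on it.

```python
import inspect
from html.parser import HTMLParser


class EventCollector(HTMLParser):
    """Hypothetical helper: record parser callbacks as (kind, value) tuples."""

    def __init__(self, **kwargs):
        super().__init__(**kwargs)
        self.events = []

    def handle_starttag(self, tag, attrs):
        self.events.append(("start", tag))

    def handle_endtag(self, tag):
        self.events.append(("end", tag))

    def handle_data(self, data):
        # Merge adjacent text chunks so the result does not depend on
        # how the parser happens to split the data.
        if self.events and self.events[-1][0] == "data":
            data = self.events.pop()[1] + data
        self.events.append(("data", data))


def parse(html, **kwargs):
    """Feed *html* to a fresh collector and return the recorded events."""
    parser = EventCollector(**kwargs)
    parser.feed(html)
    parser.close()
    return parser.events


# "script" was already a RAWTEXT element: the <b> inside it is plain data.
script_events = parse('<script>x = "<b>";</script><p>hi</p>')

# Assumption: the keyword is only present in builds that include this change.
HAS_SCRIPTING = "scripting" in inspect.signature(HTMLParser.__init__).parameters

if HAS_SCRIPTING:
    markup = "<noscript><b>fallback</b></noscript>"
    with_js = parse(markup, scripting=True)      # noscript handled as RAWTEXT
    without_js = parse(markup, scripting=False)  # noscript content parsed as markup
```

With scripting=True the inner b element should show up as text rather than as a start tag, mirroring what a browser with JS enabled does with noscript.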

The difference between states is the following:

Data state (the default): both tags (<element>) and charrefs (&name;) are parsed (and possibly converted if convert_charrefs=True);
RCDATA state (triggered by title and textarea): tags are not parsed but charrefs are (and possibly converted if convert_charrefs=True). After the matching closing tag is met, the parser returns to the data state. The list of elements is stored in the RCDATA_CONTENT_ELEMENTS constant;
RAWTEXT state (triggered by script, style, xmp, iframe, noembed, noframes, and also noscript if scripting=True): neither tags nor charrefs are parsed/converted, and the content of those elements is returned as is. After the matching closing tag is met, the parser returns to the data state. Note that the list of elements is stored in the CDATA_CONTENT_ELEMENTS constant;
PLAINTEXT state (triggered by plaintext): neither tags nor charrefs are parsed/converted (as in RAWTEXT) until EOF: there is no matching closing tag and everything until the end is emitted as is.
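The same summary can be written down as a lookup table. This is only an illustration of the rules listed above: Mode, MODE_RULES, mode_for and the two element sets are hypothetical names for this sketch, not attributes of HTMLParser.

```python
from enum import Enum


class Mode(Enum):
    DATA = "data"
    RCDATA = "rcdata"
    RAWTEXT = "rawtext"
    PLAINTEXT = "plaintext"


# Element lists as given in the comment above (names are illustrative).
RCDATA_ELEMENTS = frozenset({"title", "textarea"})
RAWTEXT_ELEMENTS = frozenset(
    {"script", "style", "xmp", "iframe", "noembed", "noframes"}
)

# mode -> (tags parsed?, charrefs parsed?, ends at matching closing tag?)
MODE_RULES = {
    Mode.DATA: (True, True, False),
    Mode.RCDATA: (False, True, True),
    Mode.RAWTEXT: (False, False, True),
    Mode.PLAINTEXT: (False, False, False),
}


def mode_for(tag, scripting=False):
    """Return the mode entered after the start tag of *tag*."""
    tag = tag.lower()
    if tag == "plaintext":
        return Mode.PLAINTEXT
    if tag in RAWTEXT_ELEMENTS or (scripting and tag == "noscript"):
        return Mode.RAWTEXT
    if tag in RCDATA_ELEMENTS:
        return Mode.RCDATA
    return Mode.DATA
```

For example, mode_for("noscript") stays in the data state, while mode_for("noscript", scripting=True) switches to RAWTEXT.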

Co-authored-by: Ezio Melotti <ezio.melotti@gmail.com>


ezio-melotti · 2025-10-31T13:27:31Z

Lib/html/parser.py

@@ -448,6 +458,10 @@ def parse_starttag(self, i):
                self.set_cdata_mode(tag)
            elif tag in self.RCDATA_CONTENT_ELEMENTS:
                self.set_cdata_mode(tag, escapable=True)
+            elif self.scripting and tag == "noscript":
+                self.set_cdata_mode(tag)
+            elif tag == "plaintext":
+                self.set_cdata_mode(tag, escapable=None)


I'm not too fond of (ab)using escapable=None for PLAINTEXT mode.

Currently the set_cdata_mode function does two things:

determines where the closing tag/end is, which depends on the value of tag passed;

determines whether charrefs are converted, which depends on the value passed to escapable.

Even though there is some duplication, I would prefer something like this:

    if (tag in self.CDATA_CONTENT_ELEMENTS or
            (self.scripting and tag == "noscript") or
            tag == "plaintext"):
        self.set_cdata_mode(tag, escapable=False)
    elif tag in self.RCDATA_CONTENT_ELEMENTS:
        self.set_cdata_mode(tag, escapable=True)

This makes clear that all these cases are handled by set_cdata_mode, with the former ignoring charrefs and the latter converting them.

Then in set_cdata_mode we can set self.interesting based on the values of the args passed. This will also make it clearer what is considered interesting for each tag.
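A minimal sketch of that last idea, written as a standalone function rather than the real method. interesting_pattern is a hypothetical name and the regexes are illustrative, not copied from Lib/html/parser.py. It uses \Z (end of string only) for the PLAINTEXT case so the sketch does not depend on \z being available.

```python
import re


def interesting_pattern(tag, escapable=False):
    """Hypothetical helper: pick the 'interesting' regex from the arguments."""
    tag = tag.lower()
    if tag == "plaintext":
        # PLAINTEXT: nothing is interesting before the end of the input.
        return re.compile(r"\Z")
    # The closing tag must be followed by whitespace, "/" or ">".
    end_tag = r"</%s(?=[\t\n\r\f />])" % re.escape(tag)
    if escapable:
        # RCDATA: charrefs are still interesting, tags are not.
        return re.compile(r"&|" + end_tag, re.IGNORECASE)
    # RAWTEXT: only the matching closing tag is interesting.
    return re.compile(end_tag, re.IGNORECASE)
```

Keeping the choice in one place makes it visible at a glance what each kind of element treats as the end of its raw content.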


serhiy-storchaka · 2025-10-31T15:29:32Z

Thank you for your review @ezio-melotti.

miss-islington-app · 2025-10-31T15:44:06Z

Thanks @serhiy-storchaka for the PR 🌮🎉.. I'm working now to backport this PR to: 3.13, 3.14.
🐍🍒⛏🤖

…arser (pythonGH-137837) * the "plaintext" element * the RAWTEXT elements "xmp", "iframe", "noembed" and "noframes" * optionally RAWTEXT (if scripting=True) element "noscript" (cherry picked from commit a17c57e) Co-authored-by: Serhiy Storchaka <storchaka@gmail.com>

bedevere-app · 2025-10-31T15:44:27Z

GH-140841 is a backport of this pull request to the 3.14 branch.

bedevere-app · 2025-10-31T15:44:32Z

GH-140842 is a backport of this pull request to the 3.13 branch.


serhiy-storchaka · 2025-10-31T15:49:08Z

Backports to older Python versions should be made from the 3.13 branch.

…Parser (GH-137837) (GH-140842) * the "plaintext" element * the RAWTEXT elements "xmp", "iframe", "noembed" and "noframes" * optionally RAWTEXT (if scripting=True) element "noscript" (cherry picked from commit a17c57e) Co-authored-by: Serhiy Storchaka <storchaka@gmail.com>

…Parser (GH-137837) (GH-140841) * the "plaintext" element * the RAWTEXT elements "xmp", "iframe", "noembed" and "noframes" * optionally RAWTEXT (if scripting=True) element "noscript" (cherry picked from commit a17c57e) Co-authored-by: Serhiy Storchaka <storchaka@gmail.com>

…n HTMLParser (pythonGH-137837) (pythonGH-140842) * the "plaintext" element * the RAWTEXT elements "xmp", "iframe", "noembed" and "noframes" * optionally RAWTEXT (if scripting=True) element "noscript" (cherry picked from commit a17c57e) (cherry picked from commit 0329bd1) Co-authored-by: Miss Islington (bot) <31488909+miss-islington@users.noreply.github.com> Co-authored-by: Serhiy Storchaka <storchaka@gmail.com>


…Parser (GH-137837) (GH-140842) (GH-140853) (cherry picked from commit a17c57e) (cherry picked from commit 0329bd1)

…Parser (GH-137837) (GH-140842) (GH-140850) (cherry picked from commit a17c57e) (cherry picked from commit 0329bd1) Co-authored-by: Serhiy Storchaka <storchaka@gmail.com>

…arser (GH-137837) (GH-140842) (GH-140857) (cherry picked from commit a17c57e) (cherry picked from commit 0329bd1) Co-authored-by: Miss Islington (bot) <31488909+miss-islington@users.noreply.github.com>

…Parser (GH-137837) (GH-140842) (GH-140852) (cherry picked from commit a17c57e) (cherry picked from commit 0329bd1) Co-authored-by: Łukasz Langa <lukasz@langa.pl>


serhiy-storchaka requested a review from ezio-melotti August 15, 2025 20:12

bedevere-app bot added the awaiting core review label Aug 15, 2025

bedevere-app bot mentioned this pull request Aug 15, 2025

Support other RAWTEXT and PLAINTEXT elements in HTMLParser #137836

Closed

pythongh-137836: Support more RAWTEXT and PLAINTEXT elements in HTMLP…

2153a4c


serhiy-storchaka force-pushed the htmlparser-rawtext branch from bb7b873 to 2153a4c Compare August 15, 2025 20:15

serhiy-storchaka linked an issue Sep 19, 2025 that may be closed by this pull request

Support other RAWTEXT and PLAINTEXT elements in HTMLParser #137836

Closed

Merge branch 'main' into htmlparser-rawtext

66827e0

serhiy-storchaka commented Oct 15, 2025


Update Doc/library/html.parser.rst

c8429be

ezio-melotti reviewed Oct 24, 2025


serhiy-storchaka and others added 5 commits October 24, 2025 16:56

Apply suggestions from code review

2219106

Co-authored-by: Ezio Melotti <ezio.melotti@gmail.com>

Merge branch 'main' into htmlparser-rawtext

a46e28b

Polish the documentation.

9971a24

Rewrite tests.

69a2b33

Use set_cdata_mode(escapable=None) for PLAINTEXT.

08f4835

ezio-melotti reviewed Oct 31, 2025


serhiy-storchaka and others added 4 commits October 31, 2025 16:19

Apply suggestions from code review

428abe1

Co-authored-by: Ezio Melotti <ezio.melotti@gmail.com>

Merge branch 'main' into htmlparser-rawtext

a60ed6e

Apply suggestions.

350ce25

Merge branch 'htmlparser-rawtext' of github.com:serhiy-storchaka/cpyt…

03d3348

…hon into htmlparser-rawtext

ezio-melotti approved these changes Oct 31, 2025


bedevere-app bot removed the awaiting core review label Oct 31, 2025

bedevere-app bot added the awaiting merge label Oct 31, 2025

serhiy-storchaka removed needs backport to 3.9 needs backport to 3.10 only security fixes needs backport to 3.11 only security fixes needs backport to 3.12 only security fixes labels Oct 31, 2025

serhiy-storchaka merged commit a17c57e into python:main Oct 31, 2025
58 checks passed

bedevere-app bot removed the awaiting merge label Oct 31, 2025

serhiy-storchaka deleted the htmlparser-rawtext branch October 31, 2025 15:44

bedevere-app bot removed the needs backport to 3.14 bugs and security fixes label Oct 31, 2025

bedevere-app bot removed the needs backport to 3.13 bugs and security fixes label Oct 31, 2025

ambv pushed a commit that referenced this pull request Oct 31, 2025

[3.10] gh-137836: Support more RAWTEXT and PLAINTEXT elements in HTML…

3a623c6

…Parser (GH-137837) (GH-140842) (GH-140853) (cherry picked from commit a17c57e) (cherry picked from commit 0329bd1)
