@dmsnell
Contributor

dmsnell commented Dec 4, 2025

In the example highlighting ambiguities from missing semicolons on named character references, a "correct" encoding is provided, but that example makes no mention of the fact that the fragment was ambiguous precisely because the ampersand wasn't escaped.

This patch adds a clarifying note explaining how this situation is avoided by always escaping the ampersand.

  • At least two implementers are interested (and none opposed):
  • Tests are written and can be reviewed and commented upon at:
  • Implementation bugs are filed:
    • Chromium: …
    • Gecko: …
    • WebKit: …
    • Deno (only for timers, structured clone, base64 utils, channel messaging, module resolution, web workers, and web storage): …
    • Node.js (only for timers, structured clone, base64 utils, channel messaging, and module resolution): …
  • Corresponding HTML AAM & ARIA in HTML issues & PRs:
  • MDN issue is filed: …
  • The top of this comment includes a clear commit message to use.

(See WHATWG Working Mode: Changes for more details.)

In the example highlighting ambiguities from missing semicolons on named
character references, a "correct" encoding is provided, but that example
makes no mention of the fact that the fragment was ambiguous precisely
because the ampersand wasn't escaped.

This patch adds a clarifying note explaining how this situation is
avoided by always escaping the ampersand.

Co-authored-by: Jon Surrell <jon.surrell@automattic.com>
GitHub-PR: 11988
GitHub-PR-URL: whatwg#11988
@dmsnell dmsnell force-pushed the syntax-errors/always-escape-amp branch from d1fb385 to 9753779 Compare December 4, 2025 19:50
@dmsnell
Contributor Author

dmsnell commented Dec 4, 2025

As a side note, I overlooked adding my name to the list of contributors in my first submission.

@sirreal

sirreal commented Dec 5, 2025

I was surprised to find no recommendation about escaping & with character references anywhere in the HTML standard. The section this PR touches seems to encourage not escaping & if it is not ambiguous (bold mine):

Thus, the correct way to express the above cases is as follows:

<a href="?bill&ted">Bill and Ted</a> <!-- &ted is ok, since it's not a named character reference -->
<a href="?art&amp;copy">Art and Copy</a> <!-- the & has to be escaped, since &copy is a named character reference -->

I read this as if &amp;ted would be wrong in some way, since it isn't the correct way. However, it seems much simpler to me to escape the ampersand here as &amp;.
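The equivalence is easy to check with Python's standard `html.unescape`, which is documented to follow the HTML5 rules for valid and invalid character references. This is only a sketch of the decoding behavior: `html.unescape` treats its input as plain text and does not model the attribute-specific exception, which does not matter for these two URLs because nothing follows `&ted` or `&copy` inside the attribute value.

```python
from html import unescape

# "ted" is not a named character reference, so the bare "&" survives.
assert unescape("?bill&ted") == "?bill&ted"

# Escaping the ampersand decodes to exactly the same text.
assert unescape("?bill&amp;ted") == "?bill&ted"

# "copy" is a legacy named reference that works without a semicolon,
# so the unescaped form silently turns into the copyright sign.
assert unescape("?art&copy") == "?art\u00a9"

# Only the escaped form preserves the intended query string.
assert unescape("?art&amp;copy") == "?art&copy"
```

Both spellings of the first URL decode identically, which is why always writing `&amp;` costs nothing and removes the need to know the list of named references.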

I would change this section to something like the following:

-<!-- &ted is ok, since it's not a named character reference -->
+<!-- "&ted" is ok because "ted" is not a named character reference. -->
+<!-- "&amp;ted" is equivalent and less error-prone because "&amp;" explicitly decodes to "&". -->

There is precedent for such a recommendation. Section 4.12.1.3 Restrictions for contents of script elements has a prominent note with an encoding recommendation:

The easiest and safest way to avoid the rather strange restrictions described in this section is to always escape an ASCII case-insensitive match for "<!--" as "\x3C!--", "<script" as "\x3Cscript", and "</script" as "\x3C/script" when these sequences appear in literals in scripts (e.g. in strings, regular expressions, or comments), and to avoid writing code that uses such constructs in expressions. Doing so avoids the pitfalls that the restrictions in this section are prone to triggering: namely, that, for historical reasons, parsing of script blocks in HTML is a strange and exotic practice that acts unintuitively in the face of these sequences.
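As a rough illustration of what that note recommends, here is a minimal Python sketch that rewrites the three sequences inside the body of a JavaScript string literal. The function name `escape_script_literal` is hypothetical, and the sketch assumes its input is already known to be string-literal content (where `\x3C` is a valid escape for `<`); it is not a general-purpose script sanitizer.

```python
import re

# A "<" that begins "<!--", "<script", or "</script", matched
# ASCII case-insensitively as the note describes.
_RISKY_LT = re.compile(r"<(?=!--|/?script)", re.IGNORECASE | re.ASCII)


def escape_script_literal(literal_body: str) -> str:
    """Replace the risky "<" with the JavaScript escape \\x3C."""
    # A function replacement avoids backslash processing in the template.
    return _RISKY_LT.sub(lambda _match: r"\x3C", literal_body)


assert escape_script_literal("</script>") == r"\x3C/script>"
assert escape_script_literal("<!-- note") == r"\x3C!-- note"
assert escape_script_literal("<SCRIPT src=x>") == r"\x3CSCRIPT src=x>"
# An ordinary less-than sign is left alone.
assert escape_script_literal("a < b") == "a < b"
```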


Section 13.1.4 Character references seems like a good place to add a similar note. For example:

Note

Where character references are allowed, it's a good idea to always encode & with its character reference &amp;. This prevents any ambiguity as to whether the & is part of a character reference or a literal &.

I would consider mentioning the most common characters that are useful to escape in different contexts, but the note about & seems particularly helpful.

@annevk
Member

annevk commented Dec 5, 2025

https://html.spec.whatwg.org/multipage/syntax.html#character-references already requires this so I'm not sure we need to state it again in the parser section. Is the problem that the parser doesn't flag it?

@dmsnell
Contributor Author

dmsnell commented Dec 5, 2025

Is the problem that the parser doesn't flag it?

I believe the problem here is that the illustrative example in the syntax-error section explicitly states that the correct way to produce HTML text containing & is to not escape it if what follows is not a legitimately-parsed character reference.

The example illustrates that a parser will correctly identify &ted as that raw string, but suggests that &ted is more appropriate than &amp;ted.

So basically this is just a confusing aspect for implementers and it seems like we could tweak the wording to maintain the demonstration of how these errors are handled without encouraging people to lean on syntax errors in cases where they produce the right output.

@annevk
Member

annevk commented Dec 5, 2025

I see, this is part of https://html.spec.whatwg.org/multipage/introduction.html#syntax-errors.

We don't disallow &ted currently so unless we also change the HTML Writing requirements in some way I'd be a bit hesitant to change it in this one place.

@dmsnell
Contributor Author

dmsnell commented Dec 5, 2025

@annevk thanks. I’m very open to trying out different ideas, but I think the spec is actually a bit vague on this.

already requires this

Unless I’m wrong, the spec does not require that & be escaped as &amp;, only that character references mixed with text must begin with & and be followed by the correct syntax.

However, if someone is authoring HTML and not intending to produce a character reference, a stray & is both properly decoded by the parser and not forbidden.

I think we all agree that the intention is to always escape & as &amp;, but in the nitty gritty, unless it’s hidden in some other section none of us have scoured up yet, it’s not explicitly normalized as such. The only reference we’ve been able to find that isn’t implied is the one in this PR, where the spec assertively states that it’s correct to omit the escaping.

@dmsnell
Contributor Author

dmsnell commented Dec 5, 2025

I apologize for omitting the before/after screenshots. I took a before shot and was waiting to add it to the description until the parser previews were generated, but they never appeared, and I then forgot to upload the before shot anyway. Here is the relevant context from the modified section.

[Screenshot of the modified section, taken 2025-12-04 at 12:51:32 PM]

@annevk
Member

annevk commented Dec 5, 2025

That's what I'm saying as well though in my latest comment. The Writing section explicitly allows you to do this. So I don't want to accept this PR as-is, as it'll contradict the Writing section.

@zcorpan was involved in some of the details here and should probably weigh in.

@dmsnell
Contributor Author

dmsnell commented Dec 5, 2025

sounds great, and I have no wish for this to be accepted as-is. in fact, I was hoping for further input because I myself struggled to figure out how best to represent it. @sirreal is the author of the original suggestion.

interestingly enough, the HTML 3 spec was clearer on this point, but that entire document comprises only a handful of ill-defined paragraphs 🙃

Because certain characters will be interpreted as markup, they should be represented by markup…for instance the character "&" must be represented by the entity &amp;.

@zcorpan
Member

zcorpan commented Dec 9, 2025

I think it's worth considering switching to require escaped ampersands. The rules for when it's allowed are non-trivial and it's surprising that &ted is OK but &copy is not OK, or that the behavior is different between in data and in attribute values.
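To make the data-versus-attribute difference concrete, here is a small Python sketch of the named-reference step only. It uses the `html5` table from the standard `html.entities` module and applies the legacy attribute rule from the tokenizer: a reference without a trailing semicolon is left alone inside an attribute value when the next character is `=` or an ASCII alphanumeric. The function name `decode_named_refs` is hypothetical, numeric references are ignored, and this is an illustration rather than a conforming tokenizer.

```python
from html.entities import html5  # keys look like "copy" and "copy;"


def decode_named_refs(text: str, in_attribute: bool) -> str:
    out = []
    i = 0
    while i < len(text):
        if text[i] != "&":
            out.append(text[i])
            i += 1
            continue
        # Longest table entry that matches right after the "&" (names are
        # at most 32 characters including the semicolon).
        match = None
        for end in range(min(len(text), i + 33), i + 1, -1):
            if text[i + 1:end] in html5:
                match = text[i + 1:end]
                break
        if match is None:
            out.append("&")  # not a reference: the "&" is literal
            i += 1
            continue
        after = i + 1 + len(match)
        nxt = text[after:after + 1]
        legacy_attribute_case = (
            in_attribute
            and not match.endswith(";")
            and (nxt == "=" or (nxt.isascii() and nxt.isalnum()))
        )
        out.append("&" + match if legacy_attribute_case else html5[match])
        i = after
    return "".join(out)


# Same input, different result depending on where it appears.
assert decode_named_refs("?art&copy=1", in_attribute=True) == "?art&copy=1"
assert decode_named_refs("?art&copy=1", in_attribute=False) == "?art\u00a9=1"
# At the end of an attribute value the reference is decoded after all.
assert decode_named_refs("?art&copy", in_attribute=True) == "?art\u00a9"
# Escaping the ampersand gives one answer everywhere.
assert decode_named_refs("?art&amp;copy=1", in_attribute=True) == "?art&copy=1"
assert decode_named_refs("?art&amp;copy=1", in_attribute=False) == "?art&copy=1"
```

The amount of logic needed just to predict the unescaped cases is a fair measure of how non-trivial the current rules are.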

Always escape & is clear and easy to understand.

This was my position in 2007 also: https://lists.whatwg.org/pipermail/whatwg-whatwg.org/2007-September/012457.html

cc @hsivonen @sideshowbarker

@sirreal

sirreal commented Dec 9, 2025

Always escape & is clear and easy to understand.

This is what I'd really like to address with at least a recommendation in the HTML standard that & is best escaped where applicable.

@dmsnell linked to the HTML3 spec. HTML4 also makes a recommendation:

Authors should use "&amp;" (ASCII decimal 38) instead of "&" to avoid confusion with the beginning of a character reference (entity reference open delimiter).

Escaping & is something we understand implicitly and it's apparent in functions like PHP's htmlspecialchars or Python's html.escape.
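For reference, a short sketch of what that looks like with Python's `html.escape`. The sample strings are made up for illustration; the point is only that the library escapes every `&` unconditionally and that the result round-trips through `html.unescape`.

```python
from html import escape, unescape

# Every "&" is escaped, whether or not a named reference follows it.
assert escape("?bill&ted") == "?bill&amp;ted"
assert escape("?art&copy") == "?art&amp;copy"

# "<" and ">" are escaped too, and quotes as well by default (quote=True).
assert escape('<a title="x">') == "&lt;a title=&quot;x&quot;&gt;"

# Decoding the escaped form always returns the original text.
for original in ("?bill&ted", "?art&copy", "Q&A", "&amp; is an ampersand"):
    assert unescape(escape(original)) == original
```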

An explicit recommendation in the standard about & escaping would be a service to web developers.

@zcorpan
Member

zcorpan commented Dec 11, 2025

I think we should make it a parse error if we change this.

@sideshowbarker
Member

I think we should make it a parse error if we change this.

I would rather we don't. I say that because I don't actually want to implement an error or warning for this in the checker, regardless of what the spec ends up saying here. I don't think it will actually be good for users to be getting new errors or warnings from the checker about this.

But if it's made an actual parse error in the spec, I would somewhat be forced into it, regardless — because for errors from the HTML parser, the checker basically just bubbles all those up as-is.

That said, I would also not personally implement a parse error for it in the HTML parser sources. But there's nothing that would prevent any other contributor (or code owner) for the parser code from implementing it.
