Encourage always-escaping ampersand character. #11988
In the example highlighting ambiguities from missing semicolons on named character references, a "correct" encoding is provided, but that example makes no mention of the fact that the fragment was ambiguous precisely because the ampersand wasn't escaped. This patch adds a clarifying note explaining how this situation is avoided by always escaping the ampersand.

Co-authored-by: Jon Surrell <jon.surrell@automattic.com>
GitHub-PR: 11988
GitHub-PR-URL: whatwg#11988
d1fb385 to 9753779
|
As a side note, I overlooked adding my name to the list of contributors in my first submission. |
|
I was surprised to find no recommendation about escaping
I read this as if I would change this section to something like the following:
-<!-- &ted is ok, since it's not a named character reference -->
+<!-- "&ted" is ok because "ted" is not a named character reference. -->
+<!-- "&amp;ted" is equivalent and less error-prone because "&amp;" explicitly decodes to "&". -->
There is precedent for such a recommendation. Section 4.12.1.3 Restrictions for contents of script elements has a prominent note with an encoding recommendation:
Section 13.1.4 Character references seems like a good place to add a similar note. For example Note Where character references are allowed, it's a good idea to always encode I would consider mention the most common characters that are useful to escape in different contexts, but the note about |
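The equivalence claimed in the proposed comment wording can be checked with a small sketch. It assumes Python's standard-library `html.unescape`, which is documented as applying the HTML5 rules for character references, as a stand-in for a conforming decoder. The sample strings are assumptions modeled on the example under discussion, not text from the patch.

```python
import html

# Sample strings modeled on the example under discussion (assumed, not quoted).
bare = "?bill&ted"         # relies on "ted" not being a named character reference
escaped = "?bill&amp;ted"  # relies on nothing: "&amp;" always decodes to "&"

decoded_bare = html.unescape(bare)
decoded_escaped = html.unescape(escaped)

# Both spellings decode to the same text.
print(decoded_bare == decoded_escaped)  # True
```

Both forms produce the same text today, but only the escaped form keeps doing so regardless of which names exist in the named character reference table.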
|
https://html.spec.whatwg.org/multipage/syntax.html#character-references already requires this so I'm not sure we need to state it again in the parser section. Is the problem that the parser doesn't flag it? |
I believe the problem here is that the illustrative example in the syntax-error section explicitly states that the correct way to produce HTML text containing The example illustrates that a parser will correctly identify So basically this is just a confusing aspect for implementers and it seems like we could tweak the wording to maintain the demonstration of how these errors are handled without encouraging people to lean on syntax errors in cases where they produce the right output. |
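A minimal sketch of the decoding behavior being discussed, again assuming Python's `html.unescape` as an approximation of a conforming decoder. It treats its input as text content; for a reference at the very end of a value, as here, text and attribute handling give the same result. The strings are illustrative assumptions.

```python
import html

ambiguous = "?art&copy"     # "copy" is a legacy name recognized even without ";"
explicit = "?art&amp;copy"  # escaped ampersand, so no name lookup is involved

decoded_ambiguous = html.unescape(ambiguous)
decoded_explicit = html.unescape(explicit)

print(decoded_ambiguous)  # ?art©
print(decoded_explicit)   # ?art&copy
```

The unescaped form silently turns into a copyright sign, while the escaped form yields the literal text the author most likely intended.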
|
I see, this is part of https://html.spec.whatwg.org/multipage/introduction.html#syntax-errors. We don't disallow |
|
@annevk thanks. I’m very open to trying out different ideas, but I think the spec is actually a bit vague on this.
Unless I’m wrong, the spec does not require that However, if someone is authoring HTML and not intending to produce a character reference, a stray I think we all agree that the intention is to always escape |
|
That's what I'm saying as well though in my latest comment. The Writing section explicitly allows you to do this. So I don't want to accept this PR as-is, as it'll contradict the Writing section. @zcorpan was involved in some of the details here and should probably weigh in. |
|
sounds great, and I have no wish that this be as-is. in fact, I was hoping for further input because I myself struggled to figure out how best to represent it. @sirreal is the author of the original suggestion. interestingly enough, the HTML 3 spec was clearer on this point, but that entire document comprises only a handful of ill-defined paragraphs 🙃
|
|
I think it's worth considering switching to require escaped ampersands. The rules for when it's allowed are non-trivial and it's surprising that Always escape This was my position in 2007 also: https://lists.whatwg.org/pipermail/whatwg-whatwg.org/2007-September/012457.html |
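On the writing side, "always escape" reduces to a single rule that needs no knowledge of the named character reference table. The sketch below is an illustration only: `escape_text` is a hypothetical helper name, and Python's `html.escape` and `html.unescape` stand in for a serializer and a decoder. Note that `html.unescape` applies text-content rules; the standard has an extra historical exception inside attribute values when an unterminated reference is followed by `=` or an ASCII alphanumeric, which is part of what makes the "when is a bare ampersand allowed" rules non-trivial.

```python
import html


def escape_text(text: str) -> str:
    """Hypothetical helper: escape "&", "<" and ">" for HTML text content."""
    return html.escape(text, quote=False)


# Illustrative inputs containing ampersands in different positions.
samples = ["AT&T", "I'm &notit; I tell you", "?a=1&copy=2"]

for sample in samples:
    encoded = escape_text(sample)
    # Escaped text always survives a decode unchanged.
    assert html.unescape(encoded) == sample
    print(f"{sample!r} -> {encoded!r}")

# Left unescaped, "&copy" is decoded as a legacy reference in text content.
decoded_unescaped = html.unescape("?a=1&copy=2")
print(decoded_unescaped)  # ?a=1©=2
```

The helper never has to decide whether a particular ampersand is safe, which is the practical argument for a blanket recommendation.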
This is what I'd really like to address with at least a recommendation in the HTML standard. @dmsnell linked to the HTML3 spec. HTML4 also makes a recommendation:
Escaping An explicit recommendation in the standard about |
|
I think we should make it a parse error if we change this. |
I would rather we don't. I say that because I don't actually want to implement an error or warning for this in the checker, regardless of what the spec may end up saying here. I don't think it will actually be good for users to be getting new errors or warnings from the checker about this. But if it's made an actual parse error in the spec, I would be somewhat forced into it anyway, because the checker basically just bubbles up all errors from the HTML parser as-is. That said, I would also not personally implement a parse error for it in the HTML parser sources. But there's nothing that would prevent any other contributor (or code owner) of the parser code from implementing it. |
