Remove the requirement to NOT include the charset within the Content-Type #10016

Open
agowa opened this issue Dec 21, 2023 · 1 comment
Labels: needs implementer interest, needs tests, normative change, topic: forms


agowa commented Dec 21, 2023

What is the issue with the HTML Standard?

The strict requirement to omit the charset from the Content-Type prevents its use as described in RFC 7231.
I can see how omitting it is fine in the case of UTF-8, but since other charsets are still allowed, we shouldn't forbid stating which charset is actually used. Currently, e.g., an application/x-www-form-urlencoded POST request can have any encoding, and the receiving server has to guess it. A recipient cannot even rely on it being UTF-8 when nothing else is specified, because web browsers use whatever encoding the server used for the response to the GET request for the HTML document that contained the form. E.g., if the server sent the HTML using ISO 8859-15, web browsers will silently use ISO 8859-15, and because of the current requirement NOT to specify a charset, they will NOT label the charset they used.
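To make the ambiguity concrete, here is a minimal Python sketch (not browser code) that serializes the same form field under both encodings. The field name `price` and its value are made up for illustration; `urlencode` only stands in for the form-encoding step a browser would perform.

```python
from urllib.parse import urlencode

# Hypothetical form data: one field holding the euro sign.
FORM = {"price": "€"}

# What the body looks like if the document (and so the form) is UTF-8.
utf8_body = urlencode(FORM, encoding="utf-8")

# What the body looks like if the document was served as ISO 8859-15.
latin9_body = urlencode(FORM, encoding="iso-8859-15")

print(utf8_body)    # price=%E2%82%AC
print(latin9_body)  # price=%A4
```

Both bodies would be sent with the same `Content-Type: application/x-www-form-urlencoded` header, so nothing on the wire tells the server which of the two it received.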

In some cases this is detectable because it causes a decoding error, but in other cases the bytes decode validly in multiple charsets. This confusion can also cause, e.g., a WAF (Web Application Firewall) to make incorrect decisions, cause incorrect logging, or worse.
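Both cases can be sketched in a few lines of Python. The helper names `decode_value` and `is_valid_utf8` are hypothetical, and the sample byte sequences are chosen only to illustrate the point.

```python
from urllib.parse import unquote_to_bytes


def decode_value(raw: str, charset: str) -> str:
    """Percent-decode a form value and interpret the bytes in `charset`."""
    return unquote_to_bytes(raw).decode(charset)


def is_valid_utf8(raw: str) -> bool:
    """Return True if the percent-decoded bytes are well-formed UTF-8."""
    try:
        decode_value(raw, "utf-8")
        return True
    except UnicodeDecodeError:
        return False


# Ambiguous case: these two bytes are valid in both charsets,
# but they mean different things.
ambiguous = "%C3%A9"
as_utf8 = decode_value(ambiguous, "utf-8")          # 'é'
as_latin9 = decode_value(ambiguous, "iso-8859-15")  # 'Ã©'

# Detectable case: 0xA4 (the euro sign in ISO 8859-15) is not valid UTF-8.
detectable = is_valid_utf8("%A4")  # False
```

A receiver that guesses UTF-8 only learns it guessed wrong in the second case; in the first case it silently gets different text than the sender intended.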

It has already caused issues for the sending browser (Chromium) itself, as it assumed UTF-8 decoding within its own dev tools. This issue alone should illustrate the need to be precise with the content type and to correctly advertise a non-default charset.

See also the related chromium issue: https://bugs.chromium.org/p/chromium/issues/detail?id=1511226#c7

TL;DR: I'd like Content-Type parsing to be changed to always use UTF-8 unless a different charset is specified, and serialization of a new HTTP request to always add the charset (as specified in RFC 7231) to the Content-Type, EXCEPT when it is the default of UTF-8.

This should help resolve all the issues caused by the silent removal of the charset in the living standard. Any RFC 7231-compliant parser should have no issue with this change, as it is still within RFC 7231's specification.
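As a rough illustration of the proposed rule (not spec text and not any browser's implementation), the two directions could look like this in Python. The function names `content_type_for` and `charset_from_content_type` are hypothetical, and the standard library's `email.message.Message` is used here only as a convenient header-parameter parser.

```python
import codecs
from email.message import Message

FORM_TYPE = "application/x-www-form-urlencoded"


def content_type_for(encoding: str) -> str:
    """Serialize: add the charset parameter unless the encoding is UTF-8."""
    # codecs.lookup() normalizes aliases such as "UTF8" to "utf-8".
    if codecs.lookup(encoding).name == "utf-8":
        return FORM_TYPE
    return f"{FORM_TYPE}; charset={encoding}"


def charset_from_content_type(value: str, default: str = "utf-8") -> str:
    """Parse: use the charset parameter if present, otherwise assume UTF-8."""
    msg = Message()
    msg["Content-Type"] = value
    return msg.get_content_charset(failobj=default)


print(content_type_for("utf-8"))
print(content_type_for("iso-8859-15"))
print(charset_from_content_type(FORM_TYPE))
print(charset_from_content_type(f"{FORM_TYPE}; charset=ISO-8859-15"))
```

Under this sketch a UTF-8 form submission produces exactly the header that is sent today, so only submissions in other encodings would see a changed Content-Type.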

annevk added the normative change, needs implementer interest, topic: forms, and needs tests labels Feb 13, 2024

annevk (Member) commented Feb 13, 2024

There's no silent removal. It's just never included for application/x-www-form-urlencoded. And that is a bit weird as in Fetch we do include it (it's always UTF-8 there).

It might be that we can still change this if Gecko transmits it (I thought it used to, at least). If Gecko doesn't, I'd be a bit more worried about compatibility fallout.
