Remove the requirement to NOT include the charset within the Content-Type #10016

agowa · 2023-12-21T01:07:46Z

What is the issue with the HTML Standard?

The strict requirement to omit the charset from the Content-Type prevents its use as described in RFC 7231.
I can see how omitting it is fine for UTF-8, but since other charsets are still allowed, we shouldn't forbid specifying which charset is actually used. Currently, for example, an application/x-www-form-urlencoded POST request can use any encoding, and the receiving server has to guess which one. A recipient cannot even rely on it being UTF-8 when nothing else is specified, because web browsers use whatever encoding the server used for the response to the GET request for the HTML document that contained the form. For example, if the server sent the HTML as ISO 8859-15, web browsers will silently use ISO 8859-15 for the submission, and because of the current requirement to NOT specify a charset, they will NOT label it.

In some cases this is detectable because it causes a decoding error, but in other cases the same bytes decode validly in multiple charsets. This encoding confusion can also cause e.g. a WAF (Web Application Firewall) to make incorrect decisions, cause incorrect logging, or worse.
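To illustrate both cases, here is a minimal Python sketch using only the standard library's `urllib.parse`. The field name `name`, the sample values, and the helper `decode` are made up for this example; it is not taken from any browser or server implementation.

```python
from urllib.parse import parse_qs, urlencode

# A form submitted from a page served as ISO 8859-15: "é" becomes the single byte 0xE9.
body = urlencode({"name": "é"}, encoding="iso-8859-15")  # "name=%E9"


def decode(form_body: str, charset: str) -> dict:
    """Decode a urlencoded body, raising instead of silently replacing bad bytes."""
    return parse_qs(form_body, encoding=charset, errors="strict")


# Case 1: detectable. 0xE9 on its own is not valid UTF-8, so a strict decoder fails.
try:
    decode(body, "utf-8")
    detectable = False
except UnicodeDecodeError:
    detectable = True

# Case 2: not detectable. The bytes C3 A9 are valid in both charsets,
# but they mean different things, and nothing in the request says which.
ambiguous = "name=%C3%A9"
as_utf8 = decode(ambiguous, "utf-8")
as_latin9 = decode(ambiguous, "iso-8859-15")
```

In the second case the receiver gets "é" under one guess and "Ã©" under the other, with no error to hint that anything went wrong.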

It has already caused issues for the sending browser (Chromium) itself, as it assumed UTF-8 decoding within its own dev tools. This issue alone should illustrate the need to be precise with the content type and to correctly advertise a non-default charset.

See also the related Chromium issue: https://bugs.chromium.org/p/chromium/issues/detail?id=1511226#c7

TL;DR: I'd like Content-Type parsing to be changed to always assume UTF-8 unless a different charset is specified. Likewise, when serializing a new HTTP request, the charset should always be added to the Content-Type (as specified in RFC 7231), EXCEPT when it is the default of UTF-8.

This should help resolve all the issues caused by the silent removal of the charset in the living standard. Any RFC 7231 compliant parser should not have any issue with this change, as it stays within what RFC 7231 specifies.
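A rough Python sketch of the rule proposed above, assuming standard-library helpers only. The function names `serialize_form`, `parse_form`, and `charset_from_content_type` are hypothetical and exist only for this illustration; this is not what any browser or the HTML Standard currently does.

```python
import codecs
from email.message import Message
from urllib.parse import parse_qs, urlencode

FORM_TYPE = "application/x-www-form-urlencoded"
DEFAULT_CHARSET = "utf-8"


def charset_from_content_type(content_type: str) -> str:
    """Return the charset parameter, or UTF-8 when none is given."""
    msg = Message()
    msg["Content-Type"] = content_type
    return msg.get_content_charset(DEFAULT_CHARSET)


def serialize_form(fields: dict, charset: str = DEFAULT_CHARSET) -> tuple:
    """Sender side: label the charset unless it is the UTF-8 default."""
    body = urlencode(fields, encoding=charset).encode("ascii")
    if codecs.lookup(charset).name == "utf-8":
        return FORM_TYPE, body
    return f"{FORM_TYPE}; charset={charset}", body


def parse_form(content_type: str, body: bytes) -> dict:
    """Receiver side: decode with the labeled charset, else UTF-8."""
    charset = charset_from_content_type(content_type)
    return parse_qs(body.decode("ascii"), encoding=charset, errors="strict")


# A non-default charset is labeled, so the receiver no longer has to guess.
content_type, body = serialize_form({"name": "é"}, "iso-8859-15")
fields = parse_form(content_type, body)

# The UTF-8 default stays unlabeled, matching what is sent today.
utf8_type, utf8_body = serialize_form({"name": "é"})
```

With this rule an unlabeled request always means UTF-8, and any other encoding round-trips because the label travels with the body.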

annevk · 2024-02-13T13:00:35Z

There's no silent removal. It's just never included for application/x-www-form-urlencoded. And that is a bit weird as in Fetch we do include it (it's always UTF-8 there).

It might be that we can still change this if Gecko transmits it (I thought it used to, at least). If Gecko doesn't, I'd be a bit more worried about compatibility fallout.

annevk added the labels "normative change", "needs implementer interest", "topic: forms", and "needs tests" on Feb 13, 2024
