Remove the requirement to NOT include the charset within the Content-Type #10016
Labels
needs implementer interest
Moving the issue forward requires implementers to express interest
needs tests
Moving the issue forward requires someone to write tests
normative change
topic: forms
What is the issue with the HTML Standard?
The strict requirement to omit the charset from the Content-Type prevents its usage as stated in RFC 7231.
I can see how not including it in the case of UTF-8 is OK, but as long as different charsets are still allowed, we shouldn't forbid specifying which charset is actually used. Currently, e.g., an application/x-www-form-urlencoded POST request can have any encoding, and a receiving server has to guess it. A recipient cannot even rely on the body being UTF-8 when nothing else is specified, because web browsers use whatever encoding the server used for the response to the GET request for the HTML document that contained the form. E.g., if the server sent the HTML using ISO 8859-15, then web browsers will silently submit the form in ISO 8859-15, and because of the current requirement to NOT specify a charset they will NOT label the charset they actually used.
In some cases this is detectable as it will cause a decoding error, but in other cases it will have a valid decoding in multiple charsets. This type confusion can also cause e.g. a WAF (Web Application Firewall) to make incorrect decisions, cause incorrect logging, or worse.
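To illustrate both cases, here is a minimal sketch using Python's standard `urllib.parse`. The sample strings are my own picks for illustration and are not taken from the Chromium report: the first body fails to decode as UTF-8, the second decodes cleanly in both charsets but yields different text.

```python
from urllib.parse import quote, unquote

# Detectable case: "€" is the single byte 0xA4 in ISO 8859-15,
# which is not a valid UTF-8 sequence on its own.
detectable_body = quote("€", encoding="iso8859-15")
try:
    unquote(detectable_body, encoding="utf-8", errors="strict")
    detectable_failed = False
except UnicodeDecodeError:
    detectable_failed = True

# Ambiguous case: "Ã©" in ISO 8859-15 is the byte pair 0xC3 0xA9,
# which is also the valid UTF-8 encoding of "é".
ambiguous_body = quote("Ã©", encoding="iso8859-15")
as_utf8 = unquote(ambiguous_body, encoding="utf-8", errors="strict")
as_latin9 = unquote(ambiguous_body, encoding="iso8859-15", errors="strict")

print(detectable_body, detectable_failed)
print(ambiguous_body, repr(as_utf8), repr(as_latin9))
```

Without a charset label, a recipient looking only at the second body has no way to tell which of the two readings the sender intended.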
This has already caused issues for the sending browser (Chromium) itself, as it assumed UTF-8 when decoding the request within its own dev tools. This issue alone should illustrate the need to be precise with the content type and to correctly advertise a non-default charset.
See also the related Chromium issue: https://bugs.chromium.org/p/chromium/issues/detail?id=1511226#c7
TL;DR: I'd like Content-Type parsing to be changed to always assume UTF-8 unless a different charset is specified, and serialization of a new HTTP request to always add the charset (as specified in RFC 7231) to the Content-Type, EXCEPT when it is the default of UTF-8.
This should help resolve the issues caused by the silent removal of the charset in the living standard. Any RFC 7231 compliant parser should have no trouble with this change, as it stays within what RFC 7231 specifies.
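A rough sketch of what I mean, in Python with only the standard library. The function names `charset_of` and `content_type_for` are hypothetical helpers made up for this illustration, not anything from the HTML Standard or an existing implementation:

```python
import codecs
from email.message import Message


def charset_of(content_type: str) -> str:
    """Parsing side: use the labelled charset, otherwise assume UTF-8."""
    msg = Message()
    msg["Content-Type"] = content_type
    # get_content_charset() lowercases the value and returns the
    # fallback when no charset parameter is present.
    return msg.get_content_charset("utf-8")


def content_type_for(media_type: str, encoding: str) -> str:
    """Serializing side: add the charset unless it is the UTF-8 default."""
    if codecs.lookup(encoding).name == "utf-8":
        return media_type
    return f"{media_type}; charset={encoding}"


form = "application/x-www-form-urlencoded"
print(content_type_for(form, "UTF-8"))
print(content_type_for(form, "ISO-8859-15"))
print(charset_of(content_type_for(form, "ISO-8859-15")))
```

With this rule an unlabelled request always means UTF-8, and anything else round-trips through the `charset` parameter instead of being guessed.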