Revamp MIME type section #36

annevk · 2017-10-06T13:09:26Z

TODO:

Update the remainder of the document to use the new terminology
Add byte sequence parser and serializer
Define parameters as ASCII strings too

annevk · 2017-10-06T13:12:37Z

This contains just parsing MIME types. I haven't actually changed everything yet since I figured I'd ask for some feedback first.

My idea was to align this with URL. So we have MIME types, which are also known as MIME type records. And MIME type strings, which serve as input (maybe I'll define that part later, I'd like to focus on implementation requirements initially). We should probably also have some byte sequence entry points for most of these, but since going from bytes to strings is easy with isomorphic decode I thought making the main parser string-based is best.

Thoughts?
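
For readers unfamiliar with the term, isomorphic decode maps each byte to the code point with the same value, which is why a byte-based entry point can sit on top of a string-based parser. A minimal sketch, assuming Python's `latin-1` codec as the stand-in for that mapping (the function names are illustrative, not taken from the spec text):

```python
def isomorphic_decode(data: bytes) -> str:
    """Map each byte to the code point with the same value (U+0000 to U+00FF)."""
    # Latin-1 decoding is exactly this byte-to-code-point identity mapping.
    return data.decode("latin-1")


def isomorphic_encode(text: str) -> bytes:
    """Inverse mapping; only defined when every code point is <= U+00FF."""
    if any(ord(ch) > 0xFF for ch in text):
        raise ValueError("input contains a code point above U+00FF")
    return text.encode("latin-1")


# A Content-Type header value arrives as bytes and becomes parser input.
header_value = b"text/html;charset=gbk"
parser_input = isomorphic_decode(header_value)
```

Because the mapping is lossless in both directions for byte-range code points, no information is lost by parsing the string form.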

domenic · 2017-10-06T17:12:22Z

I like the general plan of aligning with URL. Although the spec is already like that, just a bit dated, right? (E.g. using multiple return values instead of a struct.)

I'd like to hear more justification for using strings as input, instead of bytes. What call sites use which? I'm not aware of how many call sites would use this parsing algorithm, so it's hard to judge. Bytes seems more correct though at first impression. If you do go with strings, it should be stated to be a JS string, I think.

The current spec has XXX boxes that are, as far as I can tell, designed to limit certain things to 127 bytes. Those have disappeared, but it seems important to do some testing there, or at least preserve some kind of XXX box.

annevk · 2017-10-06T17:35:28Z

XMLHttpRequest's overrideMimeType() in particular would be hard to define if it was based on bytes. You could roundtrip with UTF-8 and probably be fine though. But also, we have many more utilities for processing strings.

GPHemsley · 2017-10-07T05:38:28Z

Indeed, I believe some spec somewhere (an RFC, maybe?) had a limit of 127 bytes, but I questioned whether it was actually enforced by any implementation.

annevk · 2017-10-09T09:33:51Z

Content-Type: text/html;123456789;123456789;123456789;123456789;123456789;123456789;123456789;123456789;123456789;123456789;123456789;123456789;123456789;charset=gbk results in GBK in all implementations. We also generally don't do limits, so it seems good to get rid of that.

annevk · 2017-10-09T09:34:24Z

I also don't think we should state a specific string type, this works with all of them after all. Why require casts?

domenic · 2017-10-09T22:36:39Z

You're right, I forgot that our only two string types were JS string and SV string.

I'm still not convinced this should be strings instead of bytes though. The lack of utilities for parsing/manipulating byte sequences is fixable. I'd like to get an accounting of the call sites before we decide one way or another.

The XHR overrideMimeType() example is interesting precisely because I'm unsure how it should behave given that it operates on a DOMString (instead of a ByteString). What bytes do we actually transmit over HTTP? It seems like the spec right now sometimes stores a string in the "override MIME type" field, e.g. overrideMimeType() step 3, but sometimes stores a byte sequence, e.g. overrideMimeType() step 2.

Your version not only accepts strings as inputs, but also creates them as outputs in the MIME type struct, which seems quite bad if we're eventually sending these over the wire?

annevk · 2017-10-10T09:15:46Z

> What bytes do we actually transmit over HTTP?

None, overrideMimeType() is an override for the response. We could use USVString and UTF-8 encode I think, without observable effects, but I'm not sure that's useful as there are other callers that want to operate on strings, such as several attributes defined in HTML. data: URLs and Content-Type need byte-based processing, but it's easy enough to make them use a wrapper.

annevk · 2017-10-10T09:16:46Z

And to be clear, I do intend to provide a "parse a MIME type from bytes" and "serialize a MIME type to bytes" for those cases, with the necessary asserts on the input for the latter of the two.
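
The byte-based wrappers described here could look roughly like the following. This is only a sketch under assumptions: the string-based parser and serializer are passed in as callables because their definitions are not part of this comment, and the stand-in lambdas at the bottom are hypothetical placeholders, not the real algorithms.

```python
from typing import Callable, Optional, TypeVar

T = TypeVar("T")


def parse_from_bytes(data: bytes, parse_string: Callable[[str], Optional[T]]) -> Optional[T]:
    # Isomorphic decode, then hand off to the string-based parser.
    return parse_string(data.decode("latin-1"))


def serialize_to_bytes(mime_type: T, serialize_string: Callable[[T], str]) -> bytes:
    serialized = serialize_string(mime_type)
    # Assert on the input: every code point must fit in a single byte.
    assert all(ord(ch) <= 0xFF for ch in serialized), "not isomorphic-encodable"
    return serialized.encode("latin-1")


# Hypothetical stand-ins for the string-based algorithms, for illustration only.
stub_parse = lambda s: tuple(s.split("/", 1)) if "/" in s else None
stub_serialize = lambda m: "/".join(m)

parsed = parse_from_bytes(b"text/html", stub_parse)
round_tripped = serialize_to_bytes(parsed, stub_serialize)
```

Only serialization needs the assert, since any byte sequence can be isomorphic decoded but not every string can be encoded back.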

GPHemsley · 2017-10-14T00:50:29Z

> Content-Type: text/html;123456789;123456789;123456789;123456789;123456789;123456789;123456789;123456789;123456789;123456789;123456789;123456789;123456789;charset=gbk results in GBK in all implementations. We also generally don't do limits, so it seems good to get rid of that.

The (supposed) 127-byte limit was on the individual portions (type, subtype, parameter name, parameter value), not the overall MIME type.

annevk · 2017-10-16T09:27:47Z

text/html;0123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789=x;charset=gbk yields GBK. I also haven't found any evidence of it in implementations for which I could inspect the source code.
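
The two test inputs in this thread can be reproduced against a deliberately simplified parser that skips malformed segments and imposes no length limits. This is a sketch, not the algorithm in the pull request: quoted values and most whitespace rules are left out, and `parse_simple` is a hypothetical name.

```python
def parse_simple(value: str):
    """Return (type, subtype, parameters) or None. Quoted values are not handled."""
    essence, *rest = value.split(";")
    type_, sep, subtype = essence.strip(" \t\r\n").partition("/")
    if not sep or not type_ or not subtype:
        return None
    parameters = {}
    for segment in rest:
        name, sep, val = segment.partition("=")
        name = name.lstrip(" \t\r\n").lower()
        if not sep or not name:
            continue  # e.g. "123456789" has no "=", so it is skipped
        parameters.setdefault(name, val)  # first occurrence wins, no length cap
    return type_.lower(), subtype.lower(), parameters


junk_segments = "text/html;" + "123456789;" * 13 + "charset=gbk"
long_name = "0123456789" * 14  # 140 characters, past the supposed 127-byte limit
long_parameter = "text/html;" + long_name + "=x;charset=gbk"

print(parse_simple(junk_segments)[2]["charset"])   # gbk
print(parse_simple(long_parameter)[2]["charset"])  # gbk
```

In both cases the `charset` parameter survives, matching the observation that implementations yield GBK.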

foolip · 2017-10-25T15:02:28Z

From some testing in #39 I think that we do need to preserve whitespace around separators. For charset normalization, bogus charsets should turn into UTF-8.

For charset=utf-8, it's debatable. It's plausible that stuff could depend on both the encoding being untouched and it being normalized. I think I have a slight preference to normalize it, since that is what document.charset is supposed to do.

domenic · 2017-11-14T13:46:49Z

We should test what happens with non-ASCII characters in the various segments of the MIME type; the spec seems to just ASCII lowercase the strings and pass them through. I wonder if that's how browsers treat them. And I wonder if browsers treat them differently when given HTTP header bytes or other byte-accepting entry points vs. XHR overrideMimeType or other string-accepting APIs.
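
To make the concern concrete: ASCII lowercasing only touches A to Z, so non-ASCII code points pass through unchanged, whereas a Unicode-aware lowercase can rewrite them. A small sketch (the helper name is illustrative):

```python
def ascii_lowercase(text: str) -> str:
    # Only U+0041 (A) to U+005A (Z) are mapped; everything else passes through.
    return "".join(chr(ord(ch) + 0x20) if "A" <= ch <= "Z" else ch for ch in text)


kelvin = "\u212a"  # KELVIN SIGN, visually similar to "K"
value = "TEXT/" + kelvin

print(ascii_lowercase(value) == "text/\u212a")  # True: the Kelvin sign is preserved
print(value.lower() == "text/k")                # True: Unicode lowercasing maps it to "k"
```

An implementation that used a Unicode-aware lowercase here would therefore produce a different subtype than one following the ASCII-only rule.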

domenic · 2017-11-14T13:49:13Z

As far as I can tell the proposed spec skips whitespace after the = but collects it before the = sign. http://httpwg.org/specs/rfc7231.html says no whitespace is allowed. We should test what browsers do in such scenarios if you haven't already. If you have, adding a note about this potentially confusing mismatch would be good. Both that it mismatches the RFC, and that it treats whitespace before the = sign differently from whitespace after it.

annevk · 2017-11-14T14:06:14Z

It doesn't mismatch the RFC any more than anything else that the RFC would say is invalid and is simply consumed as part of the token here, no? (There are tests for this already and browser bugs have been filed, see #34.)

domenic · 2017-11-14T14:24:15Z

Well, I don't see anything about consuming as part of a token in the RFC. But as far as I can tell the RFC has token both before and after the = sign, whereas you ignore whitespace before the = sign, but preserve it afterward.

I see you tested spaces before the = sign, but did you test spaces after?

annevk · 2017-11-14T16:46:16Z

How is it ignored before the = sign? I strip it after currently, but that should be dropped as the Encoding Standard handles that already for encodings.

domenic · 2017-11-14T17:21:30Z

Sorry, I got confused between my two posts. You ignore it after the = sign, but don't ignore it before the = sign. #34 has tests for before the = sign, which I guess is what led you to the behavior of not skipping whitespace there. I was wondering if there are any tests for after the = sign, which would help decide on spaces or no there.

It'd be ideal if there were a way of testing this independent of encoding handling, but I guess there is not in browsers today.
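
The asymmetry under discussion can be illustrated as follows. This sketch models only the draft behaviour as described in this exchange (whitespace kept before the = sign, skipped after it); annevk notes above that the stripping after the = sign should be dropped, so this is not the final behaviour. The function name and the whitespace set are assumptions.

```python
WHITESPACE = " \t\r\n"  # assumed set of whitespace code points


def split_parameter_draft(segment: str):
    name, sep, value = segment.partition("=")
    if not sep:
        return None
    # Before "=": whitespace is collected into the name, not skipped.
    # After "=": leading whitespace is skipped.
    return name, value.lstrip(WHITESPACE)


print(split_parameter_draft("charset =gbk"))  # ('charset ', 'gbk')
print(split_parameter_draft("charset= gbk"))  # ('charset', 'gbk')
```

In the first case the name no longer equals `charset`, which is why the placement of the whitespace matters for encoding detection.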

annevk · 2017-11-14T17:27:06Z

If browsers agree with all the proposed changes we could test it through data URLs and XMLHttpRequest, but that would first require browsers to actually start storing unknown parameters and such. So yeah, not in today's browsers.

domenic · 2017-11-14T17:29:15Z

Well, something to keep in mind at least; it'd be nice to write such tests and ask browsers to follow that model. But you're more in touch with whether that's realistic than I am.

annevk · 2017-11-14T17:48:29Z

I already wrote such a test: web-platform-tests/wpt#6890 (review).

See whatwg/mimesniff#36.

yutakahirano · 2017-12-01T04:09:58Z

mimesniff.bs

     <li>
-      Enter loop <var>M</var>:
+      <p>If the current <a>code point</a> in <var>input</var> is U+0022 ("), then advance


This part accepts (and ignores) unterminated quoted strings, such as 'text/plain; charset="utf-8' or 'text/plain; charset="utf-8; param2=value2'. I think it's too lax, what do you think? Is it OK to fail parsing in such cases?

Hmm, it seems your design principle is to let the parser parse type and subtype whenever they are sane (i.e., regardless of the "parameter" section). Is that right? Then this behavior may be OK...

annevk · 2017-12-01

When testing I found that only Safari would not treat text/html;charset="gbk as GBK. Therefore I went with the majority. But yes, nobody returned failure and started downloading such a resource, so I don't think we can start with that now.
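
A lenient quoted-value collector of the kind discussed in this review thread might look like the sketch below. It is an illustration of tolerating a missing closing quote, not the text of the pull request; `collect_quoted_value` is a hypothetical name and the backslash handling is an assumption.

```python
def collect_quoted_value(text: str, position: int):
    """text[position] must be '"'. Returns (value, new_position)."""
    assert text[position] == '"'
    position += 1
    value = []
    while position < len(text):
        ch = text[position]
        position += 1
        if ch == '"':
            break  # properly terminated
        if ch == "\\" and position < len(text):
            ch = text[position]  # backslash escapes the next code point
            position += 1
        value.append(ch)
    # Reaching the end of input without a closing quote is not an error.
    return "".join(value), position


unterminated = 'text/html;charset="gbk'
value, end = collect_quoted_value(unterminated, unterminated.index('"'))
print(value)  # gbk
```

With this approach the unterminated input still yields `gbk`, in line with the majority browser behaviour reported above.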

domenic · 2017-12-01T23:30:17Z

I think "exclude parameters" on the serializer is useless because we already have "essence". Instead of saying e.g. "Let result be mimeType serialized without parameters", you'd just say "Let result be mimeType's essence".
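
The point can be sketched with a small record type: once essence is available, a serializer flag for dropping parameters adds nothing. The class below is an assumed shape for illustration only, and the serializer omits the quoting that some parameter values would need.

```python
from dataclasses import dataclass, field


@dataclass
class MimeType:
    type: str
    subtype: str
    parameters: dict = field(default_factory=dict)

    @property
    def essence(self) -> str:
        # Type and subtype joined by "/", with no parameters.
        return f"{self.type}/{self.subtype}"


def serialize(mime_type: MimeType) -> str:
    # Simplified: values that would need quoting are not handled here.
    out = mime_type.essence
    for name, value in mime_type.parameters.items():
        out += f";{name}={value}"
    return out


html = MimeType("text", "html", {"charset": "gbk"})
print(html.essence)     # text/html
print(serialize(html))  # text/html;charset=gbk
```

Callers that want the form without parameters read `essence` directly instead of passing a flag to `serialize`.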

annevk · 2017-12-04T12:28:36Z

Commit message:

Define a new MIME type model, parser, and serializer

This addresses all open inline issues with respect to the parser and serializer, aligns both closer with implementations, except where those stood in the way of an improved model.

This also updates all of it to make extensive use of the Infra Standard.

See #42 for the testing story (including all linked issues) and https://github.com/w3c/web-platform-tests/pull/7764 for the majority of tests.

For whatwg/mimesniff#36.

annevk · 2017-12-07T13:21:04Z

https://bugzilla.mozilla.org/show_bug.cgi?id=1423877
https://bugs.webkit.org/show_bug.cgi?id=180526
https://bugs.chromium.org/p/chromium/issues/detail?id=792880
https://developer.microsoft.com/en-us/microsoft-edge/platform/issues/14995014/

This follows whatwg/mimesniff#58 by referencing the definitions for JavaScript and JSON MIME type that now live in MIME Sniffing. It also follows whatwg/mimesniff#36 by using the terms "valid MIME type string" and "valid MIME type string without parameters" instead of their non-string counterparts that previously appeared. Finally, it updates the terms "explicitly supported XML/JSON type" to include the word "MIME", like other MIME type group definitions now do.

See whatwg/xhr#176 and whatwg/mimesniff#36.

…ng, a=testonly

Automatic update from web-platform-tests
Adjust XMLHttpRequest Content-Type handling

See whatwg/xhr#176 and whatwg/mimesniff#36.
--
wpt-commits: 84e7972a0518fb57f39740143d4b63e79b14e9f4
wpt-pr: 8422

UltraBlame original commit: be93580d93e3b6e94946124f06c188bc8daac745

The term "parsable MIME type" used to be part of the MIME Sniffing standard, but it was removed in whatwg/mimesniff#36. This change replaces its uses with equivalent phrasing that references the "parse a MIME type" algorithm. It also replaces mentions of "ASCII-encoded strings" with the Infra standard's definition of "ASCII string". Closes w3c#170.

annevk mentioned this pull request Oct 9, 2017

Parsable MIME type is going away w3c/preload#113

Closed

annevk mentioned this pull request Oct 10, 2017

Stop using "alphabetical" #18

Open

domenic mentioned this pull request Oct 10, 2017

Do separators need to be preserved when parsing? #39

Closed

annevk mentioned this pull request Nov 24, 2017

MIME type parsing, stricter rules #44

Closed

annevk added a commit to web-platform-tests/wpt that referenced this pull request Nov 24, 2017

Adjust XMLHttpRequest Content-Type handling

be75de8

See whatwg/mimesniff#36.

This was referenced Nov 24, 2017

Adjust XMLHttpRequest Content-Type handling web-platform-tests/wpt#8422

Merged

Sort out MIME type tests #42

Closed

MIME type parsing, code points #45

Closed

Look at overrideMimeType() again whatwg/xhr#157

Closed

yutakahirano reviewed Dec 1, 2017

domenic added a commit to jsdom/whatwg-mimetype that referenced this pull request Dec 2, 2017

Remove excludeParameters per whatwg/mimesniff#36 (comment)

9884aa3

annevk added 3 commits December 4, 2017 10:05

Address review feedback

33eedbc

preserve IDs HTML relies on and some that make sense not to break

534a51d

remove upstreamed ref

3f70580

domenic mentioned this pull request Dec 4, 2017

Don't include fragment in data: URL body jsdom/jsdom#2073

Merged

annevk mentioned this pull request Dec 5, 2017

Define Content-Type manipulation in terms of MIME Sniffing whatwg/xhr#176

Merged

annevk merged commit cc81ec4 into master Dec 7, 2017

annevk deleted the annevk/mime-type branch December 7, 2017 13:11

annevk added a commit to web-platform-tests/wpt that referenced this pull request Dec 7, 2017

MIME type parsing tests

b15e885

For whatwg/mimesniff#36.

SimonSapin mentioned this pull request Feb 1, 2018

Content-Type parsing (MIME type parsing) #30

Closed

domenic mentioned this pull request Feb 5, 2018

Editorial: update usage of the MIME Sniffing Standard whatwg/html#3455

Merged

annevk added a commit to web-platform-tests/wpt that referenced this pull request Apr 10, 2018

Adjust XMLHttpRequest Content-Type handling

6a721f9

See whatwg/xhr#176 and whatwg/mimesniff#36.

annevk added a commit to web-platform-tests/wpt that referenced this pull request Apr 16, 2018

Adjust XMLHttpRequest Content-Type handling

84e7972

See whatwg/xhr#176 and whatwg/mimesniff#36.

annevk mentioned this pull request Sep 10, 2018

Allow quoted empty string MIME type parameter values #79

Merged

andreubotella mentioned this pull request Apr 26, 2021

Issues with MIME types w3c/FileAPI#170

Open

andreubotella mentioned this pull request May 11, 2021

Editorial: Remove any references to "parsable MIME type" w3c/FileAPI#172

Open

4 tasks

GPHemsley mentioned this pull request Sep 14, 2024

Algorithm for determining what is and is not a MIME type seems to be undefined. #194

Open
