Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Revamp MIME type section #36

Merged
merged 18 commits into from
Dec 7, 2017
Merged

Revamp MIME type section #36

merged 18 commits into from
Dec 7, 2017

Conversation

annevk
Copy link
Member

@annevk annevk commented Oct 6, 2017

TODO:

  • Update the remainder of the document to use the new terminology
  • Add byte sequence parser and serializer
  • Define parameters as ASCII strings too

Preview | Diff

@annevk
Copy link
Member Author

annevk commented Oct 6, 2017

This contains just parsing MIME types. I haven't actually changed everything yet since I figured I'd ask for some feedback first.

My idea was to align this with URL. So we have MIME types, which are also known as MIME type records. And MIME type strings, which serve as input (maybe I'll define that part later, I'd like to focus on implementation requirements initially). We should probably also have some byte sequence entry points for most of these, but since going from bytes to strings is easy with isomorphic decode I thought making the main parser string-based is best.

Thoughts?

@domenic
Copy link
Member

domenic commented Oct 6, 2017

I like the general plan of aligning with URL. Although the spec is already like that, just a bit dated, right? (E.g. using multiple return values instead of a struct.)

I'd like to hear more justification for using strings as input, instead of bytes. What call sites use which? I'm not aware of how many call sites would use this parsing algorithm, so it's hard to judge. Bytes seems more correct though at first impression. If you do go with strings, it should be stated to be a JS string, I think.

The current spec has XXX boxes that are, as far as I can tell, designed to limit certain things to 127 bytes. Those have disappeared, but it seems important to do some testing there, or at least preserve some kind of XXX box.

@annevk
Copy link
Member Author

annevk commented Oct 6, 2017

XMLHttpRequest's overrideMimeType() in particular would be hard to define if it was based on bytes. You could roundtrip with UTF-8 and probably be fine though. But also, we have much more utilities for processing strings.

@GPHemsley
Copy link
Member

Indeed, I believe some spec somewhere (an RFC, maybe?) had a limit of 127 bytes, but I questioned whether it was actually enforced by any implementation.

@annevk
Copy link
Member Author

annevk commented Oct 9, 2017

Content-Type: text/html;123456789;123456789;123456789;123456789;123456789;123456789;123456789;123456789;123456789;123456789;123456789;123456789;123456789;charset=gbk results in GBK in all implementations. We also generally don't do limits, so it seems good to get rid of that.

@annevk
Copy link
Member Author

annevk commented Oct 9, 2017

I also don't think we should state a specific string type, this works with all of them after all. Why require casts?

@domenic
Copy link
Member

domenic commented Oct 9, 2017

You're right, I forgot that our only two string types were JS string and SV string.

I'm still not convinced this should be strings instead of bytes though. The lack of utilities for parsing/manipulating byte sequences is fixable. I'd like to get an accounting of the call sites before we decide one way or another.

The XHR overrideMimeType() example is interesting precisely because I'm unsure how it should behave given that it operates on a DOMString (instead of a ByteString). What bytes do we actually transmit over HTTP? It seems like the spec right now sometimes stores a string in the "override MIME type" field, e.g. overrideMimeType() step 3, but sometimes stores a byte sequence, e.g. overrideMimeType() step 2.

Your version not only accepts strings as inputs, but also creates them as outputs in the MIME type struct, which seems quite bad if we're eventually sending these over the wire?

@annevk
Copy link
Member Author

annevk commented Oct 10, 2017

What bytes do we actually transmit over HTTP?

None, overrideMimeType() is an override for the response. We could use USVString and UTF-8 encode I think, without observable effects, but I'm not sure that's useful as there are other callers that want to operate on strings, such as several attributes defined in HTML. data: URLs and Content-Type need byte-based processing, but it's easy enough to make them use a wrapper.

@annevk
Copy link
Member Author

annevk commented Oct 10, 2017

And to be clear, I do intend to provide a "parse a MIME type from bytes" and "serialize a MIME type to bytes" for those cases, with the necessary asserts on the input for the latter of the two.

@GPHemsley
Copy link
Member

Content-Type: text/html;123456789;123456789;123456789;123456789;123456789;123456789;123456789;123456789;123456789;123456789;123456789;123456789;123456789;charset=gbk results in GBK in all implementations. We also generally don't do limits, so it seems good to get rid of that.

The (supposed) 127-byte limit was on the individual portions (type, subtype, parameter name, parameter value), not the overall MIME type.

@annevk
Copy link
Member Author

annevk commented Oct 16, 2017

text/html;0123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789=x;charset=gbk yields GBK. I also haven't found any evidence of it in implementations for which I could inspect the source code.

@foolip
Copy link
Member

foolip commented Oct 25, 2017

From some testing in #39 I think that we do need to preserve whitespace around separators. For charset normalization, bogus charsets should turn into UTF-8.

For charset=utf-8, it's debatable. It's plausible that stuff could depend on both the encoding being untouched and it being normalized. I think I have a slight preference to normalize it, since that is what document.charset is supposed to do.

@domenic
Copy link
Member

domenic commented Nov 14, 2017

We should test what happens with non-ASCII characters in the various segments of the MIME type; the spec seems to just ASCII lowercase the strings and pass them through. I wonder if that's how browsers treat them. And I wonder if browsers treat them differently when given HTTP header bytes or other byte-accepting entry points vs. XHR overrideMimeType or other string-accepting APIs.

@domenic
Copy link
Member

domenic commented Nov 14, 2017

As far as I can tell the proposed spec skips whitespace after the = but collects it before the = sign. http://httpwg.org/specs/rfc7231.html says no whitespace is allowed. We should test what browsers do in such scenarios if you haven't already. If you have, adding a note about this potentially-confusing mismatch would be good. Both that it mismatches the RFC, and that it treats before different than after.

@annevk
Copy link
Member Author

annevk commented Nov 14, 2017

It doesn't mismatch the RFC anymore than anything else that the RFC would say is invalid and is simply consumed as part of the token here, no? (There are tests for this already and browser bugs have been filed, see #34.)

@domenic
Copy link
Member

domenic commented Nov 14, 2017

Well, I don't see anything about consuming as part of a token in the RFC. But as far as I can tell the RFC has token both before and after the = sign, whereas you ignore whitespace before the = sign, but preserve it afterward.

I see you tested spaces before the = sign, but did you test spaces after?

@annevk
Copy link
Member Author

annevk commented Nov 14, 2017

How is it ignored before the = sign? I strip it after currently, but that should be dropped as the Encoding Standard handles that already for encodings.

@domenic
Copy link
Member

domenic commented Nov 14, 2017

Sorry, I got confused between my two posts. You ignore it after the = sign, but don't ignore it before the = sign. #34 has tests for before the = sign, which I guess is what led you to the behavior of not skipping whitespace there. I was wondering if there are any tests for after the = sign, which would help decide on spaces or no there.

It'd be ideal if there were a way of testing this independent of encoding handling, but I guess there is not in browsers today.

@annevk
Copy link
Member Author

annevk commented Nov 14, 2017

If browsers agree with all the proposed changes we could test it through data URLs and XMLHttpRequest, but that would first require browsers to actually start storing unknown parameters and such. So yeah, not in today's browsers.

@domenic
Copy link
Member

domenic commented Nov 14, 2017

Well, something to keep in mind at least; it'd be nice to write such tests and ask browsers to follow that model. But you're more in touch with whether that's realistic than I am.

@annevk
Copy link
Member Author

annevk commented Nov 14, 2017

I already wrote such a test: web-platform-tests/wpt#6890 (review).

mimesniff.bs Outdated
<li>
Enter loop <var>M</var>:
<p>If the current <a>code point</a> in <var>input</var> is U+0022 ("), then advance
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This part accepts (and ignores) unterminated quoted strings, such as 'text/plain; charset="utf-8' or 'text/plain; charset="utf-8; param2=value2'. I think it's too lax, what do you think? Is it OK to fail parsing in such cases?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hmm, it seems your design principle is to let the parser parse type and subtype whenever they are sane (i.e., regardless of the "parameter" section). Is that right? Then this behavior may be OK...

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

When testing I found that only Safari would treat text/html;charset="gbk not as GBK. Therefore I went with the majority. But yes, nobody returned failure and started downloading such a resource, so I don't think we can start with that now.

@domenic
Copy link
Member

domenic commented Dec 1, 2017

I think "exclude parameters" on the serializer is useless because we already have "essence". Instead of saying e.g. "Let result be mimeType serialized without parameters", you'd just say "Let result be mimeType's essence".

domenic added a commit to jsdom/whatwg-mimetype that referenced this pull request Dec 2, 2017
@annevk
Copy link
Member Author

annevk commented Dec 4, 2017

Commit message:

Define a new MIME type model, parser, and serializer

This addresses all open inline issues with respect to the parser and serializer, aligns both closer with implementations, except where those stood in the way of an improved model.

This also updates all of it to make extensive use of the Infra Standard.

See #42 for the testing story (included all linked issues) and https://github.com/w3c/web-platform-tests/pull/7764 for the majority of tests.

@annevk annevk merged commit cc81ec4 into master Dec 7, 2017
@annevk annevk deleted the annevk/mime-type branch December 7, 2017 13:11
annevk added a commit to web-platform-tests/wpt that referenced this pull request Dec 7, 2017
domenic added a commit to whatwg/html that referenced this pull request Feb 5, 2018
This follows whatwg/mimesniff#58 by referencing
the definitions for JavaScript and JSON MIME type that now live in MIME
Sniffing. It also follows whatwg/mimesniff#36 by
using the terms "valid MIME type string" and "valid MIME type string
without parameters" instead of their non-string counterparts that
previously appeared.
domenic added a commit to whatwg/html that referenced this pull request Feb 16, 2018
This follows whatwg/mimesniff#58 by referencing
the definitions for JavaScript and JSON MIME type that now live in MIME
Sniffing. It also follows whatwg/mimesniff#36 by
using the terms "valid MIME type string" and "valid MIME type string
without parameters" instead of their non-string counterparts that
previously appeared.
domenic added a commit to whatwg/html that referenced this pull request Feb 16, 2018
This follows whatwg/mimesniff#58 by referencing
the definitions for JavaScript and JSON MIME type that now live in MIME
Sniffing. It also follows whatwg/mimesniff#36 by
using the terms "valid MIME type string" and "valid MIME type string
without parameters" instead of their non-string counterparts that
previously appeared. Finally, it updates the terms "explicitly supported
XML/JSON type" to include the word "MIME", like other MIME type group
definitions now do.
annevk added a commit to web-platform-tests/wpt that referenced this pull request Apr 10, 2018
annevk added a commit to web-platform-tests/wpt that referenced this pull request Apr 16, 2018
moz-v2v-gh pushed a commit to mozilla/gecko-dev that referenced this pull request Apr 26, 2018
…ng, a=testonly

Automatic update from web-platform-testsAdjust XMLHttpRequest Content-Type handling

See whatwg/xhr#176 and whatwg/mimesniff#36.
--

wpt-commits: 84e7972a0518fb57f39740143d4b63e79b14e9f4
wpt-pr: 8422
alice pushed a commit to alice/html that referenced this pull request Jan 8, 2019
This follows whatwg/mimesniff#58 by referencing
the definitions for JavaScript and JSON MIME type that now live in MIME
Sniffing. It also follows whatwg/mimesniff#36 by
using the terms "valid MIME type string" and "valid MIME type string
without parameters" instead of their non-string counterparts that
previously appeared. Finally, it updates the terms "explicitly supported
XML/JSON type" to include the word "MIME", like other MIME type group
definitions now do.
gecko-dev-updater pushed a commit to marco-c/gecko-dev-comments-removed that referenced this pull request Oct 2, 2019
…ng, a=testonly

Automatic update from web-platform-testsAdjust XMLHttpRequest Content-Type handling

See whatwg/xhr#176 and whatwg/mimesniff#36.
--

wpt-commits: 84e7972a0518fb57f39740143d4b63e79b14e9f4
wpt-pr: 8422

UltraBlame original commit: be93580d93e3b6e94946124f06c188bc8daac745
gecko-dev-updater pushed a commit to marco-c/gecko-dev-wordified that referenced this pull request Oct 3, 2019
…ng, a=testonly

Automatic update from web-platform-testsAdjust XMLHttpRequest Content-Type handling

See whatwg/xhr#176 and whatwg/mimesniff#36.
--

wpt-commits: 84e7972a0518fb57f39740143d4b63e79b14e9f4
wpt-pr: 8422

UltraBlame original commit: be93580d93e3b6e94946124f06c188bc8daac745
gecko-dev-updater pushed a commit to marco-c/gecko-dev-wordified-and-comments-removed that referenced this pull request Oct 3, 2019
…ng, a=testonly

Automatic update from web-platform-testsAdjust XMLHttpRequest Content-Type handling

See whatwg/xhr#176 and whatwg/mimesniff#36.
--

wpt-commits: 84e7972a0518fb57f39740143d4b63e79b14e9f4
wpt-pr: 8422

UltraBlame original commit: be93580d93e3b6e94946124f06c188bc8daac745
andreubotella pushed a commit to andreubotella/FileAPI that referenced this pull request May 11, 2021
The term "parsable MIME type" used to be part of the MIME Sniffing
standard, but it was removed in whatwg/mimesniff#36. This change
replaces its uses with equivalent phrasing that references the "parse a
MIME type" algorithm. It also replaces mentions of "ASCII-encoded
strings" with the Infra standard's definition of "ASCII string".

Closes w3c#170.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Development

Successfully merging this pull request may close these issues.

5 participants