
Getting all bytes in a body #661

Closed
@domenic

(This spans HTML, Encoding, Streams, and Fetch. Let's tag at least @ricea to help while we're here, and @jakearchibald since he was helpful in whatwg/infra#181.)

I need help figuring out how, at a spec level, to get all bytes in a body. This is part of solving whatwg/html#3316.

Right now specs are using a few patterns:

In general, the problem here is Streams's use of the JS formalism, including promises, and how poorly that interacts with spec-level code and its different conventions and type system. The confusion between Encoding's streams and Streams's streams is also tricky.

Looking at https://html.spec.whatwg.org/#fetch-a-classic-script my first thought is that it should be written as:

  1. Let body bytes be the byte sequence obtained by [reading all the bytes] of the [body] of response.
  2. Let source text be the result of [decoding] body bytes to Unicode, using character encoding as the fallback encoding.
  3. ... muted errors ...
  4. Let script be the result of creating a classic script given body bytes, source text, ... the other stuff.

This doesn't quite work on a few levels:

  • "reading all the bytes" would at the very least need to be asynchronous, if it goes through Streams's promise machinery.
  • "reading all the bytes" might fail. This is currently unaccounted for, but if we go through Streams's machinery, it's more explicit.
  • "reading all the bytes" will need to hand-wave to get from Uint8Array to byte sequence.
  • "decoding" doesn't accept byte sequences, only "byte streams".
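To make the first three points concrete, here is a rough JavaScript-level sketch of what "reading all the bytes" looks like when it goes through the Streams machinery. `readAllBytes` is a hypothetical helper name, not something any spec or library defines. Every read is a promise, any read can reject, and the result is a series of `Uint8Array` chunks that still has to be flattened into something resembling a byte sequence.

```typescript
// Hypothetical helper (not a spec algorithm): drain a ReadableStream of
// Uint8Array chunks and concatenate them into a single Uint8Array.
async function readAllBytes(
  stream: ReadableStream<Uint8Array>
): Promise<Uint8Array> {
  const reader = stream.getReader(); // locks the stream to this reader
  const chunks: Uint8Array[] = [];
  let total = 0;
  try {
    while (true) {
      // Asynchronous: each read is a promise. It rejects if the stream errors.
      const result = await reader.read();
      if (result.done) break;
      chunks.push(result.value);
      total += result.value.byteLength;
    }
  } finally {
    reader.releaseLock();
  }
  // The Uint8Array -> "byte sequence" step: flatten the chunks.
  const bytes = new Uint8Array(total);
  let offset = 0;
  for (const chunk of chunks) {
    bytes.set(chunk, offset);
    offset += chunk.byteLength;
  }
  return bytes;
}
```

A caller has to `await readAllBytes(body)` and handle rejection, which is exactly the asynchrony and failure mode the numbered steps above do not account for.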

So here is my proposal, which is essentially trying for a minimal delta from today:

  • We define "reading all the bytes", probably in whatwg/fetch. It returns either a byte sequence, or null to signal failure. Either it's asynchronous, or we use "wait" (see whatwg/infra#181, Define async algorithms (or "spec promises"???)) to make it synchronous. It wraps up all the promise/Uint8Array stuff so that spec authors don't have to worry about it, hand-waving as appropriate.
  • We define in Encoding that "byte sequences" can be used as "byte streams" implicitly, when appropriate.
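As a sketch of the shape this proposal would give callers (the asynchronous variant), the two pieces might compose like this. The names `readAllBytesOrNull` and `readAndDecode` are made up for illustration. `TextDecoder` stands in for Encoding's "decode" only loosely: it takes a byte sequence directly, but it does not do the BOM sniffing that lets the spec algorithm override the fallback encoding.

```typescript
// Hypothetical wrapper: returns the full byte sequence, or null on failure,
// hiding the promise/Uint8Array details from the caller.
async function readAllBytesOrNull(
  stream: ReadableStream<Uint8Array>
): Promise<Uint8Array | null> {
  const reader = stream.getReader();
  const collected: number[] = [];
  try {
    while (true) {
      const result = await reader.read();
      if (result.done) break;
      for (const byte of result.value) collected.push(byte);
    }
  } catch {
    return null; // a stream error becomes the "failure" signal
  }
  return Uint8Array.from(collected);
}

// Hypothetical caller, loosely mirroring steps 1 and 2 of the script-fetching
// sketch: read everything, then hand the byte sequence straight to a decoder.
async function readAndDecode(
  stream: ReadableStream<Uint8Array>,
  fallbackEncoding: string
): Promise<string | null> {
  const bodyBytes = await readAllBytesOrNull(stream);
  if (bodyBytes === null) return null;
  // Assumption: TextDecoder as a stand-in for the spec-level "decode".
  return new TextDecoder(fallbackEncoding).decode(bodyBytes);
}
```

The caller only ever sees "a byte sequence or null", which is the minimal-delta property the proposal is after.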

An alternate approach, which you might prefer, is to double-down on the spec-level concept of a byte stream. We'd provide some way of translating a ReadableStream, and thus a body, into a spec-level byte stream. It creates a reader, locks the ReadableStream, and then from then on, specs only manipulate the byte stream.

The hardest part of this, I think, is putting the asynchronicity of "reading" from the byte stream on solid ground. It seems very hand-wavy in Encoding, and in most of Encoding's clients, right now. We could either:

  • Say that you can read "synchronously", in that it waits for more data to come in before "read from a byte stream" returns
  • or say that read needs to be asynchronous

This ties back to whatwg/infra#181 again. The problem with synchronous reads is that you can only do them from in-parallel sections, and I think in most cases "decode" does not run in those sections.
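For illustration, the alternate approach with the asynchronous-read option could be pictured roughly as follows. `SpecByteStream` and `END_OF_STREAM` are invented names, and this is only a JavaScript-level analogy for a spec-level concept: the constructor takes a reader (which locks the `ReadableStream`), and from then on consumers only call `read()`.

```typescript
// Invented marker for "no more bytes", loosely analogous to end-of-stream.
const END_OF_STREAM = Symbol("end-of-stream");

// Invented class: a byte-at-a-time view over a ReadableStream.
class SpecByteStream {
  private reader: ReadableStreamDefaultReader<Uint8Array>;
  private buffer: Uint8Array = new Uint8Array(0);
  private pos = 0;

  constructor(stream: ReadableStream<Uint8Array>) {
    // Creating the reader locks the stream; nobody else can read from it.
    this.reader = stream.getReader();
  }

  // Asynchronous read: resolves with the next byte, or END_OF_STREAM.
  async read(): Promise<number | typeof END_OF_STREAM> {
    while (this.pos >= this.buffer.byteLength) {
      const result = await this.reader.read(); // may wait for more data
      if (result.done) return END_OF_STREAM;
      this.buffer = result.value;
      this.pos = 0;
    }
    return this.buffer[this.pos++];
  }
}
```

The "synchronous" option would correspond to `read()` blocking until data arrives instead of returning a promise, which is why it only makes sense in in-parallel sections.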

Indeed, the larger issue where encode/decode often are used on strings/byte sequences, instead of on "character streams"/"byte streams", seems pretty prevalent: see e.g. https://html.spec.whatwg.org/#form-submission-algorithm:encode or https://html.spec.whatwg.org/#navigating-across-documents:utf-8-decode. So maybe we need to do something about that anyway. Those cases wouldn't need to run in parallel.
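The same split exists at the JavaScript level, which may help as an analogy: `TextDecoder` can decode a complete byte sequence in one call, or consume chunks incrementally with `{ stream: true }`. A small sketch, where `decodeChunks` is a hypothetical helper:

```typescript
// "h" followed by "é" (0xC3 0xA9 in UTF-8).
const whole = new Uint8Array([0x68, 0xc3, 0xa9]);

// Byte-sequence style: everything is available, decode in one call.
const oneShot = new TextDecoder("utf-8").decode(whole);

// Byte-stream style: feed chunks as they arrive. The decoder holds on to an
// incomplete multi-byte sequence until the next chunk completes it.
function decodeChunks(chunks: Uint8Array[], encoding: string): string {
  const decoder = new TextDecoder(encoding);
  let out = "";
  for (const chunk of chunks) {
    out += decoder.decode(chunk, { stream: true });
  }
  return out + decoder.decode(); // final call flushes any pending state
}

// Split in the middle of "é" to exercise the buffering.
const streamed = decodeChunks(
  [whole.subarray(0, 2), whole.subarray(2)],
  "utf-8"
);
```

Both paths produce the same string here; the one-shot form is what callers like the form-submission and navigation algorithms effectively want, and it needs no parallelism.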
