[Breaking change request] Change UTF-8 encoder and decoder to match the WHATWG encoding standard #41100

Closed
@askeksa-google

Summary

Change encoding and decoding of UTF-8 to conform to the WHATWG encoding standard. This means that the codec will never emit invalid UTF-8, will accept only valid UTF-8, and will be compatible with the TextEncoder and TextDecoder classes in JavaScript.

Related issues: #7046, #22330, #28832, #31370, #31954

What is changing:

  • When decoding UTF-8 data with the Utf8Codec or Utf8Decoder class, the input is considered malformed if it contains an encoded surrogate character (code point in the range U+D800-U+DFFF, encoded in UTF-8 as a 3-byte character encoding where the first byte is 0xED and the second byte is in the range 0xA0-0xBF).
  • When encoding a string as UTF-8 with the Utf8Codec or Utf8Encoder class, and the string contains an unpaired surrogate, that surrogate is emitted as a replacement character (U+FFFD, encoded in UTF-8 as 0xEF, 0xBF, 0xBD) instead of an encoded surrogate (which is invalid UTF-8). For chunked conversion, if a chunk ends with a high surrogate and the next chunk starts with a low surrogate, these surrogates are considered properly paired and are combined, like before.
  • When decoding malformed UTF-8 data with allowMalformed set to true, the number of replacement characters emitted will sometimes differ from the number currently emitted. Specifically, the decoder will emit one replacement character for each maximal sequence of input bytes that is either
    1. a prefix of a valid encoding of a character, or
    2. a single byte that is not a prefix of a valid encoding of a character.
  • When decoding malformed UTF-8 data with allowMalformed set to false, the offset in the resulting FormatException will point to the first byte from which the decoder can conclude that the sequence is malformed, rather than the first byte that was not decoded successfully. Also, the message of the FormatException will sometimes be different from what it is currently. If the input contains more than one error, the FormatException may point to a different error than before.
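The decoding rules above can be illustrated with a minimal sketch, assuming an SDK that includes this change. The helper names `replacementCount` and `isRejected` are hypothetical and exist only for this example; the byte sequences are chosen to exercise the encoded-surrogate rule and the "maximal prefix" rule. The exact offset and message of the `FormatException` are printed rather than asserted, since the proposal only describes them in general terms.

```dart
import 'dart:convert';

/// Hypothetical helper: decodes [bytes] with allowMalformed set to true and
/// counts the U+FFFD replacement characters in the result.
int replacementCount(List<int> bytes) {
  final decoded = utf8.decode(bytes, allowMalformed: true);
  return decoded.runes.where((rune) => rune == 0xFFFD).length;
}

/// Hypothetical helper: returns true if strict decoding rejects [bytes].
bool isRejected(List<int> bytes) {
  try {
    utf8.decode(bytes);
    return false;
  } on FormatException catch (e) {
    // The offset should point at the first byte that proves the input
    // malformed; it is printed here for inspection only.
    print('rejected at offset ${e.offset}: ${e.message}');
    return true;
  }
}

void main() {
  // U+D800 encoded as 0xED 0xA0 0x80: an encoded surrogate.
  const encodedSurrogate = [0xED, 0xA0, 0x80];

  // With the proposed behavior, strict decoding treats this as malformed.
  print(isRejected(encodedSurrogate));

  // 0xED alone is a prefix of a valid encoding, but 0xED 0xA0 is not, and
  // neither 0xA0 nor 0x80 can start a character, so each of the three bytes
  // yields its own replacement character.
  print(replacementCount(encodedSurrogate));

  // 0xE2 0x82 is a prefix of a valid three-byte encoding (for example
  // U+20AC), so the truncated pair yields one replacement character and the
  // following 'A' (0x41) decodes normally.
  print(replacementCount([0xE2, 0x82, 0x41]));
}
```

Under the rule stated in the list, the two counts are 3 and 1 respectively.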

Why is this changing?

Dart strings (like JS and Java strings) may contain unpaired surrogates. The current strategy of allowing surrogates when encoding and decoding UTF-8 ensures that any Dart string can be encoded as UTF-8 (actually, WTF-8) and decoded back into the original string.

This strategy has a number of drawbacks:

  • The output of the UTF-8 encoder is sometimes not valid UTF-8, which can be problematic when this data needs to be consumed by other programs.
  • When UTF-8 data is read, and that data contains encoded unpaired surrogates, this may cause problems much later, when the string is processed, rather than catching the invalid encoding up front.
  • The Dart behavior deviates from JS, which means that when Dart code is translated to JS, UTF-8 encoding and decoding can't directly use the JS TextEncoder and TextDecoder classes. It must do some or all of the conversion in Dart code, which has a significant performance cost.
  • Retaining exact compatibility with the current error behavior complicates some planned optimizations to UTF-8 decoding in the Dart VM.

The purpose of the change is thus to:

  • ensure that Dart programs don't inadvertently produce or accept invalid UTF-8.
  • enable faster UTF-8 encoding and decoding for both JS and VM targets.

Expected impact

Programs manipulating strings through usual string operations are unlikely to be affected.

A program may be affected by this change if it does any of the following:

  1. Manipulates strings in a way that may introduce unpaired surrogates, encodes these strings as UTF-8, decodes them again and expects the string contents to be preserved.
  2. Encodes arbitrary substrings as UTF-8 without regard to surrogate pairs, decodes them again, concatenates them (before or after decoding) and expects the string contents to be preserved.
  3. Relies on the exact offsets and/or error messages from decoding invalid UTF-8 data or which error is reported in case of multiple errors.
  4. Relies on the number of replacement characters produced by decoding invalid UTF-8 data.
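To make scenario 2 concrete, here is a small sketch of code that would be affected, assuming an SDK that includes this change. The function `roundTripInPieces` is a hypothetical name used only for illustration. It splits a string at an arbitrary UTF-16 index, encodes each piece independently, and decodes the concatenated bytes.

```dart
import 'dart:convert';

/// Hypothetical helper: encodes the two halves of [s], split at UTF-16 code
/// unit index [splitAt], independently, then decodes the concatenated bytes.
String roundTripInPieces(String s, int splitAt) {
  final bytes = [
    ...utf8.encode(s.substring(0, splitAt)),
    ...utf8.encode(s.substring(splitAt)),
  ];
  return utf8.decode(bytes, allowMalformed: true);
}

void main() {
  // U+1F600 is stored as the surrogate pair 0xD83D 0xDE00, so the code
  // units of this string are: 'a', 0xD83D, 0xDE00, 'b'.
  const s = 'a\u{1F600}b';

  // Splitting after the pair keeps it intact, so the string survives.
  print(roundTripInPieces(s, 3) == s);

  // Splitting between the two surrogates leaves an unpaired surrogate in
  // each half. With the proposed behavior each one is encoded as U+FFFD,
  // so the original string is not recovered.
  print(roundTripInPieces(s, 2) == s);
}
```

With the previous behavior the second comparison would have succeeded, because each surrogate was encoded individually and the decoder accepted both.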

Mitigation

For the scenarios listed above:

  1. UTF-8, being an interchange format, is unsuited for representing such broken strings. Consider a different representation.
  2. For encoding in multiple chunks, use the chunked conversion API.
  3. Adjust the program to detect specific errors in a different way, or adapt it to the new errors.
  4. Same as 3.
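As a sketch of mitigation 2, the chunked conversion API in `dart:convert` can be used so that a surrogate pair split across chunk boundaries is still combined. The helper `encodeChunked` is a hypothetical name for this example; `startChunkedConversion` and `ByteConversionSink.withCallback` are the existing library calls it relies on.

```dart
import 'dart:convert';

/// Hypothetical helper: encodes [chunks] as one UTF-8 stream using the
/// chunked conversion API and returns all produced bytes.
List<int> encodeChunked(List<String> chunks) {
  final out = <int>[];
  // The callback receives the accumulated bytes when the sink is closed.
  final byteSink = ByteConversionSink.withCallback(out.addAll);
  final stringSink = utf8.encoder.startChunkedConversion(byteSink);
  for (final chunk in chunks) {
    stringSink.add(chunk);
  }
  stringSink.close();
  return out;
}

void main() {
  const s = 'a\u{1F600}b';
  // Split between the high and the low surrogate of U+1F600.
  final chunks = [s.substring(0, 2), s.substring(2)];

  // The encoder carries the trailing high surrogate over to the next chunk,
  // so the pair is encoded as a single four-byte character.
  final bytes = encodeChunked(chunks);
  print(utf8.decode(bytes) == s);
}
```

The design point is that the encoder, not the caller, tracks state between chunks, so the caller does not need to align chunk boundaries with surrogate pairs.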

Variations

An optional allowSurrogates parameter could be added to the encoder and decoder to support the round-trip use case. To obtain the performance benefits, it should default to false. This could introduce further breakage for programs implementing the Utf8Codec interface (unless we only put the flag on the constructors).

If the surrogate change is considered too risky, the error and replacement character changes on their own can still ease the VM optimizations and possibly improve the performance of JS when allowMalformed is set to true.

Metadata

Labels

  • area-core-library: SDK core library issues (core, async, ...); use area-vm or area-web for platform specific libraries.
  • breaking-change-request: This tracks requests for feedback on breaking changes.
  • library-convert
