[Breaking change request] Change UTF-8 encoder and decoder to match the WHATWG encoding standard #41100

# Summary

Change encoding and decoding of UTF-8 to conform to the [WHATWG encoding standard](https://encoding.spec.whatwg.org/#utf-8-decoder). This means that the encoder will never emit invalid UTF-8, the decoder will accept only valid UTF-8, and both will be compatible with the `TextEncoder` and `TextDecoder` classes in JavaScript.

Related issues: https://github.com/dart-lang/sdk/issues/7046, https://github.com/dart-lang/sdk/issues/22330, https://github.com/dart-lang/sdk/issues/28832, https://github.com/dart-lang/sdk/issues/31370, https://github.com/dart-lang/sdk/issues/31954

# What is changing:

- When decoding UTF-8 data with the [`Utf8Codec`](https://api.dart.dev/stable/2.7.1/dart-convert/Utf8Codec-class.html) or [`Utf8Decoder`](https://api.dart.dev/stable/2.7.1/dart-convert/Utf8Decoder-class.html) class, the input is considered malformed if it contains an encoded surrogate character (code point in the range `U+D800`-`U+DFFF`, encoded in UTF-8 as a 3-byte character encoding where the first byte is `0xED` and the second byte is in the range `0xA0`-`0xBF`).
- When encoding a string as UTF-8 with the [`Utf8Codec`](https://api.dart.dev/stable/2.7.1/dart-convert/Utf8Codec-class.html) or [`Utf8Encoder`](https://api.dart.dev/stable/2.7.1/dart-convert/Utf8Encoder-class.html) class, and the string contains an unpaired surrogate, that surrogate is emitted as a *replacement character* (`U+FFFD`, encoded in UTF-8 as `0xEF`, `0xBF`, `0xBD`) instead of an encoded surrogate (which is invalid UTF-8). For chunked conversion, if a chunk ends with a high surrogate and the next chunk starts with a low surrogate, these surrogates are considered properly paired and are combined, like before.
- When decoding malformed UTF-8 data with `allowMalformed` set to `true`, the number of replacement characters emitted will sometimes differ from the number currently emitted. Specifically, the decoder will emit one replacement character for each maximal sequence of input bytes that is either
  1. a prefix of a valid encoding of a character, or
  2. a single byte that is not a prefix of a valid encoding of a character.
- When decoding malformed UTF-8 data with `allowMalformed` set to `false`, the `offset` in the resulting `FormatException` will point to the first byte from which the decoder can conclude that the sequence is malformed, rather than the first byte that was not decoded successfully. Also, the `message` of the `FormatException` will sometimes be different from what it is currently. If the input contains more than one error, the `FormatException` may point to a different error than before.
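
The changes listed above can be illustrated with a minimal sketch, assuming an SDK in which the proposed behavior is in effect. The helper names (`encodeLoneSurrogate`, `decodeLenient`, `describeStrictDecode`) are made up for this example and are not part of `dart:convert`.

```dart
import 'dart:convert';

/// Encodes a string consisting of one unpaired high surrogate (U+D800).
List<int> encodeLoneSurrogate() => utf8.encode(String.fromCharCode(0xD800));

/// Decodes [bytes] leniently, substituting U+FFFD for malformed input.
String decodeLenient(List<int> bytes) =>
    utf8.decode(bytes, allowMalformed: true);

/// Decodes [bytes] strictly and reports whether the input was accepted.
String describeStrictDecode(List<int> bytes) {
  try {
    return 'accepted: ${utf8.decode(bytes)}';
  } on FormatException catch (e) {
    return 'rejected: ${e.message} (offset ${e.offset})';
  }
}

void main() {
  // Assumption: proposed behavior. The unpaired surrogate is emitted as
  // U+FFFD, which is EF BF BD in UTF-8.
  print(encodeLoneSurrogate()); // [239, 191, 189]

  // ED A0 80 is what an encoded U+D800 would look like. Strict decoding
  // treats it as malformed; the message and offset are printed as reported.
  print(describeStrictDecode([0xED, 0xA0, 0x80]));

  // F0 9F 98 is a prefix of a valid 4-byte encoding, cut short by 'A':
  // the whole prefix collapses into a single U+FFFD.
  print(decodeLenient([0xF0, 0x9F, 0x98, 0x41]).runes.toList()); // [65533, 65]

  // FF is never a prefix of a valid encoding: one U+FFFD per byte.
  print(decodeLenient([0xFF, 0xFF]).runes.toList()); // [65533, 65533]
}
```

The strict-decoding line deliberately prints whatever message and offset the decoder reports, since those are exactly the details this request says may differ from earlier releases.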

# Why is this changing?

Dart strings (like JS and Java strings) may contain unpaired surrogates. The current strategy of allowing surrogates when encoding and decoding UTF-8 ensures that any Dart string can be encoded as UTF-8 (actually, [WTF-8](https://simonsapin.github.io/wtf-8/)) and decoded back into the original string.

This strategy has a number of drawbacks:
- The output of the UTF-8 encoder is sometimes not valid UTF-8, which can be problematic when this data needs to be consumed by other programs.
- When UTF-8 data is read, and that data contains encoded unpaired surrogates, this may cause problems much later, when the string is processed, rather than catching the invalid encoding up front.
- The Dart behavior deviates from JS, which means that when Dart code is translated to JS, UTF-8 encoding and decoding can't directly use the JS `TextEncoder` and `TextDecoder` classes. It must do some or all of the conversion in Dart code, which has a significant performance cost.
- Retaining exact compatibility with the current error behavior complicates some planned optimizations to UTF-8 decoding in the Dart VM.

The purpose of the change is thus to:
- ensure that Dart programs don't inadvertently produce or accept invalid UTF-8.
- enable faster UTF-8 encoding and decoding for both JS and VM targets.

# Expected impact

Programs manipulating strings through usual string operations are unlikely to be affected.

A program may be affected by this change if it does any of the following:
1. Manipulates strings in a way that may introduce unpaired surrogates, encodes these strings as UTF-8, decodes them again and expects the string contents to be preserved.
2. Encodes arbitrary substrings as UTF-8 without regard to surrogate pairs, decodes them again, concatenates them (before or after decoding) and expects the string contents to be preserved.
3. Relies on the exact offsets or error messages produced when decoding invalid UTF-8 data, or on which error is reported when the input contains multiple errors.
4. Relies on the number of replacement characters produced by decoding invalid UTF-8 data.
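
As an illustration of scenario 2, here is a minimal sketch of a split that lands inside a surrogate pair. It assumes the proposed encoder behavior is in effect; `splitEncodeJoinDecode` is a hypothetical helper written for this example.

```dart
import 'dart:convert';

/// Encodes the two halves of [s], split at code-unit index [at], separately,
/// concatenates the resulting bytes, and decodes them as one UTF-8 sequence.
String splitEncodeJoinDecode(String s, int at) {
  final bytes = <int>[
    ...utf8.encode(s.substring(0, at)),
    ...utf8.encode(s.substring(at)),
  ];
  return utf8.decode(bytes);
}

void main() {
  // U+1F600 is stored in the string as the surrogate pair D83D DE00,
  // so the code units are: 'a', D83D, DE00, 'b'.
  final s = 'a\u{1F600}b';

  // Splitting between characters preserves the contents.
  print(splitEncodeJoinDecode(s, 1) == s); // true

  // Splitting inside the pair leaves an unpaired surrogate in each half.
  // Assumption: under the proposed behavior each becomes U+FFFD.
  print(splitEncodeJoinDecode(s, 2) == s); // false
  print(splitEncodeJoinDecode(s, 2).runes.toList()); // [97, 65533, 65533, 98]
}
```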

# Mitigation
For the scenarios listed above:
1. UTF-8, being an interchange format, is unsuited for representing such broken strings. Consider a different representation.
2. For encoding in multiple chunks, use the [chunked conversion](https://api.dart.dev/stable/2.7.1/dart-convert/Utf8Encoder/startChunkedConversion.html) API.
3. Adjust the program to detect specific errors in a different way, or adapt it to the new errors.
4. Same as 3.
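
The first two mitigations could look roughly like the sketch below. `encodeInChunks`, `toCodeUnits` and `fromCodeUnits` are hypothetical helpers for this example, and storing raw UTF-16 code units is only one possible "different representation", chosen here as an assumption.

```dart
import 'dart:convert';

/// Mitigation 2: encodes [chunks] through one chunked conversion, so a
/// surrogate pair split across a chunk boundary is still combined.
List<int> encodeInChunks(List<String> chunks) {
  final out = <int>[];
  // The callback receives the accumulated bytes when the sink is closed.
  final byteSink =
      ByteConversionSink.withCallback((bytes) => out.addAll(bytes));
  final stringSink = utf8.encoder.startChunkedConversion(byteSink);
  for (final chunk in chunks) {
    stringSink.add(chunk);
  }
  stringSink.close();
  return out;
}

/// Mitigation 1 (one possible representation): keep the raw UTF-16 code
/// units, which preserves unpaired surrogates exactly.
List<int> toCodeUnits(String s) => List<int>.of(s.codeUnits);
String fromCodeUnits(List<int> units) => String.fromCharCodes(units);

void main() {
  // 'a' + high surrogate D83D, then low surrogate DE00 + 'b'.
  final first = String.fromCharCodes([0x61, 0xD83D]);
  final second = String.fromCharCodes([0xDE00, 0x62]);

  final chunked = encodeInChunks([first, second]);
  print(chunked); // [97, 240, 159, 152, 128, 98]
  print(chunked.join(',') == utf8.encode('a\u{1F600}b').join(',')); // true

  // A string with an unpaired surrogate survives the code-unit round trip.
  final broken = String.fromCharCodes([0x78, 0xD800, 0x79]);
  print(fromCodeUnits(toCodeUnits(broken)) == broken); // true
}
```

The chunked sink carries a trailing high surrogate over to the next chunk, which is why the split pair comes out as the single 4-byte encoding of U+1F600 rather than as replacement characters.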

# Variations
An optional `allowSurrogates` parameter could be added to the encoder and decoder to support the round-trip use case. To obtain the performance benefits, it should default to `false`. This could introduce further breakage for programs implementing the `Utf8Codec` interface (unless we only put the flag on the constructors).

If the surrogate change is considered too risky, the error and replacement character changes on their own can still ease the VM optimizations and possibly improve the performance of JS when `allowMalformed` is set to `true`.
