Description
Summary
Change encoding and decoding of UTF-8 to conform to the WHATWG encoding standard. This means that it will never emit invalid UTF-8, only accept valid UTF-8 and be compatible with the `TextEncoder` and `TextDecoder` classes in JavaScript.
Related issues: #7046, #22330, #28832, #31370, #31954
What is changing:
- When decoding UTF-8 data with the `Utf8Codec` or `Utf8Decoder` class, the input is considered malformed if it contains an encoded surrogate character (code point in the range `U+D800`-`U+DFFF`, encoded in UTF-8 as a 3-byte character encoding where the first byte is `0xED` and the second byte is in the range `0xA0`-`0xBF`).
- When encoding a string as UTF-8 with the `Utf8Codec` or `Utf8Encoder` class, and the string contains an unpaired surrogate, that surrogate is emitted as a replacement character (`U+FFFD`, encoded in UTF-8 as `0xEF`, `0xBF`, `0xBD`) instead of an encoded surrogate (which is invalid UTF-8). For chunked conversion, if a chunk ends with a high surrogate and the next chunk starts with a low surrogate, these surrogates are considered properly paired and are combined, like before.
- When decoding malformed UTF-8 data with `allowMalformed` set to `true`, the number of replacement characters emitted will sometimes differ from the number currently emitted. Specifically, the decoder will emit one replacement character for each maximal sequence of input bytes that is either
  - a prefix of a valid encoding of a character, or
  - a single byte that is not a prefix of a valid encoding of a character.
- When decoding malformed UTF-8 data with `allowMalformed` set to `false`, the `offset` in the resulting `FormatException` will point to the first byte from which the decoder can conclude that the sequence is malformed, rather than the first byte that was not decoded successfully. Also, the `message` of the `FormatException` will sometimes be different from what it is currently. If the input contains more than one error, the `FormatException` may point to a different error than before.
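The following is a minimal sketch of the first three points, not part of the proposal itself. The helpers `hex` and `isRejected` are hypothetical names introduced for illustration, and the outputs noted in the comments are what the rules above imply; an SDK that predates this change would print different results.

```dart
import 'dart:convert';

/// Formats bytes as space-separated lowercase hex (illustrative helper).
String hex(List<int> bytes) =>
    bytes.map((b) => b.toRadixString(16).padLeft(2, '0')).join(' ');

/// Returns true if strict decoding (allowMalformed: false) rejects [bytes].
bool isRejected(List<int> bytes) {
  try {
    utf8.decode(bytes);
    return false;
  } on FormatException {
    return true;
  }
}

void main() {
  // Encoding: an unpaired surrogate is emitted as U+FFFD.
  final lone = String.fromCharCode(0xD800);
  print(hex(utf8.encode(lone))); // ef bf bd

  // Decoding: an encoded surrogate (ED A0 80) is treated as malformed.
  print(isRejected([0xED, 0xA0, 0x80])); // true

  // Lenient decoding: E2 82 is a prefix of a valid encoding (E2 82 AC),
  // so the two bytes together yield a single replacement character.
  final truncated = utf8.decode([0xE2, 0x82, 0x41], allowMalformed: true);
  print(truncated.codeUnits.map((u) => u.toRadixString(16)).join(' '));
  // fffd 41

  // Lenient decoding: ED alone is a valid prefix, but ED A0 is not, and
  // A0 and 80 are not valid lead bytes, so each byte is replaced separately.
  final surrogate = utf8.decode([0xED, 0xA0, 0x80], allowMalformed: true);
  print(surrogate.codeUnits.length); // 3
}
```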
Why is this changing?
Dart strings (like JS and Java strings) may contain unpaired surrogates. The current strategy of allowing surrogates when encoding and decoding UTF-8 ensures that any Dart string can be encoded as UTF-8 (actually, WTF-8) and decoded back into the original string.
This strategy has a number of drawbacks:
- The output of the UTF-8 encoder is sometimes not valid UTF-8, which can be problematic when this data needs to be consumed by other programs.
- When UTF-8 data is read, and that data contains encoded unpaired surrogates, this may cause problems much later, when the string is processed, rather than catching the invalid encoding up front.
- The Dart behavior deviates from JS, which means that when Dart code is translated to JS, UTF-8 encoding and decoding can't directly use the JS `TextEncoder` and `TextDecoder` classes. It must do some or all of the conversion in Dart code, which has a significant performance cost.
- Retaining exact compatibility with the current error behavior complicates some planned optimizations to UTF-8 decoding in the Dart VM.
The purpose of the change is thus to:
- ensure that Dart programs don't inadvertently produce or accept invalid UTF-8.
- enable faster UTF-8 encoding and decoding for both JS and VM targets.
Expected impact
Programs manipulating strings through usual string operations are unlikely to be affected.
A program may be affected by this change if it does any of the following:
1. Manipulates strings in a way that may introduce unpaired surrogates, encodes these strings as UTF-8, decodes them again and expects the string contents to be preserved.
2. Encodes arbitrary substrings as UTF-8 without regard to surrogate pairs, decodes them again, concatenates them (before or after decoding) and expects the string contents to be preserved.
3. Relies on the exact offsets and/or error messages from decoding invalid UTF-8 data, or on which error is reported when the input contains multiple errors.
4. Relies on the number of replacement characters produced by decoding invalid UTF-8 data.
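To check whether a program falls under the first two scenarios, it can help to test strings for unpaired surrogates before encoding them. The sketch below assumes a hypothetical helper, `hasUnpairedSurrogate`, which is not part of `dart:convert`; it scans the UTF-16 code units directly.

```dart
/// Returns true if [s] contains a high surrogate that is not followed by a
/// low surrogate, or a low surrogate that is not preceded by a high one.
/// Hypothetical helper for illustration only.
bool hasUnpairedSurrogate(String s) {
  for (var i = 0; i < s.length; i++) {
    final unit = s.codeUnitAt(i);
    if (unit >= 0xD800 && unit <= 0xDBFF) {
      // High surrogate: the next code unit must be a low surrogate.
      if (i + 1 < s.length) {
        final next = s.codeUnitAt(i + 1);
        if (next >= 0xDC00 && next <= 0xDFFF) {
          i++; // Skip the low half of a well-formed pair.
          continue;
        }
      }
      return true;
    }
    if (unit >= 0xDC00 && unit <= 0xDFFF) {
      // Low surrogate with no preceding high surrogate.
      return true;
    }
  }
  return false;
}

void main() {
  final emoji = '\u{1F600}'; // Stored as two UTF-16 code units.
  print(hasUnpairedSurrogate(emoji)); // false
  // Slicing by code unit index splits the pair.
  print(hasUnpairedSurrogate(emoji.substring(0, 1))); // true
  print(hasUnpairedSurrogate(emoji.substring(1))); // true
}
```

A string for which this returns `true` would not survive a UTF-8 round trip under the proposed behavior.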
Mitigation
For the scenarios listed above:
1. UTF-8, being an interchange format, is unsuited for representing such broken strings. Consider a different representation.
2. For encoding in multiple chunks, use the chunked conversion API.
3. Adjust the program to detect specific errors in a different way, or adapt it to the new errors.
4. Same as 3.
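As a rough sketch of the chunked-conversion mitigation, the example below compares encoding each chunk separately with feeding the same chunks through `utf8.encoder.startChunkedConversion`. The function names `encodeSeparately` and `encodeChunked` and the helper `hex` are hypothetical, and the outputs in the comments assume the proposed behavior.

```dart
import 'dart:convert';

/// Formats bytes as space-separated lowercase hex (illustrative helper).
String hex(List<int> bytes) =>
    bytes.map((b) => b.toRadixString(16).padLeft(2, '0')).join(' ');

/// Encodes each chunk on its own, so a surrogate pair split across two
/// chunks is seen as two unpaired surrogates.
List<int> encodeSeparately(Iterable<String> chunks) =>
    [for (final chunk in chunks) ...utf8.encode(chunk)];

/// Encodes the chunks through one chunked conversion, which carries a
/// trailing high surrogate over to the next chunk.
List<int> encodeChunked(Iterable<String> chunks) {
  late List<int> result;
  final byteSink = ByteConversionSink.withCallback((bytes) => result = bytes);
  final stringSink = utf8.encoder.startChunkedConversion(byteSink);
  for (final chunk in chunks) {
    stringSink.add(chunk);
  }
  stringSink.close(); // Flushes pending state and invokes the callback.
  return result;
}

void main() {
  // U+1F600 split between its high and low surrogate.
  final chunks = [String.fromCharCode(0xD83D), String.fromCharCode(0xDE00)];
  print(hex(encodeSeparately(chunks))); // ef bf bd ef bf bd
  print(hex(encodeChunked(chunks))); // f0 9f 98 80
}
```

The chunked variant preserves the original character because the encoder sees both halves of the pair within one conversion.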
Variations
An optional `allowSurrogates` parameter could be added to the encoder and decoder to support the round-trip use case. To obtain the performance benefits, it should default to `false`. This could introduce further breakage for programs implementing the `Utf8Codec` interface (unless we only put the flag on the constructors).
If the surrogate change is considered too risky, the error and replacement character changes on their own can still ease the VM optimizations and possibly improve the performance of JS when `allowMalformed` is set to `true`.