string literals: ASCIIString, UTF8String #4

Closed
@StefanKarpinski

See the discussion here. The salient conclusion is this:

Escapes continue to work the way they do now: \x always inserts a single byte and \u always inserts the sequence of bytes encoding a Unicode character. Literals are turned into String objects according to the following simple check:

  • ASCIIString if all bytes are < 0x80;
  • UTF8String if any bytes are ≥ 0x80.
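The check above can be sketched in a few lines. This is an illustrative Python model only, not Julia's implementation; the function name `classify_literal` is hypothetical, and the type names are returned as plain strings.

```python
def classify_literal(data: bytes) -> str:
    """Return the name of the string type the rule above would pick."""
    # ASCIIString if every byte is below 0x80, otherwise UTF8String.
    # The bytes are not checked for UTF-8 validity, mirroring the rule.
    if all(b < 0x80 for b in data):
        return "ASCIIString"
    return "UTF8String"


# A literal containing only ASCII characters.
plain = "hello".encode("utf-8")

# A \u escape contributes the UTF-8 encoding of its code point,
# which for U+00E9 is two bytes, both >= 0x80.
accented = "h\u00e9llo".encode("utf-8")

# A \x escape contributes exactly one raw byte.
raw = b"\xff"

print(classify_literal(plain))
print(classify_literal(accented))
print(classify_literal(raw))
```

Note that the rule only inspects byte values, so a lone `\xff` is classified as UTF8String even though it is not valid UTF-8.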

If you want to use \x escapes with values at or above 0x80 to generate invalid UTF-8, that's your business. We can also introduce a Latin1"..." form that uses the Latin-1 encoding to store code points up to U+FF in an efficient one-byte-per-character form. Finally, the b"..." macro-defined string form can let you use characters and escapes (both \x and \u) to generate byte arrays.
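To make the difference between the two escapes concrete, here is a small Python sketch of the byte-level behavior described above. It models the encodings with Python's own `bytes` and `str.encode`, which is an assumption made for illustration; it does not show how Julia parses literals.

```python
# \x inserts a single raw byte.
x_escape = b"\xe9"

# \u inserts the UTF-8 encoding of the code point (two bytes for U+00E9).
u_escape = "\u00e9".encode("utf-8")

# Latin-1 stores every code point up to U+FF in exactly one byte.
latin1 = "\u00e9".encode("latin-1")


def is_valid_utf8(data: bytes) -> bool:
    """Report whether the bytes decode cleanly as UTF-8."""
    try:
        data.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False


print(len(x_escape), len(u_escape), len(latin1))
print(is_valid_utf8(u_escape))
# A lone 0xE9 byte starts a multi-byte sequence that never completes,
# so on its own it is not valid UTF-8.
print(is_valid_utf8(x_escape))
```

The same byte value `0xE9` is therefore a complete character in Latin-1 but an invalid fragment in UTF-8, which is why \x escapes at or above 0x80 can produce invalid UTF-8.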

We can safely and quickly concatenate ASCIIStrings with each other, with UTF8Strings, or with Latin1Strings, because ASCII bytes mean the same thing in all three encodings. Mixing UTF8Strings and Latin1Strings, however, requires transcoding the Latin1Strings to UTF-8. This will not occur with string literals, since they will always be ASCIIStrings or UTF8Strings.
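The concatenation rules can be illustrated with raw bytes. The following is a hedged Python sketch, assuming each string type is simply a byte buffer in the named encoding; the helper `latin1_to_utf8` is hypothetical and stands in for whatever transcoding step an implementation would use.

```python
def latin1_to_utf8(data: bytes) -> bytes:
    """Transcode Latin-1 bytes to UTF-8 (each byte maps to one code point)."""
    return data.decode("latin-1").encode("utf-8")


ascii_part = b"caf"                      # all bytes < 0x80
latin1_part = "\u00e9".encode("latin-1")  # one byte: 0xE9
utf8_part = "\u00e9".encode("utf-8")      # two bytes: 0xC3 0xA9

# ASCII + Latin-1: plain byte concatenation is still valid Latin-1.
ascii_latin1 = ascii_part + latin1_part

# ASCII + UTF-8: plain byte concatenation is still valid UTF-8.
ascii_utf8 = ascii_part + utf8_part

# UTF-8 + Latin-1: the Latin-1 side must be transcoded first,
# otherwise the result would contain a stray 0xE9 byte.
mixed = utf8_part + latin1_to_utf8(latin1_part)

print(ascii_latin1.decode("latin-1") == "caf\u00e9")
print(ascii_utf8.decode("utf-8") == "caf\u00e9")
print(mixed.decode("utf-8") == "\u00e9\u00e9")
```

Only the mixed case does any per-character work; the two ASCII cases are straight buffer appends.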
