
allow full unicode range #687

Closed
@andimarek

These are my proposed changes to the spec to allow the full Unicode range. Currently it is restricted to BMP code points; see SourceCharacter.

1. Change SourceCharacter to also allow code points above U+FFFF, up to U+10FFFF (outside of the BMP):

SourceCharacter ::
  - "U+0009"
  - "U+000A"
  - "U+000D"
  - "U+0020–U+10FFFF"
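
To make the proposed range concrete, here is a minimal sketch of a membership check. The function name `is_source_character` is hypothetical and not taken from any implementation; the sketch follows the grammar above literally.

```python
def is_source_character(code_point: int) -> bool:
    """Return True if code_point falls in the proposed SourceCharacter range."""
    # Tab, line feed, and carriage return are the only code points
    # below U+0020 that the grammar permits.
    if code_point in (0x0009, 0x000A, 0x000D):
        return True
    # Everything from space up to the last Unicode code point.
    return 0x0020 <= code_point <= 0x10FFFF
```

With this check, `is_source_character(0x1F37A)` is true under the proposal, while U+0000 and U+001F are still rejected.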

This does not cover all Unicode code points: most control characters are still not allowed. This is the same behavior as now and I don't see a reason to change it. The only places where control characters could be allowed are inside comments or String literals. Inside Strings you can escape them, and inside comments they don't really make sense or can easily be worked around.

Allowing control characters would also put an additional burden on systems processing GraphQL documents. Most importantly, JSON also requires control characters to be escaped (https://tools.ietf.org/html/rfc8259#section-7).

2. Allow surrogate pair escapes in standard quoted strings:

Currently standard quoted strings allow BMP code points to be escaped (via \u<4-digit-hex-value>). To align this with the SourceCharacter change above, the spec should also allow code points outside of the BMP to be escaped. Surrogate pairs are the most direct way to allow for that.

For example, the Unicode code point U+1F37A ( 🍺 ), which is outside of the BMP, can be escaped as \ud83c\udf7a.
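
The arithmetic behind that escape can be sketched as follows. This is an illustration of the standard UTF-16 surrogate pair computation; the helper name `to_surrogate_escape` is made up for this example.

```python
def to_surrogate_escape(code_point: int) -> str:
    """Format a non-BMP code point as a \\uXXXX\\uXXXX surrogate pair escape."""
    if not 0x10000 <= code_point <= 0x10FFFF:
        raise ValueError("code point is not outside the BMP")
    offset = code_point - 0x10000           # 20-bit value
    high = 0xD800 + (offset >> 10)          # leading surrogate: top 10 bits
    low = 0xDC00 + (offset & 0x3FF)         # trailing surrogate: bottom 10 bits
    return f"\\u{high:04x}\\u{low:04x}"


print(to_surrogate_escape(0x1F37A))  # \ud83c\udf7a
```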

There are other escape sequences used for code points outside of the BMP. For example, JS and others allow \u{1F37A}, but this would introduce a new syntax. I argue that surrogate pairs are the most compatible and simplest option. JSON, for example, understands surrogate pairs but not \u{1F37A}.
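
As a quick check of the JSON compatibility claim, the sketch below uses Python's standard `json` module as one example JSON parser. The helper `json_accepts` is a hypothetical name introduced only for this illustration.

```python
import json


def json_accepts(text: str) -> bool:
    """Return True if the standard-library JSON parser accepts the text."""
    try:
        json.loads(text)
        return True
    except json.JSONDecodeError:
        return False


# The surrogate pair escape is valid JSON and decodes to a single code point.
assert json_accepts('"\\ud83c\\udf7a"')
assert json.loads('"\\ud83c\\udf7a"') == "\U0001F37A"

# The braced form is not part of the JSON string grammar.
assert not json_accepts('"\\u{1F37A}"')
```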

One small open question is how illegal surrogate pairs should be handled.
For example, \ud83c\u0020 (a leading surrogate not followed by a trailing one) and \uDEAD (a lone trailing surrogate) are both illegal.

The JS spec says:

A code unit that is a leading surrogate or trailing surrogate, but is not part of a surrogate pair, is interpreted as a code point with the same value.

The JSON spec notes:

However, the ABNF in this specification allows member names and
string values to contain bit sequences that cannot encode Unicode
characters; for example, "\uDEAD" (a single unpaired UTF-16
surrogate). Instances of this have been observed, for example, when
a library truncates a UTF-16 string without checking whether the
truncation split a surrogate pair. The behavior of software that
receives JSON texts containing such values is unpredictable; for
example, implementations might return different values for the length
of a string value or even suffer fatal runtime exceptions.

I would recommend adding a section stating that servers should try to reject illegal surrogate pairs where possible, in order to avoid unexpected behavior.
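
A minimal sketch of what such a rejection could look like when decoding escapes. This is not the graphql-js implementation; `decode_unicode_escapes` is a hypothetical helper, and as a simplification it handles only \uXXXX escapes and ignores every other escape sequence.

```python
import re

# Matches one \uXXXX escape (exactly four hex digits).
_ESCAPE = re.compile(r"\\u([0-9a-fA-F]{4})")


def decode_unicode_escapes(text: str) -> str:
    """Decode \\uXXXX escapes, raising ValueError on unpaired surrogates."""
    out = []
    pos = 0
    while pos < len(text):
        match = _ESCAPE.match(text, pos)
        if match is None:
            # Not an escape: copy the character through unchanged.
            out.append(text[pos])
            pos += 1
            continue
        unit = int(match.group(1), 16)
        pos = match.end()
        if 0xDC00 <= unit <= 0xDFFF:
            # A trailing surrogate with no leading surrogate before it.
            raise ValueError(f"unpaired trailing surrogate \\u{unit:04X}")
        if 0xD800 <= unit <= 0xDBFF:
            # A leading surrogate must be followed directly by a trailing one.
            trail = _ESCAPE.match(text, pos)
            low = int(trail.group(1), 16) if trail else None
            if low is None or not 0xDC00 <= low <= 0xDFFF:
                raise ValueError(f"unpaired leading surrogate \\u{unit:04X}")
            pos = trail.end()
            unit = 0x10000 + ((unit - 0xD800) << 10) + (low - 0xDC00)
        out.append(chr(unit))
    return "".join(out)
```

Under this sketch, `\ud83c\udf7a` decodes to U+1F37A, while both `\ud83c\u0020` and `\uDEAD` raise an error instead of producing a string that contains a lone surrogate.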

Previous discussion

Previous Issue about it: #214
Previous PR which was not merged: #231

Please comment, leave feedback.

The JS PR for this change is here: graphql/graphql-js#2449
