Commit 383251f

RFC: Allow full unicode range
This spec text implements #687 (full context and details there) and also introduces a new escape sequence. Three distinct changes:

1. Change SourceCharacter to allow code points above 0xFFFF, up to 0x10FFFF.
2. Allow surrogate pairs within StringValue. Illegal pairs are handled with a parse error.
3. Introduce new syntax for a full-range code point EscapedUnicode. This syntax (`\u{1F37A}`) has been adopted by many other languages and I propose GraphQL adopt it as well.

(As a bonus, this removes the last instance of a regex in the lexer grammar!)
1 parent 558f477 commit 383251f

2 files changed, with 74 additions and 44 deletions

spec/Appendix B -- Grammar Summary.md

Lines changed: 12 additions & 5 deletions
@@ -97,13 +97,20 @@ StringValue ::
 - `"""` BlockStringCharacter* `"""`

 StringCharacter ::
-- SourceCharacter but not `"` or \ or LineTerminator
-- \u EscapedUnicode
-- \ EscapedCharacter
+- SourceCharacter but not `"` or `\` or LineTerminator
+- `\u` EscapedUnicode
+- `\` EscapedCharacter

-EscapedUnicode :: /[0-9A-Fa-f]{4}/
+EscapedUnicode ::
+- HexDigit HexDigit HexDigit HexDigit
+- `{` HexDigit+ `}` "but only if <= 0x10FFFF"

-EscapedCharacter :: one of `"` \ `/` b f n r t
+HexDigit :: one of
+- `0` `1` `2` `3` `4` `5` `6` `7` `8` `9`
+- `A` `B` `C` `D` `E` `F`
+- `a` `b` `c` `d` `e` `f`
+
+EscapedCharacter :: one of `"` `\` `/` `b` `f` `n` `r` `t`

 BlockStringCharacter ::
 - SourceCharacter but not `"""` or `\"""`
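
The two EscapedUnicode alternatives in this hunk can be matched without a regex. The following is a minimal sketch of one way a lexer might do that, not the reference implementation. The names `HEX_DIGITS` and `read_escaped_unicode` are hypothetical, and raising `ValueError` stands in for whatever syntax error a real lexer would report.

```python
from __future__ import annotations

# HexDigit :: one of 0-9, A-F, a-f (hypothetical helper name).
HEX_DIGITS = frozenset("0123456789ABCDEFabcdef")


def read_escaped_unicode(source: str, pos: int) -> tuple[int, int]:
    # pos is assumed to point just past the \u prefix.
    # Returns (code point value, index just past the escape).
    if pos < len(source) and source[pos] == "{":
        # Braced form: `{` HexDigit+ `}` but only if <= 0x10FFFF.
        end = pos + 1
        while end < len(source) and source[end] in HEX_DIGITS:
            end += 1
        digits = source[pos + 1:end]
        if not digits or end >= len(source) or source[end] != "}":
            raise ValueError("malformed braced unicode escape")
        value = int(digits, 16)
        if value > 0x10FFFF:
            raise ValueError("unicode escape is above 0x10FFFF")
        return value, end + 1

    # Fixed form: exactly four HexDigit.
    digits = source[pos:pos + 4]
    if len(digits) != 4 or any(ch not in HEX_DIGITS for ch in digits):
        raise ValueError("malformed four-digit unicode escape")
    return int(digits, 16), pos + 4
```

Under these assumptions, `read_escaped_unicode("{1F37A}", 0)` consumes the whole braced escape, while `read_escaped_unicode("1F37A", 0)` consumes only the first four digits and leaves `A` as an ordinary character.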

spec/Section 2 -- Language.md

Lines changed: 62 additions & 39 deletions
@@ -50,23 +50,24 @@ SourceCharacter ::
 - "U+0009"
 - "U+000A"
 - "U+000D"
-- "U+0020–U+FFFF"
+- "U+0020–U+10FFFF"
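
As an illustration of the widened range, the SourceCharacter production after this change can be sketched as a simple predicate over code points. The function names `is_source_character` and `first_invalid_index` are hypothetical, and the sketch assumes the document is already decoded into a Python `str`, which is indexed by code point.

```python
from __future__ import annotations


def is_source_character(code_point: int) -> bool:
    # Tab, line feed, carriage return, or anything in U+0020 to U+10FFFF.
    return code_point in (0x0009, 0x000A, 0x000D) or 0x0020 <= code_point <= 0x10FFFF


def first_invalid_index(document: str) -> int | None:
    # Index of the first code point that is not a SourceCharacter, or None.
    for index, char in enumerate(document):
        if not is_source_character(ord(char)):
            return index
    return None
```

Before this change, the same predicate would have stopped at `0xFFFF`, so a code point such as U+1F37A would have been rejected anywhere in the source text.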

 GraphQL documents are expressed as a sequence of
-[Unicode](https://unicode.org/standard/standard.html) characters. However, with
+[Unicode](https://unicode.org/standard/standard.html) code points (informally
+referred to as *"characters"* through most of this specification). However, with
 few exceptions, most of GraphQL is expressed only in the original non-control
 ASCII range so as to be as widely compatible with as many existing tools,
 languages, and serialization formats as possible and avoid display issues in
 text editors and source control.

+Note: Non-ASCII Unicode code points may freely appear within {StringValue} and
+{Comment} tokens.
+

 ### Unicode

 UnicodeBOM :: "Byte Order Mark (U+FEFF)"

-Non-ASCII Unicode characters may freely appear within {StringValue} and
-{Comment} portions of GraphQL.
-
 The "Byte Order Mark" is a special Unicode character which
 may appear at the beginning of a file containing Unicode which programs may use
 to determine the fact that the text stream is Unicode, what endianness the text
@@ -804,13 +805,20 @@ StringValue ::
 - `"""` BlockStringCharacter* `"""`

 StringCharacter ::
-- SourceCharacter but not `"` or \ or LineTerminator
-- \u EscapedUnicode
-- \ EscapedCharacter
+- SourceCharacter but not `"` or `\` or LineTerminator
+- `\u` EscapedUnicode
+- `\` EscapedCharacter
+
+EscapedUnicode ::
+- HexDigit HexDigit HexDigit HexDigit
+- `{` HexDigit+ `}` "but only if <= 0x10FFFF"

-EscapedUnicode :: /[0-9A-Fa-f]{4}/
+HexDigit :: one of
+- `0` `1` `2` `3` `4` `5` `6` `7` `8` `9`
+- `A` `B` `C` `D` `E` `F`
+- `a` `b` `c` `d` `e` `f`

-EscapedCharacter :: one of `"` \ `/` b f n r t
+EscapedCharacter :: one of `"` `\` `/` `b` `f` `n` `r` `t`

 BlockStringCharacter ::
 - SourceCharacter but not `"""` or `\"""`
@@ -825,9 +833,9 @@ be interpreted as the beginning of a block string. As an example, the source
 {`""""""`} can only be interpreted as a single empty block string and not three
 empty strings.

-Non-ASCII Unicode characters are allowed within single-quoted strings.
-Since {SourceCharacter} must not contain some ASCII control characters, escape
-sequences must be used to represent these characters. The {`\`}, {`"`}
+Non-ASCII Unicode characters are allowed within single-quoted strings.
+Since {SourceCharacter} must not contain some ASCII control characters, escape
+sequences must be used to represent these characters. The {`\`}, {`"`}
 characters also must be escaped. All other escape sequences are optional.

 **Block Strings**
@@ -892,32 +900,47 @@ StringValue :: `""`

 StringValue :: `"` StringCharacter+ `"`

-* Return the Unicode character sequence of all {StringCharacter}
-  Unicode character values.
-
-StringCharacter :: SourceCharacter but not `"` or \ or LineTerminator
-
-* Return the character value of {SourceCharacter}.
-
-StringCharacter :: \u EscapedUnicode
-
-* Return the character whose code unit value in the Unicode Basic Multilingual
-  Plane is the 16-bit hexadecimal value {EscapedUnicode}.
-
-StringCharacter :: \ EscapedCharacter
-
-* Return the character value of {EscapedCharacter} according to the table below.
-
-| Escaped Character | Code Unit Value | Character Name               |
-| ----------------- | --------------- | ---------------------------- |
-| `"`               | U+0022          | double quote                 |
-| `\`               | U+005C          | reverse solidus (back slash) |
-| `/`               | U+002F          | solidus (forward slash)      |
-| `b`               | U+0008          | backspace                    |
-| `f`               | U+000C          | form feed                    |
-| `n`               | U+000A          | line feed (new line)         |
-| `r`               | U+000D          | carriage return              |
-| `t`               | U+0009          | horizontal tab               |
+* Let {string} be the sequence of all {StringCharacter} code points.
+* For each {codePoint} at {index} in {string}:
+  * If {codePoint} is >= 0xD800 and <= 0xDBFF (a [*High Surrogate*](https://unicodebook.readthedocs.io/unicode_encodings.html#utf-16-surrogate-pairs)):
+    * Let {lowPoint} be the code point at {index} + {1} in {string}.
+    * Assert {lowPoint} is >= 0xDC00 and <= 0xDFFF (a [*Low Surrogate*](https://unicodebook.readthedocs.io/unicode_encodings.html#utf-16-surrogate-pairs)).
+    * Let {decodedPoint} = ({codePoint} - 0xD800) × 0x400 + ({lowPoint} - 0xDC00) + 0x10000.
+    * Within {string}, replace {codePoint} and {lowPoint} with {decodedPoint}.
+  * Else assert {codePoint} is not >= 0xDC00 and <= 0xDFFF (a [*Low Surrogate*](https://unicodebook.readthedocs.io/unicode_encodings.html#utf-16-surrogate-pairs)).
+* Return {string}.
+
+Note: {StringValue} should avoid encoding code points as surrogate pairs.
+While services must interpret them accordingly, a braced escape (for example
+`"\u{1F4A9}"`) is a clearer way to encode code points outside of the
+[Basic Multilingual Plane](https://unicodebook.readthedocs.io/unicode.html#bmp).
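
The surrogate-pair steps added above can be sketched as follows. This is an illustrative reading of the algorithm, not the reference implementation: the function name `combine_surrogates` is hypothetical, the input is assumed to be a list of integer code points already produced by the {StringCharacter} rules, and each failed "Assert" is modeled as a `ValueError` standing in for a parse error.

```python
from __future__ import annotations


def combine_surrogates(code_points: list[int]) -> list[int]:
    # Build a new list instead of replacing in place; the result is the same.
    result: list[int] = []
    index = 0
    while index < len(code_points):
        code_point = code_points[index]
        if 0xD800 <= code_point <= 0xDBFF:
            # High surrogate: the next code point must be a low surrogate.
            if index + 1 >= len(code_points):
                raise ValueError("high surrogate at end of string")
            low_point = code_points[index + 1]
            if not 0xDC00 <= low_point <= 0xDFFF:
                raise ValueError("high surrogate not followed by low surrogate")
            decoded = (code_point - 0xD800) * 0x400 + (low_point - 0xDC00) + 0x10000
            result.append(decoded)
            index += 2
        elif 0xDC00 <= code_point <= 0xDFFF:
            # Low surrogate without a preceding high surrogate.
            raise ValueError("unpaired low surrogate")
        else:
            result.append(code_point)
            index += 1
    return result
```

As a worked case, the pair 0xD83D, 0xDCA9 gives (0x3D × 0x400) + 0xA9 + 0x10000 = 0x1F4A9, the same code point as the braced escape `\u{1F4A9}` mentioned in the note.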
+
+StringCharacter :: SourceCharacter but not `"` or `\` or LineTerminator
+
+* Return the code point {SourceCharacter}.
+
+StringCharacter :: `\u` EscapedUnicode
+
+* Let {value} be the 21-bit hexadecimal value represented by the sequence of
+  {HexDigit} within {EscapedUnicode}.
+* Assert {value} <= 0x10FFFF.
+* Return the code point {value}.
+
+StringCharacter :: `\` EscapedCharacter
+
+* Return the code point represented by {EscapedCharacter} according to the
+  table below.
+
+| Escaped Character | Code Point | Character Name               |
+| ----------------- | ---------- | ---------------------------- |
+| {`"`}             | U+0022     | double quote                 |
+| {`\`}             | U+005C     | reverse solidus (back slash) |
+| {`/`}             | U+002F     | solidus (forward slash)      |
+| {`b`}             | U+0008     | backspace                    |
+| {`f`}             | U+000C     | form feed                    |
+| {`n`}             | U+000A     | line feed (new line)         |
+| {`r`}             | U+000D     | carriage return              |
+| {`t`}             | U+0009     | horizontal tab               |

 StringValue :: `"""` BlockStringCharacter* `"""`
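
The EscapedCharacter table in this hunk maps directly onto a lookup table. A minimal sketch, assuming hypothetical names `ESCAPED_CHARACTERS` and `escaped_character_code_point`, with `ValueError` standing in for a syntax error on an unknown escape:

```python
# Code point for each EscapedCharacter, copied from the table in the diff.
ESCAPED_CHARACTERS = {
    '"': 0x0022,   # double quote
    "\\": 0x005C,  # reverse solidus (back slash)
    "/": 0x002F,   # solidus (forward slash)
    "b": 0x0008,   # backspace
    "f": 0x000C,   # form feed
    "n": 0x000A,   # line feed (new line)
    "r": 0x000D,   # carriage return
    "t": 0x0009,   # horizontal tab
}


def escaped_character_code_point(escaped: str) -> int:
    # `escaped` is the single character that follows the backslash.
    try:
        return ESCAPED_CHARACTERS[escaped]
    except KeyError:
        raise ValueError("unknown escape sequence") from None
```

Returning an integer code point rather than a string keeps the result in the same form as the other {StringCharacter} rules, so it can feed straight into the surrogate check on the full sequence.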