Commit bae929f

RFC: Allow full unicode range
This spec text implements #687 (full context and details there) and also introduces a new escape sequence.

Three distinct changes:

1. Change SourceCharacter to allow points above 0xFFFF, now to 0x10FFFF.
2. Allow surrogate pairs within StringValue. This handles illegal pairs with a parse error.
3. Introduce new syntax for full range code point EscapedUnicode. This syntax (`\u{1F37A}`) has been adopted by many other languages and I propose GraphQL adopt it as well.

(As a bonus, this removes the last instance of a regex in the lexer grammar!)
1 parent fdeb37d commit bae929f

2 files changed, +35 -7 lines changed

spec/Appendix B -- Grammar Summary.md

Lines changed: 9 additions & 2 deletions
@@ -6,7 +6,7 @@ SourceCharacter ::
 - "U+0009"
 - "U+000A"
 - "U+000D"
-- "U+0020–U+FFFF"
+- "U+0020–U+10FFFF"


 ## Ignored Tokens
@@ -101,7 +101,14 @@ StringCharacter ::
 - `\u` EscapedUnicode
 - `\` EscapedCharacter

-EscapedUnicode :: /[0-9A-Fa-f]{4}/
+EscapedUnicode ::
+- HexDigit HexDigit HexDigit HexDigit
+- `{` HexDigit+ `}` "but only if <= 0x10FFFF"
+
+HexDigit :: one of
+- `0` `1` `2` `3` `4` `5` `6` `7` `8` `9`
+- `A` `B` `C` `D` `E` `F`
+- `a` `b` `c` `d` `e` `f`

 EscapedCharacter :: one of `"` `\` `/` `b` `f` `n` `r` `t`

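The two grammar changes in this file can be sketched in Python as follows. This is only an illustration of the rules in the diff, not code from any GraphQL implementation: the names `is_source_character`, `read_escaped_unicode`, `HEX_DIGITS` and `MAX_CODE_POINT` are hypothetical, and raising `ValueError` is an assumed stand-in for a lexer's parse error.

```python
# Hypothetical helpers illustrating the updated grammar; not from any library.
HEX_DIGITS = frozenset("0123456789ABCDEFabcdef")
MAX_CODE_POINT = 0x10FFFF


def is_source_character(code_point: int) -> bool:
    """SourceCharacter: U+0009, U+000A, U+000D, or U+0020 through U+10FFFF."""
    return code_point in (0x09, 0x0A, 0x0D) or 0x20 <= code_point <= MAX_CODE_POINT


def read_escaped_unicode(text: str, pos: int) -> tuple[int, int]:
    """Read EscapedUnicode at text[pos], the position just after the escape prefix.

    Returns (value, index just past the escape). ValueError is an assumed
    stand-in for a parse error.
    """
    if text[pos:pos + 1] == "{":
        # Braced form: `{` HexDigit+ `}` "but only if <= 0x10FFFF".
        end = text.find("}", pos)
        digits = text[pos + 1:end] if end != -1 else ""
        if not digits or any(ch not in HEX_DIGITS for ch in digits):
            raise ValueError(f"invalid braced unicode escape at index {pos}")
        value = int(digits, 16)
        if value > MAX_CODE_POINT:
            raise ValueError(f"unicode escape above 0x10FFFF at index {pos}")
        return value, end + 1
    # Fixed-width form: HexDigit HexDigit HexDigit HexDigit.
    digits = text[pos:pos + 4]
    if len(digits) != 4 or any(ch not in HEX_DIGITS for ch in digits):
        raise ValueError(f"invalid unicode escape at index {pos}")
    return int(digits, 16), pos + 4
```

Checking each character against `HEX_DIGITS` mirrors the explicit HexDigit production that replaces the old regex, and keeps `int(..., 16)` from accepting inputs such as signs or underscores that the grammar does not allow.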
spec/Section 2 -- Language.md

Lines changed: 26 additions & 5 deletions
@@ -50,7 +50,7 @@ SourceCharacter ::
 - "U+0009"
 - "U+000A"
 - "U+000D"
-- "U+0020–U+FFFF"
+- "U+0020–U+10FFFF"

 GraphQL documents are expressed as a sequence of
 [Unicode](https://unicode.org/standard/standard.html) code points (informally
@@ -815,7 +815,14 @@ StringCharacter ::
 - `\u` EscapedUnicode
 - `\` EscapedCharacter

-EscapedUnicode :: /[0-9A-Fa-f]{4}/
+EscapedUnicode ::
+- HexDigit HexDigit HexDigit HexDigit
+- `{` HexDigit+ `}` "but only if <= 0x10FFFF"
+
+HexDigit :: one of
+- `0` `1` `2` `3` `4` `5` `6` `7` `8` `9`
+- `A` `B` `C` `D` `E` `F`
+- `a` `b` `c` `d` `e` `f`

 EscapedCharacter :: one of `"` `\` `/` `b` `f` `n` `r` `t`

@@ -899,16 +906,30 @@ StringValue :: `""`

 StringValue :: `"` StringCharacter+ `"`

-* Return the sequence of all {StringCharacter} code points.
+* Let {string} be the sequence of all {StringCharacter} code points.
+* For each {codePoint} at {index} in {string}:
+  * If {codePoint} is >= 0xD800 and <= 0xDBFF (a [*High Surrogate*](https://unicodebook.readthedocs.io/unicode_encodings.html#utf-16-surrogate-pairs)):
+    * Let {lowPoint} be the code point at {index} + {1} in {string}.
+    * Assert {lowPoint} is >= 0xDC00 and <= 0xDFFF (a [*Low Surrogate*](https://unicodebook.readthedocs.io/unicode_encodings.html#utf-16-surrogate-pairs)).
+    * Let {decodedPoint} = ({codePoint} - 0xD800) × 0x400 + ({lowPoint} - 0xDC00) + 0x10000.
+    * Within {string}, replace {codePoint} and {lowPoint} with {decodedPoint}.
+  * Otherwise, assert {codePoint} is not >= 0xDC00 and <= 0xDFFF (a [*Low Surrogate*](https://unicodebook.readthedocs.io/unicode_encodings.html#utf-16-surrogate-pairs)).
+* Return {string}.
+
+Note: {StringValue} should avoid encoding code points as surrogate pairs.
+While services must interpret them accordingly, a braced escape (for example
+`"\u{1F4A9}"`) is a clearer way to encode code points outside of the
+[Basic Multilingual Plane](https://unicodebook.readthedocs.io/unicode.html#bmp).

 StringCharacter :: SourceCharacter but not `"` or `\` or LineTerminator

 * Return the code point {SourceCharacter}.

 StringCharacter :: `\u` EscapedUnicode

-* Let {value} be the 16-bit hexadecimal value represented by the sequence of
-  hexadecimal digits within {EscapedUnicode}.
+* Let {value} be the 21-bit hexadecimal value represented by the sequence of
+  {HexDigit} within {EscapedUnicode}.
+* Assert {value} <= 0x10FFFF.
 * Return the code point {value}.

 StringCharacter :: `\` EscapedCharacter
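The new StringValue steps above can be sketched in Python as a single pass over the code points. This is a minimal illustration under stated assumptions rather than the spec's or any implementation's code: `combine_surrogates` is a hypothetical name, `ValueError` stands in for the parse error that a failed Assert would produce, and a high surrogate at the very end of the string is treated as a failed assertion, a case the steps in the diff do not spell out.

```python
def combine_surrogates(code_points: list[int]) -> list[int]:
    """Replace each high/low surrogate pair with the code point it encodes.

    Hypothetical helper; ValueError is an assumed stand-in for a parse error.
    """
    result: list[int] = []
    index = 0
    while index < len(code_points):
        code_point = code_points[index]
        if 0xD800 <= code_point <= 0xDBFF:  # High Surrogate
            # Assumption: a trailing high surrogate counts as an illegal pair.
            if index + 1 >= len(code_points):
                raise ValueError("high surrogate at end of string")
            low_point = code_points[index + 1]
            if not 0xDC00 <= low_point <= 0xDFFF:
                raise ValueError("high surrogate not followed by low surrogate")
            decoded_point = (
                (code_point - 0xD800) * 0x400 + (low_point - 0xDC00) + 0x10000
            )
            result.append(decoded_point)
            index += 2  # Both halves of the pair are consumed.
        elif 0xDC00 <= code_point <= 0xDFFF:  # Unpaired Low Surrogate
            raise ValueError("low surrogate without preceding high surrogate")
        else:
            result.append(code_point)
            index += 1
    return result
```

As a worked case, the pair 0xD83C, 0xDF7A decodes to (0x3C × 0x400) + 0x37A + 0x10000 = 0x1F37A, the same code point as the braced escape `\u{1F37A}` from the commit message. Building a new list instead of replacing in place is a design choice of this sketch; it avoids shifting indices while iterating.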
