RFC: Allow full unicode range

leebyron · leebyron · commit bae929fe29da · 2021-05-27T12:54:46.000-07:00
This spec text implements #687 (full context and details there) and also introduces a new escape sequence.

Three distinct changes:

1. Change SourceCharacter to allow code points above 0xFFFF, up to 0x10FFFF.
2. Allow surrogate pairs within StringValue. Illegal pairs are handled with a parse error.
3. Introduce new syntax for a full-range code point EscapedUnicode. This syntax (`\u{1F37A}`) has been adopted by many other languages, and I propose GraphQL adopt it as well.

(As a bonus, this removes the last instance of a regex in the lexer grammar!)
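
For reviewers who want to see the three changes acting together, here is a rough Python sketch of how a lexer might evaluate the contents of a StringValue under this proposal. It is not taken from any reference implementation. The names `StringValueError`, `read_escaped_unicode`, `combine_surrogates`, and `decode_string_body` are hypothetical, and the sketch assumes the input is the text between the quotes of a single-line string.

```python
from typing import List, Tuple

MAX_CODE_POINT = 0x10FFFF
HEX_DIGITS = set("0123456789abcdefABCDEF")
# EscapedCharacter :: one of " \ / b f n r t
SIMPLE_ESCAPES = {'"': '"', "\\": "\\", "/": "/", "b": "\b",
                  "f": "\f", "n": "\n", "r": "\r", "t": "\t"}


class StringValueError(ValueError):
    """Hypothetical stand-in for a GraphQL parse error."""


def read_escaped_unicode(body: str, pos: int) -> Tuple[int, int]:
    # `pos` points just past the backslash-u; returns (value, next position).
    if body.startswith("{", pos):
        # Braced form: `{` HexDigit+ `}`
        end = body.find("}", pos)
        digits = body[pos + 1:end] if end != -1 else ""
        next_pos = end + 1
    else:
        # Fixed-width form: exactly four HexDigit
        digits = body[pos:pos + 4]
        next_pos = pos + 4
        if len(digits) != 4:
            raise StringValueError("expected four hex digits")
    if not digits or any(ch not in HEX_DIGITS for ch in digits):
        raise StringValueError("malformed unicode escape")
    value = int(digits, 16)
    if value > MAX_CODE_POINT:
        raise StringValueError("escape exceeds 0x10FFFF")
    return value, next_pos


def combine_surrogates(code_points: List[int]) -> List[int]:
    # Mirrors the StringValue steps in this diff: a high surrogate must be
    # followed by a low surrogate, and a lone low surrogate is an error.
    result = []
    i = 0
    while i < len(code_points):
        cp = code_points[i]
        if 0xD800 <= cp <= 0xDBFF:
            low = code_points[i + 1] if i + 1 < len(code_points) else -1
            if not 0xDC00 <= low <= 0xDFFF:
                raise StringValueError("high surrogate without low surrogate")
            result.append((cp - 0xD800) * 0x400 + (low - 0xDC00) + 0x10000)
            i += 2
        elif 0xDC00 <= cp <= 0xDFFF:
            raise StringValueError("unpaired low surrogate")
        else:
            result.append(cp)
            i += 1
    return result


def decode_string_body(body: str) -> str:
    # Collect one code point per StringCharacter, then pair up surrogates.
    code_points = []
    pos = 0
    while pos < len(body):
        if body[pos] != "\\":
            code_points.append(ord(body[pos]))
            pos += 1
        elif body.startswith("u", pos + 1):
            value, pos = read_escaped_unicode(body, pos + 2)
            code_points.append(value)
        else:
            escaped = body[pos + 1:pos + 2]
            if escaped not in SIMPLE_ESCAPES:
                raise StringValueError("unknown escape sequence")
            code_points.append(ord(SIMPLE_ESCAPES[escaped]))
            pos += 2
    return "".join(chr(cp) for cp in combine_surrogates(code_points))
```

Under these assumptions, `decode_string_body(r"\u{1F37A}")` and `decode_string_body(r"\uD83C\uDF7A")` produce the same single code point U+1F37A, while a lone `\uD83C` or an out-of-range `\u{110000}` raises the placeholder error.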
diff --git a/spec/Appendix B -- Grammar Summary.md b/spec/Appendix B -- Grammar Summary.md
@@ -6,7 +6,7 @@ SourceCharacter ::
   - "U+0009"
   - "U+000A"
   - "U+000D"
-  - "U+0020–U+FFFF"
+  - "U+0020–U+10FFFF"
 
 
 ## Ignored Tokens
@@ -101,7 +101,14 @@ StringCharacter ::
   - `\u` EscapedUnicode
   - `\` EscapedCharacter
 
-EscapedUnicode :: /[0-9A-Fa-f]{4}/
+EscapedUnicode ::
+  - HexDigit HexDigit HexDigit HexDigit
+  - `{` HexDigit+ `}` "but only if <= 0x10FFFF"
+
+HexDigit :: one of
+  - `0` `1` `2` `3` `4` `5` `6` `7` `8` `9`
+  - `A` `B` `C` `D` `E` `F`
+  - `a` `b` `c` `d` `e` `f`
 
 EscapedCharacter :: one of `"` `\` `/` `b` `f` `n` `r` `t`
 
diff --git a/spec/Section 2 -- Language.md b/spec/Section 2 -- Language.md
@@ -50,7 +50,7 @@ SourceCharacter ::
   - "U+0009"
   - "U+000A"
   - "U+000D"
-  - "U+0020–U+FFFF"
+  - "U+0020–U+10FFFF"
 
 GraphQL documents are expressed as a sequence of
 [Unicode](https://unicode.org/standard/standard.html) code points (informally
@@ -815,7 +815,14 @@ StringCharacter ::
   - `\u` EscapedUnicode
   - `\` EscapedCharacter
 
-EscapedUnicode :: /[0-9A-Fa-f]{4}/
+EscapedUnicode ::
+  - HexDigit HexDigit HexDigit HexDigit
+  - `{` HexDigit+ `}` "but only if <= 0x10FFFF"
+
+HexDigit :: one of
+  - `0` `1` `2` `3` `4` `5` `6` `7` `8` `9`
+  - `A` `B` `C` `D` `E` `F`
+  - `a` `b` `c` `d` `e` `f`
 
 EscapedCharacter :: one of `"` `\` `/` `b` `f` `n` `r` `t`
 
@@ -899,16 +906,30 @@ StringValue :: `""`
 
 StringValue :: `"` StringCharacter+ `"`
 
-  * Return the sequence of all {StringCharacter} code points.
+  * Let {string} be the sequence of all {StringCharacter} code points.
+  * For each {codePoint} at {index} in {string}:
+    * If {codePoint} is >= 0xD800 and <= 0xDBFF (a [*High Surrogate*](https://unicodebook.readthedocs.io/unicode_encodings.html#utf-16-surrogate-pairs)):
+      * Let {lowPoint} be the code point at {index} + {1} in {string}.
+      * Assert {lowPoint} is >= 0xDC00 and <= 0xDFFF (a [*Low Surrogate*](https://unicodebook.readthedocs.io/unicode_encodings.html#utf-16-surrogate-pairs)).
+      * Let {decodedPoint} = ({codePoint} - 0xD800) × 0x400 + ({lowPoint} - 0xDC00) + 0x10000.
+      * Within {string}, replace {codePoint} and {lowPoint} with {decodedPoint}.
+    * Otherwise, assert {codePoint} is not >= 0xDC00 and <= 0xDFFF (a [*Low Surrogate*](https://unicodebook.readthedocs.io/unicode_encodings.html#utf-16-surrogate-pairs)).
+  * Return {string}.
+
+Note: {StringValue} should avoid encoding code points as surrogate pairs.
+While services must interpret them accordingly, a braced escape (for example
+`"\u{1F4A9}"`) is a clearer way to encode code points outside of the
+[Basic Multilingual Plane](https://unicodebook.readthedocs.io/unicode.html#bmp).
 
 StringCharacter :: SourceCharacter but not `"` or `\` or LineTerminator
 
   * Return the code point {SourceCharacter}.
 
 StringCharacter :: `\u` EscapedUnicode
 
-  * Let {value} be the 16-bit hexadecimal value represented by the sequence of
-    hexadecimal digits within {EscapedUnicode}.
+  * Let {value} be the 21-bit hexadecimal value represented by the sequence of
+    {HexDigit} within {EscapedUnicode}.
+  * Assert {value} <= 0x10FFFF.
   * Return the code point {value}.
 
 StringCharacter :: `\` EscapedCharacter