Commit b45b7de

leebyron and andimarek committed
Revised RFC after feedback
Co-authored-by: Andreas Marek <andimarek@fastmail.fm>
1 parent bae929f commit b45b7de

5 files changed: +127 -61 lines changed

package.json

Lines changed: 2 additions & 2 deletions
@@ -14,9 +14,9 @@
1414
},
1515
"scripts": {
1616
"test": "npm run test:build && npm run test:spellcheck",
17-
"test:build": "spec-md spec/GraphQL.md > /dev/null",
17+
"test:build": "spec-md --metadata spec/metadata.json spec/GraphQL.md > /dev/null",
1818
"test:spellcheck": "cspell 'spec/**/*.md' README.md",
19-
"build": "mkdir -p out; spec-md --githubSource 'https://github.com/graphql/graphql-spec/blame/main/' spec/GraphQL.md > out/index.html",
19+
"build": "mkdir -p out; spec-md --metadata spec/metadata.json --githubSource 'https://github.com/graphql/graphql-spec/blame/main/' spec/GraphQL.md > out/index.html",
2020
"watch": "nodemon -e json,md --exec 'npm run build'"
2121
},
2222
"devDependencies": {

publish.sh

Lines changed: 2 additions & 2 deletions
@@ -8,9 +8,9 @@ GITTAG=$(git tag --points-at HEAD)
88
echo "Building spec"
99
mkdir -p out
1010
if [ -n "$GITTAG" ]; then
11-
spec-md --githubSource "https://github.com/graphql/graphql-spec/blame/$GITTAG/" spec/GraphQL.md > out/index.html
11+
spec-md --metadata spec/metadata.json --githubSource "https://github.com/graphql/graphql-spec/blame/$GITTAG/" spec/GraphQL.md > out/index.html
1212
else
13-
spec-md --githubSource "https://github.com/graphql/graphql-spec/blame/main/" spec/GraphQL.md > out/index.html
13+
spec-md --metadata spec/metadata.json --githubSource "https://github.com/graphql/graphql-spec/blame/main/" spec/GraphQL.md > out/index.html
1414
fi
1515
npm run build > /dev/null 2>&1
1616

spec/Appendix B -- Grammar Summary.md

Lines changed: 2 additions & 5 deletions
@@ -3,10 +3,7 @@
33
## Source Text
44

55
SourceCharacter ::
6-
- "U+0009"
7-
- "U+000A"
8-
- "U+000D"
9-
- "U+0020–U+10FFFF"
6+
- "Any Unicode scalar value"
107

118

129
## Ignored Tokens
@@ -102,8 +99,8 @@ StringCharacter ::
10299
- `\` EscapedCharacter
103100

104101
EscapedUnicode ::
102+
- `{` HexDigit+ `}`
105103
- HexDigit HexDigit HexDigit HexDigit
106-
- `{` HexDigit+ `}` "but only if <= 0x10FFFF"
107104

108105
HexDigit :: one of
109106
- `0` `1` `2` `3` `4` `5` `6` `7` `8` `9`

spec/Section 2 -- Language.md

Lines changed: 106 additions & 52 deletions
@@ -47,31 +47,22 @@ match, however some lookahead restrictions include additional constraints.
4747
## Source Text
4848

4949
SourceCharacter ::
50-
- "U+0009"
51-
- "U+000A"
52-
- "U+000D"
53-
- "U+0020–U+10FFFF"
50+
- "Any Unicode scalar value"
5451

55-
GraphQL documents are expressed as a sequence of
56-
[Unicode](https://unicode.org/standard/standard.html) code points (informally
57-
referred to as *"characters"* through most of this specification). However, with
58-
few exceptions, most of GraphQL is expressed only in the original non-control
59-
ASCII range so as to be as widely compatible with as many existing tools,
60-
languages, and serialization formats as possible and avoid display issues in
61-
text editors and source control.
52+
GraphQL documents are interpreted from a source text, which is a sequence of
53+
{SourceCharacter}, each {SourceCharacter} being a *Unicode scalar value* which
54+
may be any Unicode code point from U+0000 to U+D7FF or U+E000 to U+10FFFF
55+
(informally referred to as *"characters"* through most of this specification).
6256

63-
Note: Non-ASCII Unicode characters may appear freely within {StringValue} and
64-
{Comment} portions of GraphQL.
57+
A GraphQL document may be expressed only in the ASCII range to be as widely
58+
compatible with as many existing tools, languages, and serialization formats as
59+
possible and avoid display issues in text editors and source control. Non-ASCII
60+
Unicode scalar values may appear within {StringValue} and {Comment}.
6561

66-
67-
### Unicode
68-
69-
UnicodeBOM :: "Byte Order Mark (U+FEFF)"
70-
71-
The "Byte Order Mark" is a special Unicode character which
72-
may appear at the beginning of a file containing Unicode which programs may use
73-
to determine the fact that the text stream is Unicode, what endianness the text
74-
stream is in, and which of several Unicode encodings to interpret.
62+
Note: An implementation which uses *UTF-16* to represent GraphQL documents in
63+
memory (for example, JavaScript or Java) may encounter a *surrogate pair*. This
64+
encodes a *supplementary code point* and is a single valid source character,
65+
however an unpaired *surrogate code point* is not a valid source character.
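
The Note above can be illustrated with a minimal sketch, assuming the source text arrives as a list of UTF-16 code units (plain integers). The function name `decode_utf16_code_units` is hypothetical and is not part of the spec or of any implementation; it only shows that a surrogate pair yields one source character while an unpaired surrogate is rejected.

```python
from typing import List


def decode_utf16_code_units(units: List[int]) -> List[int]:
    """Turn UTF-16 code units into Unicode scalar values (illustrative only)."""
    scalars: List[int] = []
    index = 0
    while index < len(units):
        unit = units[index]
        if 0xD800 <= unit <= 0xDBFF:
            # Leading surrogate: it must be followed by a trailing surrogate.
            has_trailing = (
                index + 1 < len(units) and 0xDC00 <= units[index + 1] <= 0xDFFF
            )
            if not has_trailing:
                raise ValueError(f"unpaired leading surrogate at unit {index}")
            trailing = units[index + 1]
            # The pair encodes a single supplementary code point.
            scalars.append((unit - 0xD800) * 0x400 + (trailing - 0xDC00) + 0x10000)
            index += 2
        elif 0xDC00 <= unit <= 0xDFFF:
            raise ValueError(f"unpaired trailing surrogate at unit {index}")
        else:
            # Any other code unit is already a Unicode scalar value.
            scalars.append(unit)
            index += 1
    return scalars
```

For example, the code units `[0xD83D, 0xDCA9]` decode to the single scalar value U+1F4A9, while `[0xD83D]` on its own raises an error.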
7566

7667

7768
### White Space
@@ -178,6 +169,16 @@ significant way, for example a {StringValue} may contain white space characters.
178169
No {Ignored} may appear *within* a {Token}, for example no white space
179170
characters are permitted between the characters defining a {FloatValue}.
180171

172+
**Byte order mark**
173+
174+
UnicodeBOM :: "Byte Order Mark (U+FEFF)"
175+
176+
The *Byte Order Mark* is a special Unicode code point which may appear at the
177+
beginning of a file which programs may use to determine the fact that the text
178+
stream is Unicode, and what specific encoding has been used.
179+
180+
As files are often concatenated, a *Byte Order Mark* may appear anywhere within
181+
a GraphQL document and is {Ignored}.
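
As a rough sketch of the paragraph above, a lexer positioned between tokens could step over any run of Byte Order Marks. The helper `skip_byte_order_marks` is a hypothetical name, and the sketch deliberately leaves out the other {Ignored} tokens (white space, line terminators, comments, commas). It also assumes it is only called between tokens, since U+FEFF inside a {StringValue} remains significant.

```python
BYTE_ORDER_MARK = "\uFEFF"


def skip_byte_order_marks(source: str, position: int) -> int:
    """Return the first position at or after `position` that is not U+FEFF."""
    while position < len(source) and source[position] == BYTE_ORDER_MARK:
        position += 1
    return position
```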
181182

182183
### Punctuators
183184

@@ -816,8 +817,8 @@ StringCharacter ::
816817
- `\` EscapedCharacter
817818

818819
EscapedUnicode ::
820+
- `{` HexDigit+ `}`
819821
- HexDigit HexDigit HexDigit HexDigit
820-
- `{` HexDigit+ `}` "but only if <= 0x10FFFF"
821822

822823
HexDigit :: one of
823824
- `0` `1` `2` `3` `4` `5` `6` `7` `8` `9`
@@ -830,19 +831,58 @@ BlockStringCharacter ::
830831
- SourceCharacter but not `"""` or `\"""`
831832
- `\"""`
832833

833-
Strings are sequences of characters wrapped in quotation marks (U+0022).
834-
(ex. {`"Hello World"`}). White space and other otherwise-ignored characters are
835-
significant within a string value.
834+
{StringValue} is a sequence of characters wrapped in quotation marks (U+0022).
835+
(ex. {`"Hello World"`}). White space and other characters ignored in other parts
836+
of a GraphQL document are significant within a string value.
837+
838+
A {StringValue} is evaluated to a Unicode text value, a sequence of Unicode
839+
scalar values, by interpreting all escape sequences using the static semantics
840+
defined below.
836841

837842
The empty string {`""`} must not be followed by another {`"`} otherwise it would
838843
be interpreted as the beginning of a block string. As an example, the source
839844
{`""""""`} can only be interpreted as a single empty block string and not three
840845
empty strings.
841846

842-
Non-ASCII Unicode characters are allowed within single-quoted strings.
843-
Since {SourceCharacter} must not contain some ASCII control characters, escape
844-
sequences must be used to represent these characters. The {`\`}, {`"`}
845-
characters also must be escaped. All other escape sequences are optional.
847+
**Escape Sequences**
848+
849+
In a single-quoted {StringValue}, any Unicode scalar value may be expressed
850+
using an escape sequence. GraphQL strings allow both C-style escape sequences
851+
(for example `\n`) and two forms of Unicode escape sequences: one with a
852+
fixed-width of 4 hexadecimal digits (for example `\u000A`) and one with a
853+
variable-width most useful for representing a *supplementary character* such as
854+
an Emoji (for example `\u{1F4A9}`).
855+
856+
The hexadecimal number encoded by a Unicode escape sequence must describe a
857+
Unicode scalar value, otherwise parsing should stop with an early error. For
858+
example both sources `"\uDEAD"` and `"\u{110000}"` should not be considered
859+
valid {StringValue}.
860+
861+
Escape sequences are only meaningful within a single-quoted string. Within a
862+
block string, they are simply that sequence of characters (for example
863+
`"""\n"""` represents the Unicode text [U+005C, U+006E]). Within a comment an
864+
escape sequence is not a significant sequence of characters. They may not appear
865+
elsewhere in a GraphQL document.
866+
867+
Since {StringCharacter} must not contain some characters, escape sequences must
868+
be used to represent these characters. All other escape sequences are optional
869+
and unescaped non-ASCII Unicode characters are allowed within strings. If using
870+
GraphQL within a system which only supports ASCII, then escape sequences may be
871+
used to represent all Unicode characters outside of the ASCII range.
872+
873+
For legacy reasons, a *supplementary character* may be escaped by two
874+
fixed-width unicode escape sequences forming a *surrogate pair*. For example
875+
the input `"\uD83D\uDCA9"` is a valid {StringValue} which represents the same
876+
Unicode text as `"\u{1F4A9}"`. While this legacy form is allowed, it should be
877+
avoided as a variable-width unicode escape sequence is a clearer way to encode
878+
such code points.
879+
880+
When producing a {StringValue}, implementations should use escape sequences to
881+
represent non-printable control characters (U+0000 to U+001F and U+007F to
882+
U+009F). Other escape sequences are not necessary, however an implementation may
883+
use escape sequences to represent any other range of code points. If an
884+
implementation chooses to escape a *supplementary character*, it should not use
885+
a fixed-width surrogate pair unicode escape sequence.
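
The guidance above on producing a {StringValue} might look like the following sketch. The function `print_string_value` and its `ascii_only` flag are assumptions made for illustration, not part of the spec. The C-style escapes used are the ones listed in the spec's {EscapedCharacter} table, which this diff does not show. Control characters are always escaped, other characters are escaped only on request, and a *supplementary character* is written with the variable-width form rather than a surrogate pair.

```python
# Characters that have (or require) a C-style escape in a single-quoted string.
C_STYLE_ESCAPES = {
    '"': '\\"',
    "\\": "\\\\",
    "\b": "\\b",
    "\f": "\\f",
    "\n": "\\n",
    "\r": "\\r",
    "\t": "\\t",
}


def print_string_value(text: str, ascii_only: bool = False) -> str:
    """Serialize Unicode text as a single-quoted StringValue (illustrative)."""
    parts = ['"']
    for char in text:
        code_point = ord(char)
        if 0xD800 <= code_point <= 0xDFFF:
            # A Python str can hold a lone surrogate; it is not a scalar value.
            raise ValueError("text contains a surrogate code point")
        if char in C_STYLE_ESCAPES:
            parts.append(C_STYLE_ESCAPES[char])
        elif code_point <= 0x1F or 0x7F <= code_point <= 0x9F:
            # Non-printable control characters should always be escaped.
            parts.append(f"\\u{code_point:04X}")
        elif ascii_only and code_point > 0x7F:
            if code_point > 0xFFFF:
                # Supplementary character: variable-width form, never a pair.
                parts.append(f"\\u{{{code_point:X}}}")
            else:
                parts.append(f"\\u{code_point:04X}")
        else:
            parts.append(char)
    parts.append('"')
    return "".join(parts)
```

With `ascii_only=True`, the text U+1F4A9 is printed as `"\u{1F4A9}"`; without it, the character is emitted unescaped.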
846886

847887
**Block Strings**
848888

@@ -898,40 +938,55 @@ Note: If non-printable ASCII characters are needed in a string value, a standard
898938
quoted string with appropriate escape sequences must be used instead of a
899939
block string.
900940

901-
**Semantics**
941+
**Static Semantics**
942+
943+
A {StringValue} describes a Unicode text value, a sequence of *Unicode scalar
944+
value*s. These semantics describe how to apply the {StringValue} grammar to a
945+
source text to evaluate a Unicode text. Errors encountered during this
946+
evaluation are considered a failure to apply the {StringValue} grammar to a
947+
source and result in a parsing error.
902948

903949
StringValue :: `""`
904950

905951
* Return an empty sequence.
906952

907953
StringValue :: `"` StringCharacter+ `"`
908954

909-
* Let {string} be the sequence of all {StringCharacter} code points.
910-
* For each {codePoint} at {index} in {string}:
911-
* If {codePoint} is >= 0xD800 and <= 0xDBFF (a [*High Surrogate*](https://unicodebook.readthedocs.io/unicode_encodings.html#utf-16-surrogate-pairs)):
912-
* Let {lowPoint} be the code point at {index} + {1} in {string}.
913-
* Assert {lowPoint} is >= 0xDC00 and <= 0xDFFF (a [*Low Surrogate*](https://unicodebook.readthedocs.io/unicode_encodings.html#utf-16-surrogate-pairs)).
914-
* Let {decodedPoint} = ({codePoint} - 0xD800) × 0x400 + ({lowPoint} - 0xDC00) + 0x10000.
915-
* Within {string}, replace {codePoint} and {lowPoint} with {decodedPoint}.
916-
* Otherwise, assert {codePoint} is not >= 0xDC00 and <= 0xDFFF (a [*Low Surrogate*](https://unicodebook.readthedocs.io/unicode_encodings.html#utf-16-surrogate-pairs)).
917-
* Return {string}.
918-
919-
Note: {StringValue} should avoid encoding code points as surrogate pairs.
920-
While services must interpret them accordingly, a braced escape (for example
921-
`"\u{1F4A9}"`) is a clearer way to encode code points outside of the
922-
[Basic Multilingual Plane](https://unicodebook.readthedocs.io/unicode.html#bmp).
955+
* Return the concatenated sequence of *Unicode scalar value* by evaluating all
956+
{StringCharacter}.
923957

924958
StringCharacter :: SourceCharacter but not `"` or `\` or LineTerminator
925959

926-
* Return the code point {SourceCharacter}.
960+
* Return the *Unicode scalar value* {SourceCharacter}.
927961

928962
StringCharacter :: `\u` EscapedUnicode
929963

930-
* Let {value} be the 21-bit hexadecimal value represented by the sequence of
964+
* Let {value} be the hexadecimal value represented by the sequence of
931965
{HexDigit} within {EscapedUnicode}.
932-
* Assert {value} <= 0x10FFFF.
966+
* Assert {value} is within the *Unicode scalar value* range (>= 0x0000 and
967+
<= 0xD7FF or >= 0xE000 and <= 0x10FFFF).
933968
* Return the code point {value}.
934969

970+
StringCharacter :: `\u` HexDigit HexDigit HexDigit HexDigit `\u` HexDigit HexDigit HexDigit HexDigit
971+
972+
* Let {leadingValue} be the hexadecimal value represented by the first
973+
sequence of {HexDigit}.
974+
* Let {trailingValue} be the hexadecimal value represented by the second
975+
sequence of {HexDigit}.
976+
* If {leadingValue} is >= 0xD800 and <= 0xDBFF (a *Leading Surrogate*):
977+
* Assert {trailingValue} is >= 0xDC00 and <= 0xDFFF (a *Trailing Surrogate*).
978+
* Return ({leadingValue} - 0xD800) × 0x400 + ({trailingValue} - 0xDC00) + 0x10000.
979+
* Otherwise:
980+
* Assert {leadingValue} is within the *Unicode scalar value* range.
981+
* Assert {trailingValue} is within the *Unicode scalar value* range.
982+
* Return the sequence of the code point {leadingValue} followed by the code
983+
point {trailingValue}.
984+
985+
Note: If both escape sequences encode a *Unicode scalar value*, then this
986+
semantic is identical to applying the prior semantic on each fixed-width escape
987+
sequence. A variable-width escape sequence must only encode a
988+
*Unicode scalar value*.
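
The `\u` semantics above can be sketched as follows. This is a simplified illustration, not the reference algorithm: `evaluate_string_body` is a hypothetical name, it receives the text between the quotation marks, and it assumes the lexer has already rejected unescaped `"` and line terminators. The simple escapes come from the spec's {EscapedCharacter} table, which this diff does not show.

```python
import re

SIMPLE_ESCAPES = {
    '"': '"', "\\": "\\", "/": "/",
    "b": "\b", "f": "\f", "n": "\n", "r": "\r", "t": "\t",
}
HEX4 = re.compile(r"[0-9A-Fa-f]{4}")
BRACED = re.compile(r"\{([0-9A-Fa-f]+)\}")


def is_unicode_scalar_value(value: int) -> bool:
    return 0x0000 <= value <= 0xD7FF or 0xE000 <= value <= 0x10FFFF


def evaluate_string_body(body: str) -> str:
    """Evaluate the characters of a single-quoted string to Unicode text."""
    out = []
    i = 0
    while i < len(body):
        if body[i] != "\\":
            out.append(body[i])
            i += 1
            continue
        kind = body[i + 1:i + 2]
        if kind in SIMPLE_ESCAPES:
            out.append(SIMPLE_ESCAPES[kind])
            i += 2
            continue
        if kind != "u":
            raise ValueError(f"invalid escape sequence at index {i}")
        braced = BRACED.match(body, i + 2)
        if braced:
            # Variable-width form: must encode a Unicode scalar value.
            value = int(braced.group(1), 16)
            if not is_unicode_scalar_value(value):
                raise ValueError(f"invalid Unicode escape at index {i}")
            out.append(chr(value))
            i = braced.end()
            continue
        leading = HEX4.match(body, i + 2)
        if not leading:
            raise ValueError(f"invalid Unicode escape at index {i}")
        value = int(leading.group(), 16)
        if 0xD800 <= value <= 0xDBFF:
            # Leading surrogate: a fixed-width trailing surrogate must follow.
            trailing = None
            if body[leading.end():leading.end() + 2] == "\\u":
                trailing = HEX4.match(body, leading.end() + 2)
            low = int(trailing.group(), 16) if trailing else -1
            if not 0xDC00 <= low <= 0xDFFF:
                raise ValueError(f"unpaired leading surrogate at index {i}")
            out.append(chr((value - 0xD800) * 0x400 + (low - 0xDC00) + 0x10000))
            i = trailing.end()
            continue
        if not is_unicode_scalar_value(value):
            raise ValueError(f"invalid Unicode escape at index {i}")
        out.append(chr(value))
        i = leading.end()
    return "".join(out)
```

Under these assumptions, the bodies `\uD83D\uDCA9` and `\u{1F4A9}` evaluate to the same text, while `\uDEAD` and `\u{110000}` raise an error, matching the examples given in the Escape Sequences section.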
989+
935990
StringCharacter :: `\` EscapedCharacter
936991

937992
* Return the code point represented by {EscapedCharacter} according to the
@@ -950,14 +1005,13 @@ StringCharacter :: `\` EscapedCharacter
9501005

9511006
StringValue :: `"""` BlockStringCharacter* `"""`
9521007

953-
* Let {rawValue} be the Unicode character sequence of all
954-
{BlockStringCharacter} Unicode character values (which may be an empty
955-
sequence).
1008+
* Let {rawValue} be the concatenated sequence of *Unicode scalar value* by
1009+
evaluating all {BlockStringCharacter} (which may be an empty sequence).
9561010
* Return the result of {BlockStringValue(rawValue)}.
9571011

9581012
BlockStringCharacter :: SourceCharacter but not `"""` or `\"""`
9591013

960-
* Return the character value of {SourceCharacter}.
1014+
* Return the *Unicode scalar value* {SourceCharacter}.
9611015

9621016
BlockStringCharacter :: `\"""`
9631017

spec/metadata.json

Lines changed: 15 additions & 0 deletions
@@ -0,0 +1,15 @@
1+
{
2+
"biblio": {
3+
"https://www.unicode.org/glossary": {
4+
"byte-order-mark": "#byte_order_mark",
5+
"leading-surrogate": "#leading_surrogate",
6+
"trailing-surrogate": "#trailing_surrogate",
7+
"supplementary-character": "#supplementary_character",
8+
"supplementary-code-point": "#supplementary_code_point",
9+
"surrogate-code-point": "#surrogate_code_point",
10+
"surrogate-pair": "#surrogate_pair",
11+
"unicode-scalar-value": "#unicode_scalar_value",
12+
"utf-16": "#UTF_16"
13+
}
14+
}
15+
}
