referred to as *"characters"* through most of this specification). However, with
58
-
few exceptions, most of GraphQL is expressed only in the original non-control
59
-
ASCII range so as to be as widely compatible with as many existing tools,
60
-
languages, and serialization formats as possible and avoid display issues in
61
-
text editors and source control.
52
+
GraphQL documents are interpreted from a source text, which is a sequence of
53
+
{SourceCharacter}, each {SourceCharacter} being a *Unicode scalar value* which
54
+
may be any Unicode code point from U+0000 to U+D7FF or U+E000 to U+10FFFF
55
+
(informally referred to as *"characters"* through most of this specification).
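For illustration, the range check in the added definition can be sketched as below. This is a minimal sketch, not part of the specification; the function name `is_source_character` is hypothetical.

```python
def is_source_character(code_point: int) -> bool:
    """Return True if code_point is a Unicode scalar value.

    Hypothetical helper: accepts U+0000..U+D7FF and U+E000..U+10FFFF,
    which excludes the surrogate range U+D800..U+DFFF.
    """
    return 0x0000 <= code_point <= 0xD7FF or 0xE000 <= code_point <= 0x10FFFF
```

Under this check, `0xD800` (a surrogate code point) and `0x110000` (beyond the Unicode range) are both rejected.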
62
56
63
-
Note: Non-ASCII Unicode characters may appear freely within {StringValue} and
64
-
{Comment} portions of GraphQL.
57
+
A GraphQL document may be expressed only in the ASCII range to be compatible
58
+
with as many existing tools, languages, and serialization formats as
59
+
possible and avoid display issues in text editors and source control. Non-ASCII
60
+
Unicode scalar values may appear within {StringValue} and {Comment}.
65
61
66
-
67
-
### Unicode
68
-
69
-
UnicodeBOM :: "Byte Order Mark (U+FEFF)"
70
-
71
-
The "Byte Order Mark" is a special Unicode character which
72
-
may appear at the beginning of a file containing Unicode which programs may use
73
-
to determine the fact that the text stream is Unicode, what endianness the text
74
-
stream is in, and which of several Unicode encodings to interpret.
62
+
Note: An implementation which uses *UTF-16* to represent GraphQL documents in
63
+
memory (for example, JavaScript or Java) may encounter a *surrogate pair*. This
64
+
encodes a *supplementary code point* and is a single valid source character;
65
+
however, an unpaired *surrogate code point* is not a valid source character.
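As a rough illustration of the note above, a UTF-16 based implementation might turn code units into source characters along these lines. This is a sketch under assumptions: the function name `scalar_values_from_utf16` and the use of `ValueError` are hypothetical, and the document is assumed to arrive as a list of 16-bit code units.

```python
from typing import List


def scalar_values_from_utf16(code_units: List[int]) -> List[int]:
    """Decode UTF-16 code units into Unicode scalar values.

    A surrogate pair becomes one supplementary code point; an unpaired
    surrogate raises ValueError (hypothetical error handling).
    """
    result = []
    i = 0
    while i < len(code_units):
        unit = code_units[i]
        if 0xD800 <= unit <= 0xDBFF:  # leading (high) surrogate
            has_trailing = (
                i + 1 < len(code_units) and 0xDC00 <= code_units[i + 1] <= 0xDFFF
            )
            if not has_trailing:
                raise ValueError(f"unpaired leading surrogate at index {i}")
            trailing = code_units[i + 1]
            # Combine the pair into a single supplementary code point.
            result.append(0x10000 + ((unit - 0xD800) << 10) + (trailing - 0xDC00))
            i += 2
        elif 0xDC00 <= unit <= 0xDFFF:  # trailing (low) surrogate without a lead
            raise ValueError(f"unpaired trailing surrogate at index {i}")
        else:
            result.append(unit)
            i += 1
    return result
```

For example, the code units `[0xD83D, 0xDCA9]` decode to the single source character U+1F4A9, while `[0xD83D]` on its own is rejected.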
75
66
76
67
77
68
### White Space
@@ -178,6 +169,16 @@ significant way, for example a {StringValue} may contain white space characters.
178
169
No {Ignored} may appear *within* a {Token}, for example no white space
179
170
characters are permitted between the characters defining a {FloatValue}.
180
171
172
+
**Byte order mark**
173
+
174
+
UnicodeBOM :: "Byte Order Mark (U+FEFF)"
175
+
176
+
The *Byte Order Mark* is a special Unicode code point which may appear at the
177
+
beginning of a file, and which programs may use to determine that the text
178
+
stream is Unicode and which specific encoding has been used.
179
+
180
+
As files are often concatenated, a *Byte Order Mark* may appear anywhere within
181
+
a GraphQL document and is {Ignored}.
181
182
182
183
### Punctuators
183
184
@@ -816,8 +817,8 @@ StringCharacter ::
816
817
-`\` EscapedCharacter
817
818
818
819
EscapedUnicode ::
820
+
-`{` HexDigit+ `}`
819
821
- HexDigit HexDigit HexDigit HexDigit
820
-
-`{` HexDigit+ `}` "but only if <= 0x10FFFF"
821
822
822
823
HexDigit :: one of
823
824
-`0``1``2``3``4``5``6``7``8``9`
@@ -830,19 +831,58 @@ BlockStringCharacter ::
830
831
- SourceCharacter but not `"""` or `\"""`
831
832
-`\"""`
832
833
833
-
Strings are sequences of characters wrapped in quotation marks (U+0022).
834
-
(ex. {`"Hello World"`}). White space and other otherwise-ignored characters are
835
-
significant within a string value.
834
+
A {StringValue} is a sequence of characters wrapped in quotation marks (U+0022)
835
+
(for example, {`"Hello World"`}). White space and other characters ignored in other parts
836
+
of a GraphQL document are significant within a string value.
837
+
838
+
A {StringValue} is evaluated to a Unicode text value, a sequence of Unicode
839
+
scalar values, by interpreting all escape sequences using the static semantics
840
+
defined below.
836
841
837
842
The empty string {`""`} must not be followed by another {`"`} otherwise it would
838
843
be interpreted as the beginning of a block string. As an example, the source
839
844
{`""""""`} can only be interpreted as a single empty block string and not three
840
845
empty strings.
841
846
842
-
Non-ASCII Unicode characters are allowed within single-quoted strings.
843
-
Since {SourceCharacter} must not contain some ASCII control characters, escape
844
-
sequences must be used to represent these characters. The {`\`}, {`"`}
845
-
characters also must be escaped. All other escape sequences are optional.
847
+
**Escape Sequences**
848
+
849
+
In a single-quoted {StringValue}, any Unicode scalar value may be expressed
850
+
using an escape sequence. GraphQL strings allow both C-style escape sequences
851
+
(for example `\n`) and two forms of Unicode escape sequences: one with a
852
+
fixed width of 4 hexadecimal digits (for example `\u000A`) and one with a
853
+
variable width, most useful for representing a *supplementary character* such as
854
+
an Emoji (for example `\u{1F4A9}`).
855
+
856
+
The hexadecimal number encoded by a Unicode escape sequence must describe a
857
+
Unicode scalar value; otherwise, parsing should stop with an early error. For
858
+
example, neither `"\uDEAD"` nor `"\u{110000}"` should be considered a
859
+
valid {StringValue}.
860
+
861
+
Escape sequences are only meaningful within a single-quoted string. Within a
862
+
block string, they are simply that sequence of characters (for example
863
+
`"""\n"""` represents the Unicode text [U+005C, U+006E]). Within a comment an
864
+
escape sequence is not a significant sequence of characters. They may not appear
865
+
elsewhere in a GraphQL document.
866
+
867
+
Since {StringCharacter} must not contain some characters, escape sequences must
868
+
be used to represent these characters. All other escape sequences are optional
869
+
and unescaped non-ASCII Unicode characters are allowed within strings. If using
870
+
GraphQL within a system which only supports ASCII, then escape sequences may be
871
+
used to represent all Unicode characters outside of the ASCII range.
872
+
873
+
For legacy reasons, a *supplementary character* may be escaped by two
874
+
fixed-width Unicode escape sequences forming a *surrogate pair*. For example,
875
+
the input `"\uD83D\uDCA9"` is a valid {StringValue} which represents the same
876
+
Unicode text as `"\u{1F4A9}"`. While this legacy form is allowed, it should be
877
+
avoided, as a variable-width Unicode escape sequence is a clearer way to encode
878
+
such code points.
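The escape rules described in this section can be sketched roughly as follows. This is an illustrative sketch, not the reference implementation: the name `evaluate_string_body`, the use of `ValueError` for an early error, and the choice to operate on the text between the quotation marks are all assumptions.

```python
import re

# Matches the variable-width form \u{...} (group 1) or the fixed-width form \uXXXX (group 2).
_UNICODE_ESCAPE = re.compile(r"\\u\{([0-9A-Fa-f]+)\}|\\u([0-9A-Fa-f]{4})")
_SIMPLE_ESCAPES = {
    '"': '"', "\\": "\\", "/": "/",
    "b": "\b", "f": "\f", "n": "\n", "r": "\r", "t": "\t",
}


def evaluate_string_body(body: str) -> str:
    """Evaluate escape sequences in the body of a single-quoted string."""
    out = []
    pos = 0
    while pos < len(body):
        if body[pos] != "\\":
            out.append(body[pos])
            pos += 1
            continue
        match = _UNICODE_ESCAPE.match(body, pos)
        if match is None:
            # C-style escape such as \n; anything else is an error.
            simple = body[pos + 1 : pos + 2]
            if simple not in _SIMPLE_ESCAPES:
                raise ValueError(f"invalid escape sequence at index {pos}")
            out.append(_SIMPLE_ESCAPES[simple])
            pos += 2
            continue
        value = int(match.group(1) or match.group(2), 16)
        end = match.end()
        if match.group(2) is not None and 0xD800 <= value <= 0xDBFF:
            # Legacy form: a fixed-width leading surrogate must be followed
            # by a fixed-width trailing surrogate.
            trail = _UNICODE_ESCAPE.match(body, end)
            low = int(trail.group(2), 16) if trail and trail.group(2) else -1
            if not 0xDC00 <= low <= 0xDFFF:
                raise ValueError(f"unpaired surrogate escape at index {pos}")
            value = 0x10000 + ((value - 0xD800) << 10) + (low - 0xDC00)
            end = trail.end()
        elif 0xD800 <= value <= 0xDFFF or value > 0x10FFFF:
            raise ValueError(f"escape at index {pos} is not a Unicode scalar value")
        out.append(chr(value))
        pos = end
    return "".join(out)
```

With this sketch, the bodies `\uD83D\uDCA9` and `\u{1F4A9}` evaluate to the same one-character text, while `\uDEAD` and `\u{110000}` are rejected.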
879
+
880
+
When producing a {StringValue}, implementations should use escape sequences to
881
+
represent non-printable control characters (U+0000 to U+001F and U+007F to
882
+
U+009F). Other escape sequences are not necessary; however, an implementation may
883
+
use escape sequences to represent any other range of code points. If an
884
+
implementation chooses to escape a *supplementary character*, it should not use
885
+
a fixed-width surrogate pair Unicode escape sequence.
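One way a printer could follow this guidance is sketched below. The name `print_string` is hypothetical, the input is assumed to be valid Unicode text (no unpaired surrogates), and preferring the short C-style forms for `\b`, `\f`, `\n`, `\r` and `\t` is a choice made for this sketch rather than something the text above requires.

```python
# Short C-style forms used where available (a choice of this sketch).
_SHORT_ESCAPES = {"\b": "\\b", "\f": "\\f", "\n": "\\n", "\r": "\\r", "\t": "\\t"}


def print_string(value: str) -> str:
    """Produce a single-quoted string literal for the given Unicode text."""
    out = ['"']
    for ch in value:
        code_point = ord(ch)
        if ch == '"':
            out.append('\\"')
        elif ch == "\\":
            out.append("\\\\")
        elif code_point <= 0x1F or 0x7F <= code_point <= 0x9F:
            # Non-printable control characters are always escaped.
            out.append(_SHORT_ESCAPES.get(ch, f"\\u{code_point:04X}"))
        else:
            # Everything else, including supplementary characters, is emitted
            # unescaped, so no surrogate pair escapes are ever produced.
            out.append(ch)
    out.append('"')
    return "".join(out)
```

Because supplementary characters are written through unchanged, this sketch never needs the legacy surrogate pair form; an ASCII-only printer would instead emit `\u{...}` for them.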
846
886
847
887
**Block Strings**
848
888
@@ -898,40 +938,55 @@ Note: If non-printable ASCII characters are needed in a string value, a standard
898
938
quoted string with appropriate escape sequences must be used instead of a
899
939
block string.
900
940
901
-
**Semantics**
941
+
**Static Semantics**
942
+
943
+
A {StringValue} describes a Unicode text value, a sequence of *Unicode scalar
944
+
value*s. These semantics describe how to apply the {StringValue} grammar to a
945
+
source text to evaluate a Unicode text. Errors encountered during this
946
+
evaluation are considered a failure to apply the {StringValue} grammar to a
947
+
source and result in a parsing error.
902
948
903
949
StringValue :: `""`
904
950
905
951
* Return an empty sequence.
906
952
907
953
StringValue :: `"` StringCharacter+ `"`
908
954
909
-
* Let {string} be the sequence of all {StringCharacter} code points.
910
-
* For each {codePoint} at {index} in {string}:
911
-
* If {codePoint} is >= 0xD800 and <= 0xDBFF (a [*High Surrogate*](https://unicodebook.readthedocs.io/unicode_encodings.html#utf-16-surrogate-pairs)):
912
-
* Let {lowPoint} be the code point at {index} + {1} in {string}.
913
-
* Assert {lowPoint} is >= 0xDC00 and <= 0xDFFF (a [*Low Surrogate*](https://unicodebook.readthedocs.io/unicode_encodings.html#utf-16-surrogate-pairs)).
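914
-
* Let {decodedPoint} = ({codePoint} - 0xD800) × 0x400 + ({lowPoint} - 0xDC00) + 0x10000.
915
-
* Within {string}, replace {codePoint} and {lowPoint} with {decodedPoint}.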
916
-
* Otherwise, assert {codePoint} is not >= 0xDC00 and <= 0xDFFF (a [*Low Surrogate*](https://unicodebook.readthedocs.io/unicode_encodings.html#utf-16-surrogate-pairs)).
917
-
* Return {string}.
918
-
919
-
Note: {StringValue} should avoid encoding code points as surrogate pairs.
920
-
While services must interpret them accordingly, a braced escape (for example
921
-
`"\u{1F4A9}"`) is a clearer way to encode code points outside of the