
[RFC] Support full Unicode character range #231


Closed
wants to merge 1 commit
2 changes: 1 addition & 1 deletion spec/Appendix B -- Grammar Summary.md
@@ -1,6 +1,6 @@
# B. Appendix: Grammar Summary

SourceCharacter :: /[\u0009\u000A\u000D\u0020-\uFFFF]/
SourceCharacter :: "Any Unicode code point"


One thing to note is that XML 1.0 forbids many of the same C0 control characters that the previous definition did:

https://www.w3.org/TR/xml/#charsets

    Char       ::=      #x9 | #xA | #xD | [#x20-#xD7FF] | [#xE000-#xFFFD] | [#x10000-#x10FFFF]  /* any Unicode character, excluding the surrogate blocks, FFFE, and FFFF. */

so if we want to maintain compatibility with XML 1.0 transport, we might want to keep that restriction so we don't mess up existing clients.
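For reference, that XML 1.0 `Char` production can be written as a simple code-point predicate. This is an illustrative TypeScript sketch only; the function name is not taken from this PR or any implementation:

```typescript
// Sketch of the XML 1.0 "Char" production as a code-point predicate.
// Illustrative only; not part of this proposal.
function isXml10Char(codePoint: number): boolean {
  return (
    codePoint === 0x9 ||
    codePoint === 0xa ||
    codePoint === 0xd ||
    (codePoint >= 0x20 && codePoint <= 0xd7ff) ||
    (codePoint >= 0xe000 && codePoint <= 0xfffd) ||
    (codePoint >= 0x10000 && codePoint <= 0x10ffff)
  );
}
```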


Also note that allowing U+0000 (NUL) will be interesting for a number of implementations as well as security checks, since C string APIs will assume that NUL is the end of a string.

Collaborator Author


Great reference, thanks for pointing that out. I had a similar concern about allowing U+0000 (see my 2nd comment on this issue), so I'm happy to see that concern raised elsewhere.


NULL handling is a hardcore thing :-)
In my opinion, it would be best to forbid the unescaped NUL character in source text, and to state explicitly that a tokenizer must reject it as an error anywhere in the source.
Every reliable implementation should include a test case for this.
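A minimal sketch of the kind of tokenizer check (and testable behaviour) being suggested, assuming a TypeScript lexer; the names here are hypothetical:

```typescript
// Sketch: reject an unescaped U+0000 (NUL) anywhere in the source text.
// This reflects the policy proposed above, not the spec text itself.
function assertNoNulCharacter(body: string): void {
  const index = body.indexOf("\u0000");
  if (index !== -1) {
    throw new SyntaxError(
      `Syntax Error: invalid character U+0000 at position ${index}`
    );
  }
}

// A corresponding test case might assert that lexing "{ field\u0000 }" throws.
```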



## Ignored Tokens
60 changes: 42 additions & 18 deletions spec/Section 2 -- Language.md
@@ -13,28 +13,44 @@ double-colon `::`).

## Source Text

SourceCharacter :: /[\u0009\u000A\u000D\u0020-\uFFFF]/
SourceCharacter :: "Any Unicode code point"

GraphQL documents are expressed as a sequence of
[Unicode](http://unicode.org/standard/standard.html) characters. However, with
few exceptions, most of GraphQL is expressed only in the original non-control
ASCII range so as to be as widely compatible with as many existing tools,
languages, and serialization formats as possible and avoid display issues in
text editors and source control.
[Unicode](http://unicode.org/standard/standard.html) code points (referred to in
this specification as characters). All Unicode code point values from U+0000 to
U+10FFFF, including surrogate code points, may appear in this sequence where
allowed by the grammatical rules below.
Contributor

@chris-morgan Oct 29, 2016


“including surrogate code points”: I’m not sure what you mean by this. Do you mean that you will allow things like U+DEAD? This is problematic: UTF-8 does not allow surrogates, and a language that works with UTF-8 strings will then break. Rust, for example. (And you can’t express the document in legal UTF-8 then, either.)

Collaborator Author


The hope is to be as non-restrictive as possible while remaining compatible with as many Unicode encodings as possible. There's a note below that the encoding should be irrelevant.

Any given Unicode encoding can represent some form of a stream of units between U+0000 and U+10FFFF. The fact that UTF-8 cannot use UTF-16 surrogate pairs, or that UTF-8 can encode invalid code points that cannot be encoded in UTF-16, should be irrelevant.

Explicitly, we don't want to make any promises that the GraphQL language will perform specific Unicode operations that would be an undesirable burden for implementors who need to use existing languages with existing Unicode quirks. Hence the clauses above and below that GraphQL won't combine surrogate pairs for you nor assemble combining sequences before tokenizing.

If there's a better way to word this that's more clear, I'm definitely open to suggestions.

Contributor


The problem is that the encoding is relevant; if you allow the document to include the codepoint U+DEAD, it cannot be expressed in UTF-8. Or UTF-16, when I think about it properly; 0xDEAD does not stand alone; it’s half of the nasty mess that is encoding astral plane characters. U+DEAD cannot be expressed in UTF-16. (That’s why that whole area is blocked off.)

I would strongly advise that surrogates (codepoints like U+DEAD) be disallowed. This doesn’t mean you can’t express things like U+10000 in UTF-16—the surrogate pair there is an aspect of the encoding, and so there is no surrogate in the actual decoded string. There may be a \xDE\xAD there in this hypothetical UTF-16BE, but there’s not a U+DEAD there.


I would be explicit that this standard reflects the contents of the source document after it's been decoded from its binary encoding (e.g. UTF-8, UTF-16, etc.) to Unicode.

As such, after decoding, there are a number of code points which just cannot appear, including surrogates.

You shouldn't even include that language. I would just remove it.
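To make the suggestion concrete, here is a hedged sketch of a post-decoding check that rejects lone surrogate code points (TypeScript; illustrative only, not part of the PR):

```typescript
// Sketch: after decoding, a well-formed document contains no surrogate
// code points (U+D800–U+DFFF). Iterating a JS string with `for...of`
// yields astral code points for valid surrogate pairs, and yields a lone
// surrogate code point only when a pair is broken or unpaired.
function findLoneSurrogate(source: string): number | null {
  for (const ch of source) {
    const cp = ch.codePointAt(0)!;
    if (cp >= 0xd800 && cp <= 0xdfff) {
      return cp; // unpaired surrogate: cannot be encoded in UTF-8
    }
  }
  return null;
}
```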


A [combining character sequence](http://unicode.org/faq/char_combmark.html) is
treated as a sequence of individual Unicode code points and a sequence of
individual {SourceCharacter}, even though they may appear to a user as a
single character.
Contributor


And this introduces the potential to seriously mess with some syntax highlighters by combining characters with quotation marks and things like that. Ah well. Nothing new here, that’s how everything treats such things.

BTW, is there supposed to be a paragraph break before this sentence?

Collaborator Author


Good suggestion to add one


### Unicode
However, with the exceptions of {StringValue} and {Comment}, most of GraphQL is


I don't really see what the point of this paragraph is. Either GraphQL source documents require non-ASCII values to be escaped, or they don't. I would remove it.

expressed only in the original non-control ASCII range so as to be as widely
compatible with as many existing tools, languages, and serialization formats as
possible and avoid display issues in text editors and source control.

Note: The encoding used to represent a GraphQL document source is irrelevant to
this specification. A document is not required to be stored or transmitted in an
encoding which can represent every Unicode code point. Instead, given any
encoding format, and the range of code points which it can encode, GraphQL
documents may consist of any of those code points.

UnicodeBOM :: "Byte Order Mark (U+FEFF)"

Non-ASCII Unicode characters may freely appear within {StringValue} and
{Comment} portions of GraphQL.
### Byte Order Mark

UnicodeBOM :: "Byte Order Mark (U+FEFF)"

@bhamiltoncx Oct 31, 2016


Like surrogates, this code point is irrelevant after decoding the encoded document to Unicode. It won't appear.

Collaborator Author


I'm borrowing this idea from the ECMAScript spec, where the rationale was based on reading crappy files resulting from concatenating UTF-16 files, and the parser breaking. This code point is irrelevant after decoding if it's the first code point in the sequence, but IIUC Unicode doesn't do anything about a BOM found within a sequence?


It's up to the decoder to properly decode files before passing them to GraphQL, including decoding before concatenating.


The "Byte Order Mark" is a special Unicode character which
may appear at the beginning of a file containing Unicode which programs may use
to determine the fact that the text stream is Unicode, what endianness the text
stream is in, and which of several Unicode encodings to interpret.

GraphQL ignores this character anywhere ignored tokens may occur, regardless of
if it appears at the beginning of a GraphQL document, as it may appear within a
document due to file concatenation.
Contributor


nit: if you're going to keep this sentence, suggest "it may appear elsewhere within" rather than "it may appear within".


Contributor


What about in the middle of a string or a token?

Collaborator Author


I'll clarify for those cases
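For illustration, a sketch of how a lexer could treat U+FEFF like other ignored characters wherever ignored tokens may occur (TypeScript; the helper name is hypothetical and comments/commas are left out for brevity):

```typescript
// Sketch: skip ignored characters, treating U+FEFF (BOM) like white space.
// Comment and comma handling are omitted to keep the example short.
function skipIgnored(body: string, position: number): number {
  while (position < body.length) {
    const code = body.charCodeAt(position);
    if (
      code === 0xfeff || // byte order mark
      code === 0x0020 || // space
      code === 0x0009 || // horizontal tab
      code === 0x000a || // line feed
      code === 0x000d    // carriage return
    ) {
      position += 1; // all of these are single UTF-16 code units
    } else {
      break;
    }
  }
  return position;
}
```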


### White Space

@@ -65,7 +81,11 @@ text, any amount may appear before or after any other token and have no
significance to the semantic meaning of a GraphQL query document. Line
terminators are not found within any other token.

Note: Any error reporting which provide the line number in the source of the
Note: GraphQL intentionally does not consider Unicode line or paragraph
separators outside the ASCII range as line terminators, avoiding
misinterpretation by text editors and source control tools.

Note: Any error reporting which provides the line number in the source of the
offending syntax should use the preceding amount of {LineTerminator} to produce
the line number.
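A sketch of that line-number calculation, counting only the ASCII {LineTerminator} forms (LF, CR, CRLF) and deliberately ignoring U+2028/U+2029 (TypeScript; illustrative only):

```typescript
// Sketch: compute a 1-based line number from the LineTerminators that
// precede `offset`. CRLF is counted as a single terminator.
function lineNumberAt(body: string, offset: number): number {
  let line = 1;
  for (let i = 0; i < offset && i < body.length; i++) {
    const code = body.charCodeAt(i);
    if (code === 0x000a) {
      line += 1;
    } else if (code === 0x000d) {
      line += 1;
      if (body.charCodeAt(i + 1) === 0x000a) {
        i += 1; // skip the LF of a CRLF pair
      }
    }
  }
  return line;
}
```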

@@ -83,10 +103,14 @@ A comment can contain any Unicode code point except {LineTerminator} so a
comment always consists of all code points starting with the {`#`} character up
to but not including the line terminator.

Comments behave like white space and may appear after any token, or before a
Comments behave like white space and may appear after any token, or before any
line terminator, and have no significance to the semantic meaning of a GraphQL
query document.

Any Unicode code point may appear within a Comment. Comments do not include
escape sequences, so the character sequence `\n` or `\u000A` must not be
interpreted as the end of a Comment.
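For illustration, a sketch of comment lexing under that rule: the comment ends only at an actual {LineTerminator}, so a literal backslash-n inside it is just two ordinary characters (TypeScript; the helper name is hypothetical):

```typescript
// Sketch: read a comment from "#" up to, but not including, the next real
// line terminator. The two-character sequence "\n" has no special meaning.
// Returns the comment token, including the leading "#".
function readCommentText(body: string, start: number): string {
  let end = start + 1; // skip past the leading "#"
  while (end < body.length) {
    const code = body.charCodeAt(end);
    if (code === 0x000a || code === 0x000d) break; // actual LineTerminator
    end += 1;
  }
  return body.slice(start, end);
}
```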


### Insignificant Commas

@@ -704,13 +728,13 @@ EscapedUnicode :: /[0-9A-Fa-f]{4}/

EscapedCharacter :: one of `"` \ `/` b f n r t

Strings are sequences of characters wrapped in double-quotes (`"`). (ex.
`"Hello World"`). White space and other otherwise-ignored characters are
significant within a string value.
Strings are sequences of zero or more source characters wrapped in double-quotes
(`"`). (ex. `"Hello World"`).

Note: Unicode characters are allowed within String value literals, however
GraphQL source must not contain some ASCII control characters so escape
sequences must be used to represent these characters.
Any Unicode code point other than those explicitly excluded may appear literally
within a String value. White-space and other characters otherwise ignored
outside of string values are significant and included. Unicode code points may
also be represented with escape sequences.
Contributor


… where escape sequences are? (Looks like \u EscapedUnicode still isn’t explained precisely or clearly?)

Collaborator Author


They're explained in the Semantics section right below here. I'm open to suggestions for improvements to that if you feel there's something specific that is not precise or clear enough

Contributor


OK, I’m happy with that part now.


I would recommend a new syntax to support SMP escapes. You currently support only code points up to U+FFFF via \u EscapedUnicode, but modern languages support SMP code points via a \u{Hex} syntax which supports between 4 and 6 hexadecimal digits.

Contributor


Correction: between one and six hexadecimal digits (\u{a} is quite legal).
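As a concrete sketch of this proposed (not yet adopted) `\u{...}` form with one to six hexadecimal digits (TypeScript; names are illustrative):

```typescript
// Sketch: parse a braced Unicode escape such as \u{a} or \u{1F600}.
// `start` points at the "{" immediately following "\u".
function parseBracedUnicodeEscape(text: string, start: number): number {
  const close = text.indexOf("}", start + 1);
  const digits = close === -1 ? "" : text.slice(start + 1, close);
  if (digits.length < 1 || digits.length > 6 || !/^[0-9A-Fa-f]+$/.test(digits)) {
    throw new SyntaxError("Invalid \\u{...} escape sequence");
  }
  const codePoint = parseInt(digits, 16);
  if (codePoint > 0x10ffff) {
    throw new SyntaxError("Escaped code point is outside the Unicode range");
  }
  return codePoint;
}
```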


**Semantics**
