-
Notifications
You must be signed in to change notification settings - Fork 1.1k
[RFC] Support full Unicode character range #231
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Changes from all commits
File filter
Filter by extension
Conversations
Jump to
Diff view
Diff view
There are no files selected for viewing
Original file line number | Diff line number | Diff line change |
---|---|---|
|
@@ -13,28 +13,44 @@ double-colon `::`). | |
|
||
## Source Text | ||
|
||
SourceCharacter :: /[\u0009\u000A\u000D\u0020-\uFFFF]/ | ||
SourceCharacter :: "Any Unicode code point" | ||
|
||
GraphQL documents are expressed as a sequence of | ||
[Unicode](http://unicode.org/standard/standard.html) characters. However, with | ||
few exceptions, most of GraphQL is expressed only in the original non-control | ||
ASCII range so as to be as widely compatible with as many existing tools, | ||
languages, and serialization formats as possible and avoid display issues in | ||
text editors and source control. | ||
[Unicode](http://unicode.org/standard/standard.html) code points (referred to in | ||
this specification as characters). All Unicode code point values from U+0000 to | ||
U+10FFFF, including surrogate code points, may appear in this sequence where | ||
allowed by the grammatical rules below. | ||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. “including surrogate code points”: I’m not sure what you mean by this. Do you mean that you will allow things like U+DEAD? This is problematic: UTF-8 does not allow surrogates, and a language that works with UTF-8 strings will then break. Rust, for example. (And you can’t express the document in legal UTF-8 then, either.) There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. The hope is to both be as non-restrictive as possible while being compatible with as many unicode encodings as possible. There's a note below that the encoding should be irrelevant. Any given unicode encoding can represent some form of a stream of units between U+0000 and U+10FFFF. The fact that UTF-8 cannot use UTF-16 surrogate pairs, or that UTF-8 can encode invalid points that cannot be encoded in UTF-16 should be irrelevant. Explicitly, we don't want to make any promises that the GraphQL language will do some specific unicode-specific operations that would be an undesirable burden for implementors who need to use existing languages with existing Unicode quirks. Hence the clauses above and below that GraphQL won't combine surrogate pairs for you nor assemble combining sequences before tokenizing. If there's a better way to word this that's more clear, I'm definitely open to suggestions. There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. The problem is that the encoding is relevant; if you allow the document to include the codepoint U+DEAD, it cannot be expressed in UTF-8. Or UTF-16, when I think about it properly; 0xDEAD does not stand alone; it’s half of the nasty mess that is encoding astral plane characters. U+DEAD cannot be expressed in UTF-16. (That’s why that whole area is blocked off.) I would strongly advise that surrogates (codepoints like U+DEAD) be disallowed. This doesn’t mean you can’t express things like U+10000 in UTF-16—the surrogate pair there is an aspect of the encoding, and so there is no surrogate in the actual decoded string. There may be a There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. I would be explicit that this standard reflects the contents of the source document after it's been decoded from its binary encoding (e.g. UTF-8, UTF-16, etc.) to Unicode. As such, after decoding, there are a number of code points which just cannot appear, including surrogates. You shouldn't even include that language. I would just remove it. |
||
|
||
A [combining character sequence](http://unicode.org/faq/char_combmark.html) is | ||
treated as a sequence of individual Unicode code points and a sequence of | ||
individual {SourceCharacter}, even though they may appear to a user as a | ||
single character. | ||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. And this introduces the potential to seriously mess with some syntax highlighters by combining characters with quotation marks and things like that. Ah well. Nothing new here, that’s how everything treats such things. BTW, is there supposed to be a paragraph break before this sentence? There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Good suggestion to add one |
||
|
||
### Unicode | ||
However, with the exceptions of {StringValue} and {Comment}, most of GraphQL is | ||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. I don't really see what the point of this paragraph is. Either GraphQL source documents require non-ASCII values to be escaped, or they don't. I would remove it. |
||
expressed only in the original non-control ASCII range so as to be as widely | ||
compatible with as many existing tools, languages, and serialization formats as | ||
possible and avoid display issues in text editors and source control. | ||
|
||
Note: The encoding used to represent a GraphQL document source is irrelevant to | ||
this specification. A document is not required to be stored or transmitted in an | ||
encoding which can represent every Unicode code point. Instead, given any | ||
encoding format, and the range of code points which it can encode, GraphQL | ||
documents may consist of any of those code points. | ||
|
||
UnicodeBOM :: "Byte Order Mark (U+FEFF)" | ||
|
||
Non-ASCII Unicode characters may freely appear within {StringValue} and | ||
{Comment} portions of GraphQL. | ||
### Byte Order Mark | ||
|
||
UnicodeBOM :: "Byte Order Mark (U+FEFF)" | ||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Like surrogates, this code point is irrelevant after decoding the encoded document to Unicode. It won't appear. There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. I'm borrowing this idea from ECMAScript spec where the rationale was based on reading crappy files resulting from concatenating utf-16 files, and the parser breaking. This code point is irrelevant after decoding if it's the first point in the sequence, but IIUC Unicode doesn't do anything about BOM found within a sequence? There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. It's up to the decoder to properly decode files before passing them to GraphQL, including decoding before concatenating. |
||
|
||
The "Byte Order Mark" is a special Unicode character which | ||
may appear at the beginning of a file containing Unicode which programs may use | ||
to determine the fact that the text stream is Unicode, what endianness the text | ||
stream is in, and which of several Unicode encodings to interpret. | ||
|
||
GraphQL ignores this character anywhere ignored tokens may occur, regardless of | ||
if it appears at the beginning of a GraphQL document, as it may appear within a | ||
document due to file concatenation. | ||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. nit: if you're going to keep this sentence, suggest "it may appear elsewhere within" rather than "it may appear within". |
||
|
||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. What about in the middle of a string or a token? There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. I'll clarify for those cases |
||
|
||
### White Space | ||
|
||
|
@@ -65,7 +81,11 @@ text, any amount may appear before or after any other token and have no | |
significance to the semantic meaning of a GraphQL query document. Line | ||
terminators are not found within any other token. | ||
|
||
Note: Any error reporting which provide the line number in the source of the | ||
Note: GraphQL intentionally does not consider Unicode line or paragraph | ||
separators outside the ASCII range as line terminators, avoiding | ||
misinterpretation by text editors and source control tools. | ||
|
||
Note: Any error reporting which provides the line number in the source of the | ||
offending syntax should use the preceding amount of {LineTerminator} to produce | ||
the line number. | ||
|
||
|
@@ -83,10 +103,14 @@ A comment can contain any Unicode code point except {LineTerminator} so a | |
comment always consists of all code points starting with the {`#`} character up | ||
to but not including the line terminator. | ||
|
||
Comments behave like white space and may appear after any token, or before a | ||
Comments behave like white space and may appear after any token, or before any | ||
line terminator, and have no significance to the semantic meaning of a GraphQL | ||
query document. | ||
|
||
Any Unicode code point may appear within a Comment. Comments do not include | ||
escape sequences, so the character sequence `\n` or `\u000A` must not be | ||
interpreted as the end of a Comment. | ||
|
||
|
||
### Insignificant Commas | ||
|
||
|
@@ -704,13 +728,13 @@ EscapedUnicode :: /[0-9A-Fa-f]{4}/ | |
|
||
EscapedCharacter :: one of `"` \ `/` b f n r t | ||
|
||
Strings are sequences of characters wrapped in double-quotes (`"`). (ex. | ||
`"Hello World"`). White space and other otherwise-ignored characters are | ||
significant within a string value. | ||
Strings are sequences of zero or more source characters wrapped in double-quotes | ||
(`"`). (ex. `"Hello World"`). | ||
|
||
Note: Unicode characters are allowed within String value literals, however | ||
GraphQL source must not contain some ASCII control characters so escape | ||
sequences must be used to represent these characters. | ||
Any Unicode code point other than those explicitly excluded may appear literally | ||
within a String value. White-space and other characters otherwise ignored | ||
outside of string values are significant and included. Unicode code points may | ||
also be represented with escape sequences. | ||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. … where escape sequences are? (Looks like There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. They're explained in the Semantics section right below here. I'm open to suggestions for improvements to that if you feel there's something specific that is not precise or clear enough There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. OK, I’m happy with that part now. There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. I would recommend a new syntax to support SMP escapes. You currently support only code points up to U+FFFF via There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Correction: between one and six hexadecimal digits ( |
||
|
||
**Semantics** | ||
|
||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
One thing to note is that XML 1.0 forbids much of the same C0 control characters the previous definition did:
https://www.w3.org/TR/xml/#charsets
so if we want to maintain compatibility with XML 1.0 transport, we might want to keep that restriction so we don't mess up existing clients.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Also note that allowing
U+0000
(NUL) will be interesting for a number of implementations as well as security checks, since C string APIs will assume that NUL is the end of a string.There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Great reference, thanks for pointing that out. I had similar concern about allowing U+0 (see my 2nd comment on this issue) so happy to see that concern raised elsewhere.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
NULL
handling is hardcore thing :-)By my opinion it shall be best to forbid unescaped NULL character in source AND should be explicitly declared that it must be rejected by tokenizer as an error anywhere in source.
In each reliable implementations should be test case for this case.