[RFC] Support full Unicode character range #231
Conversation
Drawbacks and concerns I'd like feedback on from folks with more experience dealing with Unicode corner cases:

- By encouraging use of non-ASCII characters in strings, we will inevitably see UTF-8 files that include supplemental plane and surrogate pair points. Will this break any common tools or create an unnecessary burden?
- This removes the restriction excluding control characters from the source. This is in the name of simplicity, however it may have real drawbacks for interaction with other tools. For example, could sources that contain U+0000 be interpreted as the end of a file? If this remains a real concern, it might be worth adding back that restriction just as a point of practical convenience.
Check out a rendering of this patch applied here: https://ns-gxioxiaqyc.now.sh
@@ -1,6 +1,6 @@
# B. Appendix: Grammar Summary

- SourceCharacter :: /[\u0009\u000A\u000D\u0020-\uFFFF]/
+ SourceCharacter :: "Any Unicode code point from U+0000 to U+10FFFF"
Viz. any Unicode code point.
A [combining character sequence](http://unicode.org/faq/char_combmark.html) is
treated as a sequence of individual Unicode code points and a sequence of
individual {SourceCharacter}, even though they may appear to a user as a
single character.
And this introduces the potential to seriously mess with some syntax highlighters by combining characters with quotation marks and things like that. Ah well. Nothing new here, that’s how everything treats such things.
BTW, is there supposed to be a paragraph break before this sentence?
Good suggestion to add one
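For illustration (a minimal TypeScript sketch, not part of the patch): iterating a combining character sequence yields each code point separately, which is exactly the "sequence of individual {SourceCharacter}" behavior described above.

```typescript
// "é" built from "e" (U+0065) followed by COMBINING ACUTE ACCENT (U+0301).
const combined = "e\u0301";

// The string iterator walks code points, so the combining sequence is seen
// as two separate SourceCharacters, not one user-perceived character.
const codePoints = [...combined].map(
  (c) => "U+" + c.codePointAt(0)!.toString(16).toUpperCase().padStart(4, "0")
);

console.log(codePoints); // [ 'U+0065', 'U+0301' ]
```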
[Unicode](http://unicode.org/standard/standard.html) code points (referred to in
this specification as characters). All Unicode code point values from U+0000 to
U+10FFFF, including surrogate code points, may appear in this sequence where
allowed by the grammatical rules below.
“including surrogate code points”: I’m not sure what you mean by this. Do you mean that you will allow things like U+DEAD? This is problematic: UTF-8 does not allow surrogates, and a language that works with UTF-8 strings will then break. Rust, for example. (And you can’t express the document in legal UTF-8 then, either.)
The hope is to both be as non-restrictive as possible while being compatible with as many Unicode encodings as possible. There's a note below that the encoding should be irrelevant.
Any given Unicode encoding can represent some form of a stream of units between U+0000 and U+10FFFF. The fact that UTF-8 cannot use UTF-16 surrogate pairs, or that UTF-8 can encode invalid points that cannot be encoded in UTF-16, should be irrelevant.
Explicitly, we don't want to make any promises that the GraphQL language will do some specific Unicode operations that would be an undesirable burden for implementors who need to use existing languages with existing Unicode quirks. Hence the clauses above and below that GraphQL won't combine surrogate pairs for you nor assemble combining sequences before tokenizing.
If there's a better way to word this that's more clear, I'm definitely open to suggestions.
The problem is that the encoding is relevant; if you allow the document to include the codepoint U+DEAD, it cannot be expressed in UTF-8. Or UTF-16, when I think about it properly; 0xDEAD does not stand alone; it’s half of the nasty mess that is encoding astral plane characters. U+DEAD cannot be expressed in UTF-16. (That’s why that whole area is blocked off.)
I would strongly advise that surrogates (codepoints like U+DEAD) be disallowed. This doesn’t mean you can’t express things like U+10000 in UTF-16—the surrogate pair there is an aspect of the encoding, and so there is no surrogate in the actual decoded string. There may be a `\xDE\xAD` there in this hypothetical UTF-16BE, but there’s not a U+DEAD there.
I would be explicit that this standard reflects the contents of the source document after it's been decoded from its binary encoding (e.g. UTF-8, UTF-16, etc.) to Unicode.
As such, after decoding, there are a number of code points which just cannot appear, including surrogates.
You shouldn't even include that language. I would just remove it.
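A small TypeScript check makes the objection concrete (a sketch, assuming a WHATWG TextEncoder/TextDecoder environment such as Node or a browser): a lone surrogate cannot survive a round trip through UTF-8, while an astral code point like U+10000 can, because its surrogate pair exists only at the UTF-16 encoding layer.

```typescript
const encoder = new TextEncoder(); // always UTF-8
const decoder = new TextDecoder("utf-8");

// U+10000 is stored as a surrogate pair in a JS string (UTF-16), but the
// pair is an encoding detail: the decoded content round-trips intact.
const astral = "\u{10000}";
console.log(decoder.decode(encoder.encode(astral)) === astral); // true

// A lone surrogate such as U+DEAD is not a valid Unicode scalar value;
// TextEncoder must replace it with U+FFFD, so the round trip fails.
const loneSurrogate = "\uDEAD";
console.log(decoder.decode(encoder.encode(loneSurrogate))); // "�" (U+FFFD)
```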
The "Byte Order Mark" is a special Unicode character which | ||
may appear at the beginning of a file containing Unicode which programs may use | ||
to determine the fact that the text stream is Unicode, what endianness the text | ||
stream is in, and which of several Unicode encodings to interpret. | ||
|
||
GraphQL ignores this character regardless of if it appears at the beginning of a | ||
source, as it may appear within due to concatenation of sources. | ||
|
What about in the middle of a string or a token?
I'll clarify for those cases
line terminator, and have no significance to the semantic meaning of a GraphQL
query document.

Any Unicode code point may appear within a Comment. Comments do not include
escape sequences, so the character sequence `\n` or `\000A` must not be
interpretted as the end of a Comment.
s/interpretted/interpreted/
Any Unicode code point other than those explicitly excluded may appear literally
within a String value. White-space and other characters otherwise ignored
outside of string values are significant and included. Unicode code points may
also be represented with escape sequences.
… where escape sequences are? (Looks like `\u EscapedUnicode` still isn’t explained precisely or clearly?)
They're explained in the Semantics section right below here. I'm open to suggestions for improvements to that if you feel there's something specific that is not precise or clear enough
OK, I’m happy with that part now.
I would recommend a new syntax to support SMP escapes. You currently support only code points up to U+FFFF via `\u EscapedUnicode`, but modern languages support SMP code points via a `\u{Hex}` syntax which supports between 4 and 6 hexadecimal digits.
Correction: between one and six hexadecimal digits (`\u{a}` is quite legal).
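A rough sketch of a lexer rule accepting both forms (TypeScript; the helper name is hypothetical, and the `\u{Hex}` form is the proposed syntax with one to six hex digits):

```typescript
// Matches \uXXXX (exactly four hex digits) or \u{X...} (one to six digits).
const UNICODE_ESCAPE = /\\u(?:([0-9A-Fa-f]{4})|\{([0-9A-Fa-f]{1,6})\})/;

function parseUnicodeEscape(escape: string): string {
  const match = UNICODE_ESCAPE.exec(escape);
  if (!match) throw new SyntaxError(`invalid escape: ${escape}`);
  const hex = match[1] ?? match[2]!;
  const codePoint = parseInt(hex, 16);
  // Six hex digits can exceed the Unicode range, so check explicitly.
  if (codePoint > 0x10ffff) {
    throw new SyntaxError(`code point out of range: ${escape}`);
  }
  return String.fromCodePoint(codePoint);
}

console.log(parseUnicodeEscape("\\u0041"));    // "A"
console.log(parseUnicodeEscape("\\u{1F001}")); // "🀁"
console.log(parseUnicodeEscape("\\u{a}"));     // "\n"
```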
Updated to incorporate feedback
At a high level, it's an interesting choice to allow unescaped Unicode only in string values.
Personally, I would just extend the escape syntax to allow `\u{1F001}` style escapes for SMP code points and leave it at that.
At a high level, I feel like either everything should support Unicode (including identifiers), or the syntax should remain US-ASCII only and require escapes for Unicode literals in strings.
@@ -1,6 +1,6 @@
# B. Appendix: Grammar Summary

- SourceCharacter :: /[\u0009\u000A\u000D\u0020-\uFFFF]/
+ SourceCharacter :: "Any Unicode code point"
One thing to note is that XML 1.0 forbids many of the same C0 control characters the previous definition did:
https://www.w3.org/TR/xml/#charsets
Char ::= #x9 | #xA | #xD | [#x20-#xD7FF] | [#xE000-#xFFFD] | [#x10000-#x10FFFF] /* any Unicode character, excluding the surrogate blocks, FFFE, and FFFF. */
so if we want to maintain compatibility with XML 1.0 transport, we might want to keep that restriction so we don't mess up existing clients.
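If XML 1.0 transport compatibility were a goal, a tokenizer could enforce that Char production with a predicate along these lines (a TypeScript sketch mirroring the production quoted above, not from any implementation):

```typescript
// True when a code point satisfies XML 1.0's Char production:
// #x9 | #xA | #xD | [#x20-#xD7FF] | [#xE000-#xFFFD] | [#x10000-#x10FFFF]
function isXml10Char(cp: number): boolean {
  return (
    cp === 0x9 ||
    cp === 0xa ||
    cp === 0xd ||
    (cp >= 0x20 && cp <= 0xd7ff) ||
    (cp >= 0xe000 && cp <= 0xfffd) ||
    (cp >= 0x10000 && cp <= 0x10ffff)
  );
}

console.log(isXml10Char(0x0000));  // false: C0 controls (except 9, A, D) excluded
console.log(isXml10Char(0xdead));  // false: surrogate block excluded
console.log(isXml10Char(0x1f001)); // true: astral plane characters are fine
```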
Also note that allowing U+0000 (NUL) will be interesting for a number of implementations as well as security checks, since C string APIs will assume that NUL is the end of a string.
Great reference, thanks for pointing that out. I had a similar concern about allowing U+0000 (see my 2nd comment on this issue), so I'm happy to see that concern raised elsewhere.
NULL handling is a hardcore thing :-)
In my opinion, it would be best to forbid the unescaped NULL character in the source, and to explicitly declare that the tokenizer must reject it as an error anywhere in the source.
Every reliable implementation should have a test case for this.
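A minimal sketch of the suggested guard (TypeScript; the helper is hypothetical): reject unescaped U+0000 anywhere in the source before tokenizing, with the kind of test case described above.

```typescript
// Hypothetical pre-scan guard: reject unescaped U+0000 anywhere in the
// source so downstream C-string-based tooling can never mis-handle it.
function assertNoNul(source: string): void {
  const index = source.indexOf("\u0000");
  if (index !== -1) {
    throw new SyntaxError(`illegal character U+0000 at offset ${index}`);
  }
}

// The suggested test case: a NUL embedded in an otherwise valid document.
try {
  assertNoNul('{ field(arg: "a\u0000b") }');
} catch (e) {
  console.log((e as SyntaxError).message); // illegal character U+0000 at offset 15
}
```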
### Unicode

However, with the exceptions of {StringValue} and {Comment}, most of GraphQL is
I don't really see what the point of this paragraph is. Either GraphQL source documents require non-ASCII values to be escaped, or they don't. I would remove it.
{Comment} portions of GraphQL.

### Byte Order Mark

UnicodeBOM :: "Byte Order Mark (U+FEFF)"
Like surrogates, this code point is irrelevant after decoding the encoded document to Unicode. It won't appear.
I'm borrowing this idea from the ECMAScript spec, where the rationale was based on reading crappy files resulting from concatenating UTF-16 files, and the parser breaking. This code point is irrelevant after decoding if it's the first point in the sequence, but IIUC Unicode doesn't do anything about a BOM found within a sequence?
It's up to the decoder to properly decode files before passing them to GraphQL, including decoding before concatenating.
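To illustrate the concatenation scenario (a TypeScript sketch assuming WHATWG TextDecoder semantics): decoding each file strips its own leading BOM, but concatenating the raw bytes first leaves a BOM mid-stream that survives decoding, which is the case the quoted wording defends against.

```typescript
const decode = (bytes: Uint8Array) => new TextDecoder("utf-8").decode(bytes);

// Two UTF-8 files, each starting with the encoded BOM (EF BB BF).
const fileA = new Uint8Array([0xef, 0xbb, 0xbf, 0x7b, 0x20]); // "{ "
const fileB = new Uint8Array([0xef, 0xbb, 0xbf, 0x7d]);       // "}"

// Decoding each file separately removes its leading BOM...
console.log(decode(fileA) + decode(fileB)); // "{ }" with no BOM

// ...but concatenating raw bytes first leaves a BOM mid-stream, which
// survives decoding; hence the lexer treats U+FEFF as an ignored token.
const merged = decode(new Uint8Array([...fileA, ...fileB]));
console.log(merged.includes("\uFEFF")); // true: BOM at offset 2
```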
The "Byte Order Mark" is a special Unicode character which | ||
may appear at the beginning of a file containing Unicode which programs may use | ||
to determine the fact that the text stream is Unicode, what endianness the text | ||
stream is in, and which of several Unicode encodings to interpret. | ||
|
||
GraphQL ignores this character anywhere ignored tokens may occur, regardless of | ||
if it appears at the beginning of a GraphQL document, as it may appear within a | ||
document due to file concatenation. |
nit: if you're going to keep this sentence, suggest "it may appear elsewhere within" rather than "it may appear within".
This proposes support for the entire Unicode character range in a source document, rather than explicitly excluding certain characters.
Because this doesn't add any new characters to the ignored sequences, in practice this allows for adding more characters to comments (low value) and strings (high value).
This allows strings to include literal Unicode code points as a way to represent supplemental plane characters. In fact, it almost allows ignoring escape sequences altogether, as all characters can be represented literally.
Suggested by #214
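As a concrete illustration of the literal-versus-escape point (a TypeScript sketch using JSON string syntax as a stand-in, since GraphQL's `\uXXXX` escapes follow the same four-hex-digit form): an SMP character currently requires a surrogate-pair escape, but under this proposal it may simply appear literally.

```typescript
// U+1F600 written as a UTF-16 surrogate-pair escape versus written literally.
const escaped = '"\\uD83D\\uDE00"'; // source text: "\uD83D\uDE00"
const literal = '"😀"';             // source text: "😀"

// Both spell the same decoded string value once escapes are resolved.
console.log(JSON.parse(escaped) === JSON.parse(literal)); // true
```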