[RFC] Support full Unicode character range #231
Conversation
Drawbacks and concerns I'd like feedback on from folks with more experience dealing with Unicode corner cases:

- By encouraging use of non-ASCII characters in strings, we will inevitably see UTF-8 files that include supplemental plane and surrogate pair points. Will this break any common tools or create an unnecessary burden?
- This removes the restriction excluding control characters from the source. This is in the name of simplicity, however it may have real drawbacks for interaction with other tools. For example, could sources that contain U+0000 be interpreted as the end of a file? If this remains a real concern, it might be worth adding back that restriction just as a point of practical convenience.
Check out a rendering of this patch applied here: https://ns-gxioxiaqyc.now.sh
@@ -1,6 +1,6 @@
# B. Appendix: Grammar Summary

- SourceCharacter :: /[\u0009\u000A\u000D\u0020-\uFFFF]/
+ SourceCharacter :: "Any Unicode code point from U+0000 to U+10FFFF"
Viz. any Unicode code point.
A [combining character sequence](http://unicode.org/faq/char_combmark.html) is
treated as a sequence of individual Unicode code points and a sequence of
individual {SourceCharacter}, even though they may appear to a user as a
single character.
And this introduces the potential to seriously mess with some syntax highlighters by combining characters with quotation marks and things like that. Ah well. Nothing new here, that’s how everything treats such things.
BTW, is there supposed to be a paragraph break before this sentence?
Good suggestion to add one
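For illustration (a minimal TypeScript sketch, not part of the patch): iterating a combining character sequence yields each code point separately, which is exactly the "sequence of individual {SourceCharacter}" behavior described above.

```typescript
// "é" built from "e" (U+0065) followed by COMBINING ACUTE ACCENT (U+0301).
const combined = "e\u0301";

// The string iterator walks code points, so the combining sequence is seen
// as two separate SourceCharacters, not one user-perceived character.
const codePoints = [...combined].map(
  (c) => "U+" + c.codePointAt(0)!.toString(16).toUpperCase().padStart(4, "0")
);

console.log(codePoints); // [ 'U+0065', 'U+0301' ]
```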
[Unicode](http://unicode.org/standard/standard.html) code points (referred to in
this specification as characters). All Unicode code point values from U+0000 to
U+10FFFF, including surrogate code points, may appear in this sequence where
allowed by the grammatical rules below.
“including surrogate code points”: I’m not sure what you mean by this. Do you mean that you will allow things like U+DEAD? This is problematic: UTF-8 does not allow surrogates, and a language that works with UTF-8 strings will then break. Rust, for example. (And you can’t express the document in legal UTF-8 then, either.)
The hope is to both be as non-restrictive as possible while being compatible with as many Unicode encodings as possible. There's a note below that the encoding should be irrelevant.
Any given Unicode encoding can represent some form of a stream of units between U+0000 and U+10FFFF. The fact that UTF-8 cannot use UTF-16 surrogate pairs, or that UTF-8 can encode invalid points that cannot be encoded in UTF-16, should be irrelevant.
Explicitly, we don't want to make any promises that the GraphQL language will do some specific Unicode operations that would be an undesirable burden for implementors who need to use existing languages with existing Unicode quirks. Hence the clauses above and below that GraphQL won't combine surrogate pairs for you nor assemble combining sequences before tokenizing.
If there's a better way to word this that's more clear, I'm definitely open to suggestions.
The problem is that the encoding is relevant; if you allow the document to include the codepoint U+DEAD, it cannot be expressed in UTF-8. Or UTF-16, when I think about it properly; 0xDEAD does not stand alone; it’s half of the nasty mess that is encoding astral plane characters. U+DEAD cannot be expressed in UTF-16. (That’s why that whole area is blocked off.)
I would strongly advise that surrogates (codepoints like U+DEAD) be disallowed. This doesn’t mean you can’t express things like U+10000 in UTF-16—the surrogate pair there is an aspect of the encoding, and so there is no surrogate in the actual decoded string. There may be a `\xDE\xAD` there in this hypothetical UTF-16BE, but there’s not a U+DEAD there.
I would be explicit that this standard reflects the contents of the source document after it's been decoded from its binary encoding (e.g. UTF-8, UTF-16, etc.) to Unicode.
As such, after decoding, there are a number of code points which just cannot appear, including surrogates.
You shouldn't even include that language. I would just remove it.
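A small TypeScript check makes the objection concrete (a sketch, assuming a WHATWG TextEncoder/TextDecoder environment such as Node or a browser): a lone surrogate cannot survive a round trip through UTF-8, while an astral code point like U+10000 can, because its surrogate pair exists only at the UTF-16 encoding layer.

```typescript
const encoder = new TextEncoder(); // always UTF-8
const decoder = new TextDecoder("utf-8");

// U+10000 is stored as a surrogate pair in a JS string (UTF-16), but the
// pair is an encoding detail: the decoded content round-trips intact.
const astral = "\u{10000}";
console.log(decoder.decode(encoder.encode(astral)) === astral); // true

// A lone surrogate such as U+DEAD is not a valid Unicode scalar value;
// TextEncoder must replace it with U+FFFD, so the round trip fails.
const loneSurrogate = "\uDEAD";
console.log(decoder.decode(encoder.encode(loneSurrogate))); // "�" (U+FFFD)
```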
The "Byte Order Mark" is a special Unicode character which | ||
may appear at the beginning of a file containing Unicode which programs may use | ||
to determine the fact that the text stream is Unicode, what endianness the text | ||
stream is in, and which of several Unicode encodings to interpret. | ||
|
||
GraphQL ignores this character regardless of if it appears at the beginning of a | ||
source, as it may appear within due to concatenation of sources. | ||
|
What about in the middle of a string or a token?
I'll clarify for those cases
line terminator, and have no significance to the semantic meaning of a GraphQL
query document.

Any Unicode code point may appear within a Comment. Comments do not include
escape sequences, so the character sequence `\n` or `\000A` must not be
interpretted as the end of a Comment.
s/interpretted/interpreted/
Any Unicode code point other than those explicitly excluded may appear literally
within a String value. White-space and other characters otherwise ignored
outside of string values are significant and included. Unicode code points may
also be represented with escape sequences.
… where escape sequences are? (Looks like `\u EscapedUnicode` still isn’t explained precisely or clearly?)
They're explained in the Semantics section right below here. I'm open to suggestions for improvements to that if you feel there's something specific that is not precise or clear enough
OK, I’m happy with that part now.
I would recommend a new syntax to support SMP escapes. You currently support only code points up to U+FFFF via `\u EscapedUnicode`, but modern languages support SMP code points via a `\u{Hex}` syntax which supports between 4 and 6 hexadecimal digits.
Correction: between one and six hexadecimal digits (`\u{a}` is quite legal).
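A rough sketch of a lexer rule accepting both forms (TypeScript; the helper name is hypothetical, and the `\u{Hex}` form is the proposed syntax with one to six hex digits):

```typescript
// Matches \uXXXX (exactly four hex digits) or \u{X...} (one to six digits).
const UNICODE_ESCAPE = /\\u(?:([0-9A-Fa-f]{4})|\{([0-9A-Fa-f]{1,6})\})/;

function parseUnicodeEscape(escape: string): string {
  const match = UNICODE_ESCAPE.exec(escape);
  if (!match) throw new SyntaxError(`invalid escape: ${escape}`);
  const hex = match[1] ?? match[2]!;
  const codePoint = parseInt(hex, 16);
  // Six hex digits can exceed the Unicode range, so check explicitly.
  if (codePoint > 0x10ffff) {
    throw new SyntaxError(`code point out of range: ${escape}`);
  }
  return String.fromCodePoint(codePoint);
}

console.log(parseUnicodeEscape("\\u0041"));    // "A"
console.log(parseUnicodeEscape("\\u{1F001}")); // "🀁"
console.log(parseUnicodeEscape("\\u{a}"));     // "\n"
```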
Updated to incorporate feedback
At a high level, it's an interesting choice to allow unescaped Unicode only in string values.
Personally, I would just extend the escape syntax to allow `\u{1F001}` style escapes for SMP code points and leave it at that.
At a high level, I feel like either everything should support Unicode (including identifiers), or the syntax should remain US-ASCII only and require escapes for Unicode literals in strings.
@@ -1,6 +1,6 @@
# B. Appendix: Grammar Summary

- SourceCharacter :: /[\u0009\u000A\u000D\u0020-\uFFFF]/
+ SourceCharacter :: "Any Unicode code point"
One thing to note is that XML 1.0 forbids many of the same C0 control characters the previous definition did:
https://www.w3.org/TR/xml/#charsets
Char ::= #x9 | #xA | #xD | [#x20-#xD7FF] | [#xE000-#xFFFD] | [#x10000-#x10FFFF] /* any Unicode character, excluding the surrogate blocks, FFFE, and FFFF. */
so if we want to maintain compatibility with XML 1.0 transport, we might want to keep that restriction so we don't mess up existing clients.
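If XML 1.0 transport compatibility were a goal, a tokenizer could enforce that Char production with a predicate along these lines (a TypeScript sketch mirroring the production quoted above, not from any implementation):

```typescript
// True when a code point satisfies XML 1.0's Char production:
// #x9 | #xA | #xD | [#x20-#xD7FF] | [#xE000-#xFFFD] | [#x10000-#x10FFFF]
function isXml10Char(cp: number): boolean {
  return (
    cp === 0x9 ||
    cp === 0xa ||
    cp === 0xd ||
    (cp >= 0x20 && cp <= 0xd7ff) ||
    (cp >= 0xe000 && cp <= 0xfffd) ||
    (cp >= 0x10000 && cp <= 0x10ffff)
  );
}

console.log(isXml10Char(0x0000));  // false: C0 controls (except 9, A, D) excluded
console.log(isXml10Char(0xdead));  // false: surrogate block excluded
console.log(isXml10Char(0x1f001)); // true: astral plane characters are fine
```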
Also note that allowing U+0000 (NUL) will be interesting for a number of implementations as well as security checks, since C string APIs will assume that NUL is the end of a string.
Great reference, thanks for pointing that out. I had a similar concern about allowing U+0000 (see my 2nd comment on this issue), so I'm happy to see that concern raised elsewhere.
NULL handling is a hardcore thing :-)
In my opinion, it would be best to forbid the unescaped NULL character in the source, and to explicitly declare that the tokenizer must reject it as an error anywhere in the source.
Every reliable implementation should have a test case for this.
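A minimal sketch of the suggested guard (TypeScript; the helper is hypothetical): reject unescaped U+0000 anywhere in the source before tokenizing, with the kind of test case described above.

```typescript
// Hypothetical pre-scan guard: reject unescaped U+0000 anywhere in the
// source so downstream C-string-based tooling can never mis-handle it.
function assertNoNul(source: string): void {
  const index = source.indexOf("\u0000");
  if (index !== -1) {
    throw new SyntaxError(`illegal character U+0000 at offset ${index}`);
  }
}

// The suggested test case: a NUL embedded in an otherwise valid document.
try {
  assertNoNul('{ field(arg: "a\u0000b") }');
} catch (e) {
  console.log((e as SyntaxError).message); // illegal character U+0000 at offset 15
}
```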
### Unicode

However, with the exceptions of {StringValue} and {Comment}, most of GraphQL is
I don't really see what the point of this paragraph is. Either GraphQL source documents require non-ASCII values to be escaped, or they don't. I would remove it.
{Comment} portions of GraphQL.

### Byte Order Mark

UnicodeBOM :: "Byte Order Mark (U+FEFF)"
Like surrogates, this code point is irrelevant after decoding the encoded document to Unicode. It won't appear.
I'm borrowing this idea from the ECMAScript spec, where the rationale was based on reading crappy files resulting from concatenating UTF-16 files, and the parser breaking. This code point is irrelevant after decoding if it's the first point in the sequence, but IIUC Unicode doesn't do anything about a BOM found within a sequence?
It's up to the decoder to properly decode files before passing them to GraphQL, including decoding before concatenating.
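To illustrate the concatenation scenario (a TypeScript sketch assuming WHATWG TextDecoder semantics): decoding each file strips its own leading BOM, but concatenating the raw bytes first leaves a BOM mid-stream that survives decoding, which is the case the quoted wording defends against.

```typescript
const decode = (bytes: Uint8Array) => new TextDecoder("utf-8").decode(bytes);

// Two UTF-8 files, each starting with the encoded BOM (EF BB BF).
const fileA = new Uint8Array([0xef, 0xbb, 0xbf, 0x7b, 0x20]); // "{ "
const fileB = new Uint8Array([0xef, 0xbb, 0xbf, 0x7d]);       // "}"

// Decoding each file separately removes its leading BOM...
console.log(decode(fileA) + decode(fileB)); // "{ }" with no BOM

// ...but concatenating raw bytes first leaves a BOM mid-stream, which
// survives decoding; hence the lexer treats U+FEFF as an ignored token.
const merged = decode(new Uint8Array([...fileA, ...fileB]));
console.log(merged.includes("\uFEFF")); // true: BOM at offset 2
```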
The "Byte Order Mark" is a special Unicode character which | ||
may appear at the beginning of a file containing Unicode which programs may use | ||
to determine the fact that the text stream is Unicode, what endianness the text | ||
stream is in, and which of several Unicode encodings to interpret. | ||
|
||
GraphQL ignores this character anywhere ignored tokens may occur, regardless of | ||
if it appears at the beginning of a GraphQL document, as it may appear within a | ||
document due to file concatenation. |
nit: if you're going to keep this sentence, suggest "it may appear elsewhere within" rather than "it may appear within".
This proposes support for the entire Unicode character range in a source document, rather than explicitly excluding certain characters.
Because this doesn't add any new characters to the ignored sequences, in practice this allows for adding more characters to comments (low value) and strings (high value).
This allows strings to include literal Unicode code points as a way to represent supplemental plane characters. In fact, it almost allows ignoring escape sequences altogether, as all characters can be represented literally.
Suggested by #214
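As a concrete illustration of the literal-versus-escape point (a TypeScript sketch using JSON string syntax as a stand-in, since GraphQL's `\uXXXX` escapes follow the same four-hex-digit form): an SMP character currently requires a surrogate-pair escape, but under this proposal it may simply appear literally.

```typescript
// U+1F600 written as a UTF-16 surrogate-pair escape versus written literally.
const escaped = '"\\uD83D\\uDE00"'; // source text: "\uD83D\uDE00"
const literal = '"😀"';             // source text: "😀"

// Both spell the same decoded string value once escapes are resolved.
console.log(JSON.parse(escaped) === JSON.parse(literal)); // true
```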