[RFC] Support full Unicode character range #231


Closed

Conversation

leebyron
Collaborator

This proposes support for the entire Unicode character range in a source document, rather than explicitly excluding certain characters.

Because this doesn't add any new characters to the ignored sequences, in practice this allows for adding more characters to comments (low value) and strings (high value).

This allows strings to include literal Unicode code points as a way to represent supplemental plane characters. In fact, it almost allows ignoring escape sequences altogether, as all characters can be represented literally.

Suggested by #214

@leebyron
Collaborator Author

Drawbacks and concerns I'd like feedback on from folks with more experience dealing with Unicode corner cases:

By encouraging use of non-ASCII characters in strings, we will inevitably see UTF-8 files that include supplemental plane characters and surrogate pair code points. Will this break any common tools or create an unnecessary burden?

This removes the restriction excluding control characters from the source. This is in the name of simplicity; however, it may have real drawbacks for interaction with other tools. For example, could sources that contain U+0000 be interpreted as the end of a file? If this remains a real concern, it might be worth adding back that restriction as a point of practical convenience.

@leebyron
Collaborator Author

Check out a rendering of this patch applied here: https://ns-gxioxiaqyc.now.sh

@@ -1,6 +1,6 @@
 # B. Appendix: Grammar Summary

-SourceCharacter :: /[\u0009\u000A\u000D\u0020-\uFFFF]/
+SourceCharacter :: "Any Unicode code point from U+0000 to U+10FFFF"
Contributor

Viz. any Unicode code point.
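
For concreteness, a minimal TypeScript sketch (hypothetical helper names, not from the spec or this PR) contrasting the two definitions of {SourceCharacter} quoted above:

```typescript
// The old definition excluded most C0 control characters (and anything
// above U+FFFF was only reachable as individual UTF-16 code units).
const oldSourceCharacter = /^[\u0009\u000A\u000D\u0020-\uFFFF]$/;

// The proposed definition: any Unicode code point from U+0000 to U+10FFFF.
function isSourceCharacter(codePoint: number): boolean {
  return codePoint >= 0x0000 && codePoint <= 0x10ffff;
}

console.log(oldSourceCharacter.test("\u0000")); // false: NUL was excluded
console.log(isSourceCharacter(0x0000));         // true under this proposal
```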

A [combining character sequence](http://unicode.org/faq/char_combmark.html) is
treated as a sequence of individual Unicode code points and a sequence of
individual {SourceCharacter}, even though it may appear to a user as a
single character.
Contributor

And this introduces the potential to seriously mess with some syntax highlighters by combining characters with quotation marks and things like that. Ah well. Nothing new here, that’s how everything treats such things.

BTW, is there supposed to be a paragraph break before this sentence?

Collaborator Author

Good suggestion to add one.

[Unicode](http://unicode.org/standard/standard.html) code points (referred to in
this specification as characters). All Unicode code point values from U+0000 to
U+10FFFF, including surrogate code points, may appear in this sequence where
allowed by the grammatical rules below.
Contributor
@chris-morgan Oct 29, 2016

“including surrogate code points”: I’m not sure what you mean by this. Do you mean that you will allow things like U+DEAD? This is problematic: UTF-8 does not allow surrogates, and a language that works with UTF-8 strings will then break. Rust, for example. (And you can’t express the document in legal UTF-8 then, either.)

Collaborator Author

The hope is to be as non-restrictive as possible while being compatible with as many Unicode encodings as possible. There's a note below that the encoding should be irrelevant.

Any given Unicode encoding can represent, in some form, a stream of units between U+0000 and U+10FFFF. The fact that UTF-8 cannot use UTF-16 surrogate pairs, or that UTF-8 can encode invalid points that cannot be encoded in UTF-16, should be irrelevant.

Explicitly, we don't want to promise that the GraphQL language will perform specific Unicode operations that would be an undesirable burden for implementors who need to use existing languages with existing Unicode quirks. Hence the clauses above and below that GraphQL won't combine surrogate pairs for you nor assemble combining sequences before tokenizing.

If there's a better way to word this that's more clear, I'm definitely open to suggestions.

Contributor

The problem is that the encoding is relevant; if you allow the document to include the codepoint U+DEAD, it cannot be expressed in UTF-8. Or UTF-16, when I think about it properly; 0xDEAD does not stand alone; it’s half of the nasty mess that is encoding astral plane characters. U+DEAD cannot be expressed in UTF-16. (That’s why that whole area is blocked off.)

I would strongly advise that surrogates (codepoints like U+DEAD) be disallowed. This doesn’t mean you can’t express things like U+10000 in UTF-16—the surrogate pair there is an aspect of the encoding, and so there is no surrogate in the actual decoded string. There may be a \xDE\xAD there in this hypothetical UTF-16BE, but there’s not a U+DEAD there.
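
A small TypeScript illustration of that distinction, using the standard TextEncoder (which always produces UTF-8):

```typescript
const encoder = new TextEncoder(); // encodes to UTF-8

// U+10000 is stored as the surrogate pair \uD800\uDC00 in JavaScript's
// UTF-16 strings, but it is one code point and encodes to a single
// 4-byte UTF-8 sequence; no surrogate exists in the decoded string.
console.log(encoder.encode("\u{10000}")); // Uint8Array [240, 144, 128, 128]

// A lone surrogate such as U+DEAD is not a Unicode scalar value, so it
// has no UTF-8 encoding; the encoder substitutes U+FFFD rather than
// emit invalid bytes.
console.log(encoder.encode("\uDEAD")); // Uint8Array [239, 191, 189]
```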

@bhamiltoncx

I would be explicit that this standard reflects the contents of the source document after it's been decoded from its binary encoding (e.g. UTF-8, UTF-16, etc.) to Unicode.

As such, after decoding, there are a number of code points which just cannot appear, including surrogates.

You shouldn't even include that language. I would just remove it.


The "Byte Order Mark" is a special Unicode character which
may appear at the beginning of a file containing Unicode which programs may use
to determine the fact that the text stream is Unicode, what endianness the text
stream is in, and which of several Unicode encodings to interpret.

GraphQL ignores this character regardless of whether it appears at the beginning of a
source, as it may appear within one due to concatenation of sources.

Contributor

What about in the middle of a string or a token?

Collaborator Author

I'll clarify for those cases

line terminator, and have no significance to the semantic meaning of a GraphQL
query document.

Any Unicode code point may appear within a Comment. Comments do not include
escape sequences, so the character sequence `\n` or `\u000A` must not be
interpretted as the end of a Comment.
Contributor

s/interpretted/interpreted/
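
To make the quoted rule concrete, a sketch (hypothetical helper, not from the spec) of scanning a comment up to a literal line terminator, so the two-character sequence `\n` inside a comment is just text:

```typescript
function scanComment(source: string, start: number): string {
  // `start` indexes the `#` that opens the comment.
  let end = start + 1;
  while (end < source.length && source[end] !== "\n" && source[end] !== "\r") {
    end += 1;
  }
  return source.slice(start + 1, end); // comment body, escapes untouched
}

// The backslash-n here is two literal characters, not a line terminator,
// so it does not end the comment.
console.log(scanComment("# has \\n inside\nquery", 0)); // " has \n inside"
```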

Any Unicode code point other than those explicitly excluded may appear literally
within a String value. White-space and other characters otherwise ignored
outside of string values are significant and included. Unicode code points may
also be represented with escape sequences.
Contributor

… where escape sequences are? (Looks like \u EscapedUnicode still isn’t explained precisely or clearly?)

Collaborator Author

They're explained in the Semantics section right below here. I'm open to suggestions for improvements to that if you feel there's something specific that is not precise or clear enough.

Contributor

OK, I’m happy with that part now.

@bhamiltoncx

I would recommend a new syntax to support SMP escapes. You currently support only code points up to U+FFFF via \u EscapedUnicode, but modern languages support SMP code points via a \u{Hex} syntax which supports between 4 and 6 hexadecimal digits.

Contributor

Correction: between one and six hexadecimal digits (\u{a} is quite legal).
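
A minimal TypeScript sketch of the suggested syntax (hypothetical function, not part of this PR), accepting one to six hexadecimal digits per the correction above:

```typescript
function parseBracedUnicodeEscape(source: string, start: number): number {
  // `start` indexes the character just after the `\u` prefix.
  const close = source.indexOf("}", start);
  if (source[start] !== "{" || close === -1) {
    throw new SyntaxError("Expected braced \\u{...} escape");
  }
  const hex = source.slice(start + 1, close);
  if (!/^[0-9A-Fa-f]{1,6}$/.test(hex)) {
    throw new SyntaxError("Expected one to six hexadecimal digits");
  }
  const codePoint = parseInt(hex, 16);
  if (codePoint > 0x10ffff) {
    throw new SyntaxError("Code point beyond U+10FFFF");
  }
  return codePoint;
}

console.log(parseBracedUnicodeEscape("{1F001}", 0).toString(16)); // "1f001"
console.log(parseBracedUnicodeEscape("{a}", 0)); // 10 — \u{a} is legal
```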

@leebyron
Collaborator Author

Updated to incorporate feedback

@bhamiltoncx left a comment

At a high level, it's an interesting choice to allow unescaped Unicode only in string values.

Personally, I would just extend the escape syntax to allow \u{1F001} style escapes for SMP code points and leave it at that.

Stepping back, I feel like either everything should support Unicode (including identifiers), or the syntax should remain US-ASCII only and require escapes for Unicode literals in strings.

@@ -1,6 +1,6 @@
 # B. Appendix: Grammar Summary

-SourceCharacter :: /[\u0009\u000A\u000D\u0020-\uFFFF]/
+SourceCharacter :: "Any Unicode code point"


One thing to note is that XML 1.0 forbids many of the same C0 control characters the previous definition did:

https://www.w3.org/TR/xml/#charsets

    Char       ::=      #x9 | #xA | #xD | [#x20-#xD7FF] | [#xE000-#xFFFD] | [#x10000-#x10FFFF]  /* any Unicode character, excluding the surrogate blocks, FFFE, and FFFF. */

so if we want to maintain compatibility with XML 1.0 transport, we might want to keep that restriction so we don't mess up existing clients.
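
For reference, that production transcribes directly into a check (a sketch with a hypothetical name, not an existing API):

```typescript
// XML 1.0 Char: any Unicode character excluding most C0 controls,
// the surrogate blocks, U+FFFE, and U+FFFF.
function isXml10Char(cp: number): boolean {
  return (
    cp === 0x9 || cp === 0xa || cp === 0xd ||
    (cp >= 0x20 && cp <= 0xd7ff) ||
    (cp >= 0xe000 && cp <= 0xfffd) ||
    (cp >= 0x10000 && cp <= 0x10ffff)
  );
}

console.log(isXml10Char(0x0000));  // false: would break XML 1.0 transport
console.log(isXml10Char(0x1f001)); // true: astral characters are fine
```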


Also note that allowing U+0000 (NUL) will be interesting for a number of implementations as well as security checks, since C string APIs will assume that NUL is the end of a string.

Collaborator Author

Great reference, thanks for pointing that out. I had a similar concern about allowing U+0000 (see my second comment on this issue), so I'm happy to see that concern raised elsewhere.


NUL handling is a hardcore thing :-)
In my opinion, it would be best to forbid the unescaped NUL character in source, and to explicitly declare that a tokenizer must reject it as an error anywhere in the source.
Every reliable implementation should include a test case for this.
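
A sketch of the kind of guard and test case suggested here (hypothetical names, not from any implementation):

```typescript
// Tokenizer-level guard: reject an unescaped U+0000 anywhere in the source.
function assertNoUnescapedNul(source: string): void {
  const index = source.indexOf("\u0000");
  if (index !== -1) {
    throw new SyntaxError(`Invalid character U+0000 at position ${index}`);
  }
}

// Test case: NUL inside a string literal must be a tokenizer error.
try {
  assertNoUnescapedNul('{ field(arg: "a\u0000b") }');
  console.error("FAIL: expected a SyntaxError");
} catch (e) {
  console.log("PASS:", (e as SyntaxError).message);
}
```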


### Unicode
However, with the exceptions of {StringValue} and {Comment}, most of GraphQL is


I don't really see what the point of this paragraph is. Either GraphQL source documents require non-ASCII values to be escaped, or they don't. I would remove it.

{Comment} portions of GraphQL.
### Byte Order Mark

UnicodeBOM :: "Byte Order Mark (U+FEFF)"
@bhamiltoncx Oct 31, 2016

Like surrogates, this code point is irrelevant after decoding the encoded document to Unicode. It won't appear.

Collaborator Author

I'm borrowing this idea from the ECMAScript spec, where the rationale was based on reading crappy files resulting from concatenating UTF-16 files, and the parser breaking. This code point is irrelevant after decoding if it's the first point in the sequence, but IIUC Unicode doesn't do anything about a BOM found within a sequence?
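
A sketch of that failure mode in TypeScript, concatenating the raw bytes of two BOM-prefixed UTF-8 files before decoding:

```typescript
const BOM = [0xef, 0xbb, 0xbf]; // UTF-8 encoding of U+FEFF
const fileA = new Uint8Array([...BOM, 0x7b, 0x61, 0x7d]); // "{a}" with BOM
const fileB = new Uint8Array([...BOM, 0x7b, 0x62, 0x7d]); // "{b}" with BOM

// Naive byte-level concatenation, then a single decode.
const combined = new Uint8Array([...fileA, ...fileB]);
const text = new TextDecoder("utf-8").decode(combined);

// The decoder strips the leading BOM, but the second file's BOM survives
// as a literal U+FEFF in the middle of the document.
console.log(text.indexOf("\uFEFF")); // 3, i.e. "{a}\uFEFF{b}"
```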


It's up to the decoder to properly decode files before passing them to GraphQL, including decoding before concatenating.


The "Byte Order Mark" is a special Unicode character which
may appear at the beginning of a file containing Unicode which programs may use
to determine the fact that the text stream is Unicode, what endianness the text
stream is in, and which of several Unicode encodings to interpret.

GraphQL ignores this character anywhere ignored tokens may occur, regardless of
whether it appears at the beginning of a GraphQL document, as it may appear within a
document due to file concatenation.
Contributor

nit: if you're going to keep this sentence, suggest "it may appear elsewhere within" rather than "it may appear within".

@andimarek
Contributor

@wincent @leebyron can you give an explanation why this PR was closed? thanks
