-
Notifications
You must be signed in to change notification settings - Fork 1.2k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[Unicode] Use Unicode 9’s ID_Start
& ID_Continue
for identifiers
#1208
Comments
I'll add that ECMA 2017 has a stipulation for which Unicode Standard version must be supported: https://tc39.github.io/ecma262/#sec-conformance
No reason we couldn't add this but it's not a high priority under the current version of the spec. @bterlson Does that seem like a fair assessment? |
The intent is to use the latest available Unicode version. See tc39/ecma262#620 for some discussion. @bterlson Perhaps it would be clearer to merge that PR now, and then update to refer to the (version-less) latest version in a future PR? |
@mathiasbynens It's only a few weeks until we have the final consensus on version-less latest reference :) But I see your point. We should be tracking Unicode 9 at this point. |
@bterlson I'm going ahead with this change, but just wanted to follow up about whether consensus was reached? |
ID_Start
& ID_Continue
for identifiersID_Start
& ID_Continue
for identifiers
@mathiasbynens I took your result script and added the following lines: console.log("number of new ID_Start symbols: " + new_ID_Start.length);
console.log("number of new ID_Continue symbols: " + new_ID_Continue.length);
console.log(new_ID_Continue.length - new_ID_Start.length); And see the following counts:
That doesn't quite match with the numbers included in your original post, so I'd like to clarify to make sure we're on the same page. If I'm not mistaken, all ID_Start characters can be used as ID_Continue, but not all ID_Continue can be used as ID_Start. (If I understand correctly, ID_Continue would include numerals, which cannot be used as ID_Start.) It looks like the ES standard specifies using these Unicode character classes as they are. (Side note: I find it interesting that it falls into the realm of the Unicode standard to define what characters can be used for identifiers, given that it seems like a language-specific implementation detail.) |
@dilijev The numbers I posted were wrong, indeed. The script and the output it produces (and the numbers you logged) are correct. Here’s how I confirmed this:
I’ve updated the top post accordingly.
Interestingly, ES5 defined a list of Unicode |
Marking as External. It looks like compliance with a particular Unicode version is an external library issue on all platforms. On Windows we use For non-Windows we use ICU's Confirmed the test cases above do not work in node 6.9.1 but do work in node 7.1.0. |
Note that the different external libraries have different levels of Unicode versions supported. IIRC, the ICU version we support supports Unicode 8? Last I checked, Windows.Globalization.dll supported Unicode 6.1 or 6.2, IIRC. cc @bterlson |
Hmm in that case it might not be reasonable to wait for Windows to update their APIs for Unicode -- unless we plan to just accept that Unicode compliance on Windows will lag behind. Would it be reasonable to take a dependency on ICU for ChakraCore and/or Chakra to resolve this issue? (If we did that it would most likely be out of scope for milestone 1.4.) ICU can be used for Intl and Unicode support (although Intl support is turned off in xplat builds at the moment.) Is classification of characters a small enough component of ICU that we could implement it in our code base as a mini-ICU component rather than waiting for the entire set of Windows / ICU Unicode and Intl functions we depend on to catch up? |
Removing this from Milestone 1.4. If we wait for Windows to update the Unicode APIs, there's no action to take here. If we were to implement this functionality or change dependencies, it is too risky for this milestone. |
TC39's intent is generally that the latest version of the Unicode standard should be used in the latest version of the ECMAScript standard. Removed "Pending TC39 Consensus" label. |
Closing this issue as External: by current design these classifications are determined by calling APIs in external libraries. We can open a new issue if we decide we want to embed the ID_START and ID_CONTINUE classifications of characters directly into ChakraCore. |
ID_Start
symbols; Unicode 9 has 117,007, i.e. 7,177 more (no removals).ID_Continue
symbols; Unicode 9 has 119,691, i.e. 7,339 more (no removals).E.g. these should not throw per Unicode 9:
I’ve attached a tarball containing
results.js
which contains the full list of newID_Start
andID_Continue
symbols in Unicode 9, and the Node.js script used to generate it.unicode-9-identifiers.tar.gz
See also:
The text was updated successfully, but these errors were encountered: