Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Improve Unicode support for identifier names #1003

Merged
merged 25 commits into from
Jan 10, 2021

Conversation

jevancc
Copy link
Contributor

@jevancc jevancc commented Dec 28, 2020

This PR implements the complete syntax (except UnicodeEscapeSequence) of "Names and Keywords", also known as an identifier, in the ECMAScript spec. All the identifier name related tests (except the escaped ones) in test262 are passed in this PR.

The implementation is based on Unicode® Standard Annex with Unicode version 13.0.0. The crate unicode-general-category v0.3.0 is added to lookup character categories, and of course it implements Unicode v13.0.0. Sad to say, there are no existing update-to-date crates that implement the identifier properties lookup functions we need, so I end up building tables in this PR.

The definition of UnicodeEscapeSequence in identifier name similar to the Unicode escape in string literal. I will work on that with a minor refactor to see if we can reuse it in both parts after this PR is merged.

@codecov
Copy link

codecov bot commented Dec 28, 2020

Codecov Report

Merging #1003 (cd1b07f) into master (c3ba744) will increase coverage by 0.07%.
The diff coverage is 70.00%.

Impacted file tree graph

@@            Coverage Diff             @@
##           master    #1003      +/-   ##
==========================================
+ Coverage   58.30%   58.37%   +0.07%     
==========================================
  Files         172      173       +1     
  Lines       12008    12050      +42     
==========================================
+ Hits         7001     7034      +33     
- Misses       5007     5016       +9     
Impacted Files Coverage Δ
boa/src/syntax/lexer/identifier.rs 61.53% <69.23%> (+1.53%) ⬆️
boa_unicode/src/lib.rs 69.69% <69.69%> (ø)
boa/src/syntax/lexer/mod.rs 66.66% <75.00%> (ø)
boa/src/syntax/lexer/cursor.rs 70.93% <0.00%> (+1.97%) ⬆️

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update c3ba744...407ebc9. Read the comment docs.

@RageKnify
Copy link
Contributor

RageKnify commented Jan 1, 2021

Test262 conformance changes:

Test result master count PR count difference
Total 78,497 78,497 0
Passed 24,550 24,584 +34
Ignored 15,585 15,585 0
Failed 38,362 38,328 -34
Panics 24 24 0
Conformance 31.28 31.32 +0.04%

Numbers look good!

@RageKnify RageKnify added the lexer Issues surrounding the lexer label Jan 1, 2021
@RageKnify RageKnify requested review from Razican and Lan2u January 1, 2021 19:35
@RageKnify RageKnify added this to the v0.11.0 milestone Jan 1, 2021
Copy link
Member

@Razican Razican left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's looking pretty good. I added a couple of comments :)

boa/src/syntax/lexer/identifier_unicode_tables.rs Outdated Show resolved Hide resolved
boa/src/syntax/lexer/mod.rs Show resolved Hide resolved
@jevancc jevancc requested a review from Razican January 4, 2021 01:23
Copy link
Member

@Razican Razican left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This looks much, much cleaner now. I'm only missing a bit of documentation, but for the rest, it's good to go from my side :)

boa_unicode/src/lib.rs Show resolved Hide resolved
boa_unicode/build_tables.js Show resolved Hide resolved
boa_unicode/build_tables.js Outdated Show resolved Hide resolved
@jevancc
Copy link
Contributor Author

jevancc commented Jan 7, 2021

@Razican I'm thinking to check the Unicode version of unicode-general-category in building or testing and throw warning/error when mismatch. It is an external dependency and the implementations' Unicode versions may mismatch when that crate is updated but the identifier properties tables are not.

Where do you think we can add this check?

@jevancc jevancc force-pushed the unicode_identifier branch from b3646a3 to 0e7bf9e Compare January 7, 2021 06:42
@Razican
Copy link
Member

Razican commented Jan 7, 2021

@Razican I'm thinking to check the Unicode version of unicode-general-category in building or testing and throw warning/error when mismatch. It is an external dependency and the implementations' Unicode versions may mismatch when that crate is updated but the identifier properties tables are not.

Where do you think we can add this check?

This could be checked either in a build.rs file in boa_unicode.

@jevancc
Copy link
Contributor Author

jevancc commented Jan 7, 2021

@Razican I'm thinking to check the Unicode version of unicode-general-category in building or testing and throw warning/error when mismatch. It is an external dependency and the implementations' Unicode versions may mismatch when that crate is updated but the identifier properties tables are not.
Where do you think we can add this check?

This could be checked either in a build.rs file in boa_unicode.

It won't work. build.rs cannot access regular dependencies. It's fine to not have this check, just manually check it every time before updating the dependency unicode-general-category.

@Razican
Copy link
Member

Razican commented Jan 7, 2021

It won't work. build.rs cannot access regular dependencies. It's fine to not have this check, just manually check it every time before updating the dependency unicode-general-category.

You can also add it as a "build" dependency and check it, but it's true that those versions can get out of sync. Could we maybe add a test in boa_unicode that fails if versions are out of sync? That way, when we execute the test suite, it will fail.

boa_unicode/build_tables.js Show resolved Hide resolved
@jevancc jevancc requested a review from Razican January 10, 2021 06:47
@jevancc jevancc force-pushed the unicode_identifier branch from 0856405 to 407ebc9 Compare January 10, 2021 07:01
Copy link
Member

@Razican Razican left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the changes, this looks very good to me now :)

@Razican Razican merged commit f66a099 into boa-dev:master Jan 10, 2021
@RageKnify RageKnify mentioned this pull request Jan 10, 2021
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
lexer Issues surrounding the lexer
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants