-
-
Notifications
You must be signed in to change notification settings - Fork 34.1k
readline: use icu based string width calculation #9040
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
|
Technically semver-minor because it allows now a single codepoint to be passed in as the argument to |
|
This change would make intl and non-intl builds behave differently, wouldn't it? I think I'd rather have a single semi-broken algorithm than two diverging implementations. As well, the non-intl path will be untested until nodejs/build#419 is resolved. |
|
Intl and non-intl builds already behave differently in many ways. This On Wednesday, October 12, 2016, Ben Noordhuis notifications@github.com
|
Oh? In what way?
"Everyone is doing it" is never a valid argument. V8 is not under our direct control but this is. |
Off the top of my head:
There's likely more. It's also not just about a more accurate implementation, performance is also significantly improved with the new algorithm. this PR: vs. latest v6: (the benchmark test benchmark/misc/stringwidth.js is not part of this PR currently, it's local on my machine) |
lib/internal/readline.js
Outdated
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
doesn't need /g?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
it uses /g.
lib/internal/readline.js
Outdated
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
maybe we should initially assign no-ops with comments?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
how come? that does not seem very practical.
lib/internal/readline.js
Outdated
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
not detectable from process.config?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Using process.config is unreliable because there are userland packages that override it. That's why process.binding('config') was introduced.
lib/internal/readline.js
Outdated
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
shouldn't this remain isNaN() ... or rather, Number.isNaN()?
lib/readline.js
Outdated
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
maybe add a comment about these values?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
is the existing comment on line 128 and 129 not enough?
c133999 to
83c7a88
Compare
cc56bd3 to
5e5522c
Compare
|
New CI: https://ci.nodejs.org/job/node-test-pull-request/4611/ |
5e5522c to
bf886f8
Compare
|
Only unrelated flaky failures in CI. @nodejs/collaborators would appreciate additional review. |
bnoordhuis
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
FWIW, I still don't think this should land until we have a proper non-intl buildbot. So far, almost every intl-related change broke the non-intl build in one way or another.
lib/internal/readline.js
Outdated
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This regex could use a comment explaining what it tries to match/capture. Also, four space indent.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
four space indent would make the line longer than 80 chars, and splitting the regex into multiple lines would make it less readable. I'd prefer to leave the spacing as is.
src/node_i18n.cc
Outdated
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Style: s/ambiguousFull/ambiguous_full/ here and elsewhere. I don't like the name very much, it doesn't really convey what it does.
src/node_i18n.cc
Outdated
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why UCHAR_EAST_ASIAN_WIDTH? This needs an explaining comment.
src/node_i18n.cc
Outdated
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If the fall-through is intentional, can you add // fallthrough comments?
src/node_i18n.cc
Outdated
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Maybe put braces around the body, slightly easier to read.
src/node_i18n.cc
Outdated
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Style: snake_case
src/node_i18n.cc
Outdated
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The reinterpret_cast shouldn't be necessary, should it?
src/node_i18n.cc
Outdated
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Braces?
|
Updated to address the nits. |
|
@bnoordhuis ... added a nointl CI job we can run manually: https://ci.nodejs.org/job/node-test-commit-linux-nointl/1/ |
|
@jasnell: I don't think it should be added to the current job as-is. There's way too many workers involved. Do we need to test this on every linux flavor? I was hoping to have a small subset of workers for build permutation tests. |
Updated. Needs another review from @bnoordhuis
bnoordhuis
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM with a comment and a style nit.
src/node_i18n.cc
Outdated
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can you line up the argument?
src/node_i18n.cc
Outdated
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
TwoByteValue's constructor takes a Local<Value>. It's arguably wrong to cast because there is no check that it's really a string so I'd just let the constructor deal with this.
srl295
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks good, with a couple of minor comments
src/node_i18n.cc
Outdated
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
U+200D is a ZWJ (zero width joiner), please use that term
src/node_i18n.cc
Outdated
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This seems like a reasoable way of doing this calculation
src/node_i18n.cc
Outdated
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It may be worth referencing something like http://www.unicode.org/reports/tr11/#Ambiguous here
5dcbf74 to
924eef3
Compare
|
Updated with the final nits addressed. New CI: https://ci.nodejs.org/job/node-test-pull-request/4655/ |
924eef3 to
518c711
Compare
|
had a compile nit pop up on windows... trying again https://ci.nodejs.org/job/node-test-pull-request/4664/ |
Rather than the pseudo-wcwidth impl used currently, use the ICU character properties database to calculate string width and determine if a character is full width or not. This allows the algorithm to correctly identify emoji's as full width, ensures the algorithm will continue to fucntion properly as new unicode codepoints are added, and it's faster. This was originally part of a proposal to add a new unicode module, but has been split out. Refs: nodejs#8075
518c711 to
876bb83
Compare
|
CI is green except for unrelated flaky failures. Landing |
Rather than the pseudo-wcwidth impl used currently, use the ICU character properties database to calculate string width and determine if a character is full width or not. This allows the algorithm to correctly identify emoji's as full width, ensures the algorithm will continue to fucntion properly as new unicode codepoints are added, and it's faster. This was originally part of a proposal to add a new unicode module, but has been split out. Refs: #8075 PR-URL: #9040 Reviewed-By: Ben Noordhuis <info@bnoordhuis.nl> Reviewed-By: Steven R Loomis <srloomis@us.ibm.com>
|
Landed in 72547fe |
|
Shrinking it down ought to be fine. I'll do that this next week On Sunday, October 23, 2016, Johan Bergström notifications@github.com
|
|
@jasnell g++ spotted a bug: I think you need a U16_NEXT call outside the loop. |
|
Hmm.. Ok, away from the laptop at the moment but will look at that shortly |
|
@bnoordhuis good catch - otherwise A |
Rather than the pseudo-wcwidth impl used currently, use the ICU character properties database to calculate string width and determine if a character is full width or not. This allows the algorithm to correctly identify emoji's as full width, ensures the algorithm will continue to fucntion properly as new unicode codepoints are added, and it's faster. This was originally part of a proposal to add a new unicode module, but has been split out. Refs: #8075 PR-URL: #9040 Reviewed-By: Ben Noordhuis <info@bnoordhuis.nl> Reviewed-By: Steven R Loomis <srloomis@us.ibm.com>
|
/cc @srl295 currently passing on this being backported to v6.x. Do you think it should be considered? |
|
I'm not @srl295, of course, but I'll chime in to say that I see no pressing reason to backport this |
|
no pressing reason |
Checklist
make -j8 test(UNIX), orvcbuild test nosign(Windows) passesAffected core subsystem(s)
readline, internal
Description of change
Rather than the pseudo-wcwidth impl used currently, use the ICU character properties database to calculate string width and determine if a character is full width or not. This allows the algorithm to correctly identify emoji's as full width, ensures the algorithm will continue to fucntion properly as new unicode codepoints are added, and it's faster.
This was originally part of a proposal to add a new unicode module, but has been split out.
Refs: #8075