Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add support for multibyte characters #10

Open
wants to merge 3 commits into
base: main
Choose a base branch
from

Conversation

talltyler
Copy link

All functions now operate based on character arrays rather than indexing characters from within the string.
The reason why this is important is 'πŸ‘±πŸ½β€β™‚οΈ'.length is 7 not 1 so accessing string[0] through string[6] would be modifying the bytes in the middle of this single character which also messed with your logic related to how reordering should work.

This seems to fix #9
but the problem can be seen more clearly in this modified version of their example where I changed their emoji to the one above. https://codepen.io/talltyler/pen/jOomxwz

To get this working I'm using Intl.Segmenter which is newish but has pretty good support across browsers along with Node 16 and Deno.
https://caniuse.com/mdn-javascript_builtins_intl_segmenter
If you would rather use something that is more compatible with older environments I can make a change but the only other way I know how to do it is with a huge regex that covers all of the ranges that can connect.

I tried to modify the comments and keep the current api working but if you are ok changing this a few of these functions can be removed.

@tombigel
Copy link

@talltyler thanks for picking up my issue! I wish I had the knowledge to fix it myself...
I've built your fork locally, ran the tests and it fails on a lot of them, are you aware of that?

Also, the original unicode bidi tests lack tests with multibyte characters which is why this implementation didn't pick it up, we need to add some here if we want @lojjic to approve this quicker, I'll try to add a PR with some cases based on BidiCharacterTests.txt "Sequences containing..." sections in the next couple of days

@talltyler
Copy link
Author

I'll spend some time today seeing if I can fix the tests. I don't know much about the details of these cases but hopefully I can find a pattern.

05D0 0028 0332 05D1 0029 0333;0;0;1 1 1 1 1 1;5 4 3 2 1 0
0661 0028 0662 0029 0331;0;0;2 1 2 1 1;4 3 2 1 0
0661 0028 0332 0662 0029 0333;0;0;2 1 1 2 1 1;5 4 3 2 1 0
0061 0028 0062 0029 0331;1;1;2 2 2 1;3 0 1 2

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I believe you are not supposed to touch this file it is imported directly from the Unicode bidi spec

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do you think BidiCharacterTest.js should also use string.split('')

@talltyler
Copy link
Author

I've made a few changes and the tests are passing but I think this could still be better.

There are multibyte characters in the tests already, the problem is my changes combine things like \u036F with the characters that are before them into a single item in the character array being passed around. The test cases assume these multibyte characters are broken out into their parts so the tests and the values they are testing are off.

I've fixed the cases in BidiCharacterTest.txt but their are 49038 cases in BidiTest.txt that are off.
I assume these tests were generated and I'm hoping there something we can do to regenerate them?

To get the tests in BidiCharacterTest.js to pass I'm not using the stringToArray function, it is just string.split('').
From my understanding this means the changes in this PR work as intended but the test cases need to be updated. Does that sound right? If so I would love some info about how to best go about doing this.

* @return {string[]} - the string broken down into an array of characters.
*/
export function stringToArray (string) {
return [...new Intl.Segmenter().segment(string)].map(x => x.segment);
Copy link

@1ec5 1ec5 Aug 24, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Intl.Segmenter segments by grapheme clusters by default. This is possibly overkill for fixing #9. In some scripts, it will group whole syllables of multiple characters. If the issue is just that surrogate pairs are getting split up, then use the spread operator ([...string]) or the string[Symbol.iterator]() iterator directly on the string you want to split, as described in this documentation:

return [...string];

The standard iterator will also get you whole Unicode codepoints:

for (const char of string) {
  const codePoint = char.codePointAt(0); // never a lone surrogate
}

The string iterator is much more widely supported than Intl.Segmenter, which only landed in Firefox a few months ago.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

getReorderedString of mixed RTL, LTR and Emoji text sometimes differs from browser bidi behavior
3 participants