Add support for multibyte characters #10

talltyler · 2024-06-04T13:20:41Z

All functions now operate based on character arrays rather than indexing characters from within the string.
The reason why this is important is '👱🏽‍♂️'.length is 7 not 1 so accessing string[0] through string[6] would be modifying the bytes in the middle of this single character which also messed with your logic related to how reordering should work.

This seems to fix #9
but the problem can be seen more clearly in this modified version of their example where I changed their emoji to the one above. https://codepen.io/talltyler/pen/jOomxwz

To get this working I'm using Intl.Segmenter which is newish but has pretty good support across browsers along with Node 16 and Deno.
https://caniuse.com/mdn-javascript_builtins_intl_segmenter
If you would rather use something that is more compatible with older environments I can make a change but the only other way I know how to do it is with a huge regex that covers all of the ranges that can connect.

I tried to modify the comments and keep the current api working but if you are ok changing this a few of these functions can be removed.

… based on character arrays

tombigel · 2024-06-25T19:04:08Z

@talltyler thanks for picking up my issue! I wish I had the knowledge to fix it myself...
I've built your fork locally, ran the tests and it fails on a lot of them, are you aware of that?

Also, the original unicode bidi tests lack tests with multibyte characters which is why this implementation didn't pick it up, we need to add some here if we want @lojjic to approve this quicker, I'll try to add a PR with some cases based on BidiCharacterTests.txt "Sequences containing..." sections in the next couple of days

talltyler · 2024-06-26T15:25:35Z

I'll spend some time today seeing if I can fix the tests. I don't know much about the details of these cases but hopefully I can find a pattern.

tombigel · 2024-06-26T18:47:43Z

test/BidiCharacterTest.txt

-05D0 0028 0332 05D1 0029 0333;0;0;1 1 1 1 1 1;5 4 3 2 1 0
-0661 0028 0662 0029 0331;0;0;2 1 2 1 1;4 3 2 1 0
-0661 0028 0332 0662 0029 0333;0;0;2 1 1 2 1 1;5 4 3 2 1 0
+0061 0028 0062 0029 0331;1;1;2 2 2 1;3 0 1 2


I believe you are not supposed to touch this file it is imported directly from the Unicode bidi spec

Do you think BidiCharacterTest.js should also use string.split('')

talltyler · 2024-06-26T18:49:01Z

I've made a few changes and the tests are passing but I think this could still be better.

There are multibyte characters in the tests already, the problem is my changes combine things like \u036F with the characters that are before them into a single item in the character array being passed around. The test cases assume these multibyte characters are broken out into their parts so the tests and the values they are testing are off.

I've fixed the cases in BidiCharacterTest.txt but their are 49038 cases in BidiTest.txt that are off.
I assume these tests were generated and I'm hoping there something we can do to regenerate them?

To get the tests in BidiCharacterTest.js to pass I'm not using the stringToArray function, it is just string.split('').
From my understanding this means the changes in this PR work as intended but the test cases need to be updated. Does that sound right? If so I would love some info about how to best go about doing this.

1ec5 · 2024-08-24T17:16:25Z

src/util/stringToArray.js

+ * @return {string[]} - the string broken down into an array of characters.
+ */
+export function stringToArray (string) {
+  return [...new Intl.Segmenter().segment(string)].map(x => x.segment);


Intl.Segmenter segments by grapheme clusters by default. This is possibly overkill for fixing #9. In some scripts, it will group whole syllables of multiple characters. If the issue is just that surrogate pairs are getting split up, then use the spread operator ([...string]) or the string[Symbol.iterator]() iterator directly on the string you want to split, as described in this documentation:

return [...string];

The standard iterator will also get you whole Unicode codepoints:

for (const char of string) { const codePoint = char.codePointAt(0); // never a lone surrogate }

The string iterator is much more widely supported than Intl.Segmenter, which only landed in Firefox a few months ago.

talltyler added 2 commits June 4, 2024 08:53

This adds support for multibyte characters, all functions now operate…

18fdc9d

… based on character arrays

misspelling

c696aab

trying to fix the unit tests

0416b8a

tombigel reviewed Jun 26, 2024

View reviewed changes

1ec5 reviewed Aug 24, 2024

View reviewed changes

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Add support for multibyte characters #10

Add support for multibyte characters #10

talltyler commented Jun 4, 2024

tombigel commented Jun 25, 2024

talltyler commented Jun 26, 2024

tombigel Jun 26, 2024

talltyler Jun 26, 2024

talltyler commented Jun 26, 2024

1ec5 Aug 24, 2024 •

edited

Loading

Add support for multibyte characters #10

Are you sure you want to change the base?

Add support for multibyte characters #10

Conversation

talltyler commented Jun 4, 2024

tombigel commented Jun 25, 2024

talltyler commented Jun 26, 2024

tombigel Jun 26, 2024

Choose a reason for hiding this comment

talltyler Jun 26, 2024

Choose a reason for hiding this comment

talltyler commented Jun 26, 2024

1ec5 Aug 24, 2024 • edited Loading

Choose a reason for hiding this comment

1ec5 Aug 24, 2024 •

edited

Loading