Use workaround to fix many collation failures in ICU4C. #475

sven-oly · 2025-07-11T23:08:59Z

Test generation in Python writes SMP characters (code points above 0xFFFF) as \ud8** surrogate pairs. This works for ICU4J, ICU4X, and NodeJS, but fails with the current CPP JSON handling.
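
To illustrate the escaping described above, here is a minimal sketch using Python's standard `json` module. It assumes the generator serializes with the default `ensure_ascii=True`, which this PR does not confirm, and U+1F600 is just an arbitrary SMP code point picked for the example.

```python
import json

# U+1F600 is an arbitrary SMP code point chosen for illustration.
smp_char = "\U0001F600"

# With the default ensure_ascii=True, json.dumps escapes non-ASCII text,
# and a code point above 0xFFFF becomes a UTF-16 surrogate pair.
encoded = json.dumps({"s1": smp_char})
print(encoded)  # {"s1": "\ud83d\ude00"}

# A conforming JSON parser recombines the pair into a single code point.
decoded = json.loads(encoded)["s1"]
print(len(decoded), hex(ord(decoded)))  # 1 0x1f600
```

A parser that hands back the two surrogate halves unpaired, or drops them, would give the collator a different string than the one the test intended.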

This workaround adds two new items containing the hex codes of the characters in s1 and s2 that are to be compared. They are set for the NON_IGNORABLE and SHIFTED data files.
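
A rough sketch of what adding such items could look like on the generator side. The key names `s1_hex` and `s2_hex`, the space-separated uppercase format, and both helper functions are hypothetical; the actual names and encoding are whatever the PR's schema defines.

```python
def to_hex_codes(text: str) -> str:
    """Return the code points of text as space-separated uppercase hex."""
    return " ".join(f"{ord(ch):04X}" for ch in text)


def add_hex_fields(test: dict) -> dict:
    """Return a copy of the test case with hypothetical hex items added."""
    result = dict(test)
    # "s1_hex" and "s2_hex" are illustrative names, not the PR's actual keys.
    result["s1_hex"] = to_hex_codes(test["s1"])
    result["s2_hex"] = to_hex_codes(test["s2"])
    return result


case = add_hex_fields({"label": "0001", "s1": "a", "s2": "\U00010400"})
print(case["s1_hex"], case["s2_hex"])  # 0061 10400
```

An executor could then rebuild each string from the hex values instead of relying on its JSON parser to decode surrogate-pair escapes.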

This is admittedly a hack, but it shows that the tests with SMP characters were not working. With it, more than 1200 previously failing tests now pass.

sven-oly · 2025-07-16T18:02:41Z

I want to submit this, noting the hack as an issue to be resolved. WDYT?

sffc

I have mixed feelings about this PR; it's nice that it gets C++ passing more test cases, but it is an inelegant solution that impacts the test data for all implementations.

There is no apparent reason why json-c should not be able to handle supplementary code points automatically. We walked through the code together on Monday.

I will approve it, but please open a follow-up issue to clean up the tech debt.

Use workaround to fix many collation failures in ICU4C.

5e92eae

sven-oly assigned echeran and sffc Jul 11, 2025

sffc approved these changes Jul 16, 2025

sven-oly and others added 4 commits July 16, 2025 13:26

Fix schema types

19f3ec5

Merge branch 'main' into collation_CPP

466a8d2

Another fix to schema

0da424b

Another fix

222e734

sven-oly merged commit dfbc189 into unicode-org:main Jul 16, 2025
8 checks passed
