Conversation

@mcdurdin
Member

@mcdurdin mcdurdin commented Jan 28, 2026

When normalizing, we need to stop processing on an NFC boundary, not an NFD boundary, to support normalizations such as in Bengali, where appending U+09D7 to a context of U+0995 U+09C7 should result in U+0995 U+09CC.
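
To illustrate why the choice of boundary matters, here is a minimal, self-contained sketch. It is not the code in this PR: the composition table is a hand-written toy covering only the two Bengali pairs involved, and the names `compose_pair`, `has_nfd_boundary_before`, `has_nfc_boundary_before`, and `append_and_compose` are hypothetical. A real implementation would take these properties from a Unicode library rather than hard-coding them.

```cpp
#include <cstddef>
#include <string>

// Toy data (assumption: only these two canonical compositions are modelled).
// U+09C7 + U+09BE -> U+09CB, U+09C7 + U+09D7 -> U+09CC.
struct Pair { char32_t first, second, composed; };
static const Pair kPairs[] = {
  {0x09C7, 0x09BE, 0x09CB},
  {0x09C7, 0x09D7, 0x09CC},
};

// Returns the composed character for (a, b), or 0 if the pair does not compose.
char32_t compose_pair(char32_t a, char32_t b) {
  for (const Pair &p : kPairs) {
    if (p.first == a && p.second == b) return p.composed;
  }
  return 0;
}

// NFD view: the characters used here do not decompose or reorder,
// so every position counts as an NFD boundary.
bool has_nfd_boundary_before(char32_t) { return true; }

// NFC view: U+09BE and U+09D7 may combine with a preceding U+09C7,
// so there is no NFC boundary immediately before them.
bool has_nfc_boundary_before(char32_t ch) {
  return ch != 0x09BE && ch != 0x09D7;
}

using BoundaryFn = bool (*)(char32_t);

// Appends `output` to `context`, then recomposes only the text from the
// nearest boundary at or before the start of `output`.
std::u32string append_and_compose(const std::u32string &context,
                                  const std::u32string &output,
                                  BoundaryFn boundary_before) {
  std::u32string text = context + output;
  std::size_t start = context.size();
  // Walk back until the character at `start` has a boundary before it.
  while (start > 0 && start < text.size() && !boundary_before(text[start])) {
    --start;
  }
  std::u32string result = text.substr(0, start);
  for (std::size_t i = start; i < text.size(); ++i) {
    // Never compose across the boundary: only look at characters added
    // to `result` after position `start`.
    char32_t composed =
        (result.size() > start) ? compose_pair(result.back(), text[i]) : 0;
    if (composed != 0) {
      result.back() = composed;
    } else {
      result.push_back(text[i]);
    }
  }
  return result;
}
```

With `has_nfd_boundary_before`, processing stops right before the appended U+09D7, so the preceding U+09C7 is never revisited and the text stays as U+0995 U+09C7 U+09D7. With `has_nfc_boundary_before`, the walk continues back to U+09C7 and the pair composes to U+09CC.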

The specification is unclear on this; see https://unicode-org.atlassian.net/browse/CLDR-19218

This also updates the ldml keyboard unit test suite to support running in full NFC mode (used in all Engine implementations) as well as retaining the NFD mode (now only used by the debugger).

Side note: the Bengali normalization failure case was picked up by the improvements to the unit test suite, proving once again that good tests are so valuable.

Fixes: #15491
Fixes: #15505
Follows: #15488
Relates-to: CLDR-19218
Build-bot: release:windows,linux,mac

User Testing

Tests should be run with the following keyboard: bn_ldml.zip

  • GROUP_WINDOWS: Test on Windows

  • GROUP_MAC: Test on macOS

  • GROUP_LINUX: Test on Linux

  • TEST_NORMALIZATION: Type deA. Copy the output and paste it into a character viewer. The output should be U+09A6 U+09CC.

@github-project-automation github-project-automation bot moved this to Todo in Keyman Jan 28, 2026
@keymanapp-test-bot keymanapp-test-bot bot added the has-user-test and user-test-missing labels Jan 28, 2026
@keymanapp-test-bot

keymanapp-test-bot bot commented Jan 28, 2026

User Test Results

Test specification and instructions

  • ✅ GROUP_WINDOWS: Test on Windows

    1 tests PASSED
  • ✅ GROUP_MAC: Test on macOS

    1 tests PASSED
  • ✅ GROUP_LINUX: Test on Linux

    1 tests PASSED
    • TEST_NORMALIZATION (PASSED): verified okay, U+09A6 U+09CC result.

@keymanapp-test-bot keymanapp-test-bot bot added this to the A19S21 milestone Jan 28, 2026
@github-actions github-actions bot added the core/ and fix labels Jan 28, 2026
@keymanapp-test-bot keymanapp-test-bot bot added the user-test-required label and removed the user-test-missing label Jan 28, 2026
@mcdurdin mcdurdin requested review from ermshiperete and srl295 and removed request for srl295 January 28, 2026 03:29
@mcdurdin mcdurdin force-pushed the fix/core/15491-15505-bengali-normalization-and-tests branch from f2cde17 to 82083f8 Compare January 28, 2026 03:30
@mcdurdin mcdurdin force-pushed the fix/core/15491-15505-bengali-normalization-and-tests branch from 82083f8 to 766c699 Compare January 28, 2026 03:32
Contributor

Does this write NFC now?

Member Author

yes, missed that, good catch!

Comment on lines 745 to 746
return EXIT_FAILURE;
return rc;
Contributor

Suggested change
- return EXIT_FAILURE;
- return rc;
+ return EXIT_FAILURE;


void print_context(std::u16string &text_store, km_core_state *&test_state, std::vector<km_core_context_item> &test_context);

bool g_beep_found = false;
Contributor

It might be good to set g_beep_found = false; at the beginning of run_test to make things more robust.
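
A minimal sketch of that suggestion, assuming a much simplified harness: only `g_beep_found` and the name `run_test` come from the review; the signature of `run_test`, its parameters, and the `on_beep` helper are hypothetical stand-ins for illustration.

```cpp
// Global flag shared across test runs, as in the snippet under review.
bool g_beep_found = false;

// Hypothetical action handler: records that the keyboard emitted a beep.
void on_beep() { g_beep_found = true; }

// Hypothetical, simplified run_test: returns true when the observed beep
// state matches what the test case expects.
bool run_test(bool keyboard_beeps, bool expect_beep) {
  g_beep_found = false;  // reset first so an earlier run cannot leak state
  if (keyboard_beeps) {
    on_beep();
  }
  return g_beep_found == expect_beep;
}
```

Without the reset, a test that beeps would leave the flag set, and a later test that expects no beep could fail for the wrong reason.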

@Meng-Heng
Contributor

TEST_NORMALIZATION: Type dea. Copy the output and paste it into a character viewer. The output should be U+09A6 U+09CC.

@mcdurdin, could you confirm that U+09CC is supposed to be U+09CB?

@mcdurdin
Member Author

@mcdurdin, could you confirm that U+09CC is supposed to be U+09CB?

The output is supposed to be U+09CC. Apologies, the keystrokes should be deA. My mistake. I will update the test spec too.

@keyman-server keyman-server modified the milestones: A19S21, A19S22 Jan 31, 2026
Co-authored-by: Darcy Wong <darcy_wong@sil.org>
Co-authored-by: Eberhard Beilharz <ermshiperete@users.noreply.github.com>
@Meng-Heng
Contributor

Test Prerequisites

  1. Install Keyman v19.0.193-alpha-test-15506 & bn_ldml keyboard

Test Results

GROUP_WINDOWS:

Test Specs:

  1. macOS Sonoma
  2. Windows 11 AMD64 on Virtual Box
  3. Notepad
  4. Chrome v144.0.7559.110
  • TEST_NORMALIZATION (PASSED):
  1. Launch Notepad and Chrome URL
  2. Type deA
  3. Copy and paste the characters to https://unicode.scarfboy.com/
  4. Verify: The output is U+09A6 U+09CC.

GROUP_MAC:

Test Specs:

  1. macOS Sequoia
  2. TextEdit v1.20
  3. Chrome v144.0.7559.110
  • TEST_NORMALIZATION (PASSED):
  1. Launch TextEdit and Chrome (Google Docs)
  2. Type deA
  3. Copy and paste the characters to https://unicode.scarfboy.com/?s=%E0%A6%A6%E0%A7%8C
  4. Verify: The output is U+09A6 U+09CC.

@mcdurdin
Member Author

mcdurdin commented Feb 3, 2026

Test Results

Tested on Ubuntu 24.04 X11 with Gnome Text Editor.

GROUP_LINUX: Test on Linux

  • TEST_NORMALIZATION (PASS): verified okay, U+09A6 U+09CC result.

@keymanapp-test-bot keymanapp-test-bot bot removed the user-test-required label Feb 3, 2026
mcdurdin added a commit that referenced this pull request Feb 3, 2026
Fixes: #15491
Fixes: #15505
Follows: #15488
Cherry-pick-of: #15506
Relates-to: CLDR-19218
Base automatically changed from fix/core/15487-bksp to master February 4, 2026 02:53
@mcdurdin mcdurdin merged commit 730d9df into master Feb 4, 2026
23 checks passed
@mcdurdin mcdurdin deleted the fix/core/15491-15505-bengali-normalization-and-tests branch February 4, 2026 02:53
@github-project-automation github-project-automation bot moved this from Todo to Done in Keyman Feb 4, 2026
@keyman-server
Collaborator

Changes in this pull request will be available for download in Keyman version 19.0.198-alpha

Projects

Status: Done

Development

Successfully merging this pull request may close these issues.

bug(core): normalization does not run on whole Bengali syllable in actions_normalize
chore(core): ldml.cpp tests only working on NFD, need NFC mode

5 participants