Support DDLs in complex script segmentation models #3411

sffc · 2023-05-04T21:39:11Z

The segmentation models in ICU4X (and ICU4C) are trained on the most widely used language in each script (Han, Thai, Khmer, Lao, and Myanmar). They do not work very well for digitally disadvantaged languages (DDLs) that share those same scripts, such as Cantonese (Han script), So (Thai script), and Shan (Myanmar script).

Since we are now able to use ML models that can carry context throughout an entire string, it should be possible to train a model that can accurately find breakpoints for an arbitrary string in a given script. In effect, the segmentation model would learn to do language detection implicitly, as part of finding the breakpoints.
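To illustrate the limitation with a toy example: the sketch below is not the ICU4X or ICU4C implementation (those use trained models or real dictionaries). It is a minimal greedy longest-match segmenter, and `LANG_A` and `LANG_B` are hypothetical placeholder vocabularies standing in for two languages that share a script. The same string gets different breakpoints depending on which language's vocabulary the segmenter was built from.

```python
def segment(text, dictionary):
    """Greedy longest-match segmentation; returns the list of break offsets."""
    max_len = max(map(len, dictionary))
    breaks = [0]
    i = 0
    while i < len(text):
        # Try the longest candidate first; fall back to a single character
        # when nothing in the dictionary matches at this position.
        for size in range(min(max_len, len(text) - i), 0, -1):
            if size == 1 or text[i:i + size] in dictionary:
                i += size
                breaks.append(i)
                break
    return breaks


# Hypothetical vocabularies: two "languages" written in the same "script".
LANG_A = {"abc", "de", "fgh"}   # stands in for the dominant language
LANG_B = {"ab", "cdef", "gh"}   # stands in for a DDL sharing the script

text = "abcdefgh"  # intended as language-B text: ab | cdef | gh

breaks_with_a = segment(text, LANG_A)  # segmenter built for language A
breaks_with_b = segment(text, LANG_B)  # segmenter built for language B
print(breaks_with_a, breaks_with_b)
```

A segmenter built only from `LANG_A` puts the breaks in the wrong places for `LANG_B` text, even though every character is valid in the shared script. A context-carrying model trained on both languages would have to infer which one it is looking at.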

sffc added the T-bug (Type: Bad behavior, security, privacy), C-segmentation (Component: Segmentation), and S-epic (Size: Major project, create smaller child issues) labels May 4, 2023

sffc added this to the 1.x Priority ⟨P2⟩ milestone May 11, 2023

sffc assigned younies May 11, 2023

sffc mentioned this issue Jul 12, 2023

Support -u-dx flag #3668

Open
