Support DDLs in complex script segmentation models #3411
Labels
C-segmentation
Component: Segmentation
S-epic
Size: Major project (create smaller child issues)
T-bug
Type: Bad behavior, security, privacy
The segmentation models in ICU4X (and ICU4C) are trained on the most widely used language in each script (Han, Thai, Khmer, Lao, and Myanmar). As a result, they do not work well for digitally disadvantaged languages (DDLs) that share those scripts, such as Cantonese (Han script), So (Thai script), and Shan (Myanmar script).
Since we can now use ML models that carry context across an entire string, it should be possible to train a model that accurately finds breakpoints in an arbitrary string in a given script, whichever language the string is written in. In effect, the segmentation model would learn to do language detection at the same time.