Support DDLs in complex script segmentation models #3411

Open
sffc opened this issue May 4, 2023 · 0 comments
Labels
C-segmentation Component: Segmentation S-epic Size: Major project (create smaller child issues) T-bug Type: Bad behavior, security, privacy

sffc commented May 4, 2023

The segmentation models in ICU4X (and ICU4C) are each trained on the most widely used language of a given script (Han, Thai, Khmer, Lao, and Myanmar). They do not work well for digitally disadvantaged languages (DDLs) that share those scripts, such as Cantonese (Han script), So (Thai script), and Shan (Myanmar script).

Since we can now use ML models that carry context across an entire string, it should be possible to train a model that accurately finds breakpoints in an arbitrary string in a given script, regardless of which language it is written in. In effect, the segmentation model would learn to do language detection at the same time as it segments.

@sffc sffc added T-bug Type: Bad behavior, security, privacy C-segmentation Component: Segmentation S-epic Size: Major project (create smaller child issues) labels May 4, 2023
@sffc sffc added this to the 1.x Priority ⟨P2⟩ milestone May 11, 2023