Investigate BudouX and consider using it for CJK+ segmentation #1803

Open
sffc opened this issue Apr 14, 2022 · 7 comments
Labels
C-segmentation Component: Segmentation help wanted Issue needs an assignee question Unresolved questions; type unclear

Comments

@sffc
Member

sffc commented Apr 14, 2022

BudouX is a new project out of Google for CJ segmentation with a focus on data size reduction. We should investigate it as an option for ICU4X.

https://github.com/google/budoux

The docs say that it may also be scalable to other languages. I think we should continue with the LSTM approach for Thai/Lao/Khmer/Burmese, but it would be worth investigating what BudouX could bring to the table in that case.

CC @hiroyuki-komatsu @aethanyc @makotokato

@sffc sffc added question Unresolved questions; type unclear C-segmentation Component: Segmentation labels Apr 14, 2022
@makotokato
Member

@echeran
Contributor

echeran commented May 26, 2022

There is an existing Rust port: https://github.com/sg0hsmt/budoux-rs

@makotokato
Member

BudouX's segmentation rules seem to differ from those of other segmenters. When a noun is followed by a particle, postposition, or similar function word, it returns them combined as a single segment.

Using a dictionary (which cannot cover every word, and whose data is too big) or morphological analysis:

今日 / は / いい / 天気 / です。

Using BudouX (the data size is small even as JSON):

今日は / いい / 天気です。

In this example, 今日 (today) and は (a particle marking the topic) are two words under a strict word rule, but BudouX treats them as one segment. However, what counts as a "word segment" is ambiguous in Japanese, so both results are acceptable.
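To make the relationship between the two granularities concrete, here is a minimal Python sketch. It is an illustration only and is not how BudouX works internally (BudouX uses a small learned model, not a word list). The `ATTACHING` set and the `merge_into_phrases` helper are hypothetical names, hand-picked for this one sentence; the sketch simply shows that the phrase-level output can be read as the morpheme-level output with function words glued onto the preceding token.

```python
# Illustration only: NOT BudouX's algorithm. It shows how the phrase-level
# segmentation relates to the morpheme-level one for the example sentence.

# Hypothetical, hand-picked tokens that attach to the preceding token.
ATTACHING = {"は", "です。"}


def merge_into_phrases(morphemes):
    """Glue each attaching token onto the token that precedes it."""
    phrases = []
    for token in morphemes:
        if phrases and token in ATTACHING:
            phrases[-1] += token  # e.g. 今日 + は -> 今日は
        else:
            phrases.append(token)
    return phrases


# Morpheme-level segmentation, as in the dictionary example above.
morphemes = ["今日", "は", "いい", "天気", "です。"]
phrases = merge_into_phrases(morphemes)
print(" / ".join(phrases))
```

Both segmentations cover the same text; they only disagree on where the boundaries fall, which is why either can be acceptable depending on the use case (for example, line breaking versus word selection).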

Also, BudouX has zh-Hans data, but we need zh-Hant too.

I guess it is worth adding this behind feature=budoux, since the CJ dictionary is too big?

@makotokato
Member

As expected, on 128 bytes of UTF-8 text, BudouX is much slower than the dictionary segmenter in ICU4C and ICU4X (480,071 ns/iter vs. 946 ns/iter, roughly 500 times slower).

@makotokato
Member

https://github.com/makotokato/budoux-rs

@sffc
Member Author

sffc commented Jun 22, 2022

Thanks! It looks like there may be low-hanging fruit to increase the performance: makotokato/budoux-rs#1

@makotokato
Member

Also, mozilla/standards-positions#877
