Investigate BudouX and consider using it for CJK+ segmentation #1803
Comments
An existing Rust port: https://github.com/sg0hsmt/budoux-rs
BudouX's segmentation rules seem to differ from those of other segmenters. When a noun is followed by a preposition, particle, etc., it returns them as one combined segment. Using a dictionary (but this cannot cover all words, and the data is too big) or morpheme
Using BudouX (the data size is small even as JSON)
This example is that Also, BudouX has zh-Hans data too, but we need zh-Hant as well. I guess it is worth adding this with
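To make the noun-plus-particle behaviour described above concrete, here is a minimal Python sketch of a BudouX-style boundary scorer. As far as I understand it, BudouX sums learned weights for character n-gram features around each candidate break and breaks where the total is positive. Everything specific below is an assumption made for illustration: the feature names (`prev_char`, `next_char`, `bigram`), the `bias` value, and the weights in `TOY_MODEL` are invented and are not taken from any real BudouX model or API.

```python
# Toy weights, invented for illustration only (not from a real BudouX model).
TOY_MODEL = {
    # Character just before the candidate break: a break often follows a particle.
    "prev_char": {"は": 4},
    # Character just after the candidate break: avoid breaking before these.
    "next_char": {"は": -6, "で": -3, "す": -3, "。": -6},
    # Two characters straddling the candidate break: keep these pairs together.
    "bigram": {"今日": -5, "天気": -5},
}


def segment(text, model, bias=-1):
    """Split text wherever the summed feature weights are positive.

    The negative default bias means "do not break" unless the features say so.
    """
    if not text:
        return []
    chunks = [text[0]]
    for i in range(1, len(text)):
        score = bias
        score += model["prev_char"].get(text[i - 1], 0)
        score += model["next_char"].get(text[i], 0)
        score += model["bigram"].get(text[i - 1:i + 1], 0)
        if score > 0:
            chunks.append(text[i])   # start a new segment at position i
        else:
            chunks[-1] += text[i]    # extend the current segment
    return chunks


print(segment("今日は天気です。", TOY_MODEL))
# -> ['今日は', '天気です。']
```

With these toy weights the noun 今日 stays attached to the particle は, which is the kind of combined segment a dictionary-based word segmenter would usually split further. The whole "model" is a small table of weights, which is why the data can stay small even when stored as JSON.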
Of course, on 128-byte UTF-8 text, it is slower than the dictionary-based segmenter in ICU4C and ICU4X (480,071 ns/iter vs 946 ns/iter).
Thanks! It looks like there may be low-hanging fruit to increase the performance: makotokato/budoux-rs#1
BudouX is a new project out of Google for CJ segmentation with a focus on data size reduction. We should investigate it as an option for ICU4X.
https://github.com/google/budoux
The docs say that it may also be scalable to other languages. I think we should continue with the LSTM approach for Thai/Lao/Khmer/Burmese, but it would be worth investigating what BudouX could bring to the table in that case.
CC @hiroyuki-komatsu @aethanyc @makotokato