Investigate BudouX and consider using it for CJK+ segmentation #1803

sffc · 2022-04-14T06:47:31Z

BudouX is a new project out of Google for CJ segmentation with a focus on data size reduction. We should investigate it as an option for ICU4X.

https://github.com/google/budoux

The docs say that it may also be scalable to other languages. I think we should continue with the LSTM approach for Thai/Lao/Khmer/Burmese, but it would be worth investigating what BudouX could bring to the table in that case.

CC @hiroyuki-komatsu @aethanyc @makotokato

makotokato · 2022-04-14T08:05:09Z

Reference: https://unicode-org.atlassian.net/browse/ICU-21699

echeran · 2022-05-26T17:40:07Z

There is an existing Rust port: https://github.com/sg0hsmt/budoux-rs

makotokato · 2022-06-16T06:51:36Z

BudouX's segmentation rules seem to differ from those of other segmenters. When a noun is combined with a preposition, particle, etc., it returns them together as one segment.

Using a dictionary (which cannot cover all words, and whose data is too big) or morphological analysis:

今日 / は / いい / 天気 / です。

Using BudouX (the data size is small even as JSON):

今日は / いい / 天気です。

In this example, 今日 (today) and は (a particle marking the subject) are two words under a strict word rule, but BudouX treats them as one. What counts as a "word segment" is ambiguous in Japanese, though, so both results are acceptable as Japanese.
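To make the difference in granularity concrete, here is a minimal Python sketch that turns the word-level output above into the phrase-level output by attaching particle-like tokens to the preceding token. This is only a toy illustration under stated assumptions: it is not how BudouX works internally and does not use its API, and both `ATTACHING` and `merge_into_phrases` are hypothetical names made up for this one sentence.

```python
# Toy illustration only: NOT BudouX's algorithm and NOT its API.
# ATTACHING is a hypothetical, hand-written set covering just this example.
ATTACHING = {"は", "です。"}


def merge_into_phrases(tokens):
    """Attach each particle-like token to the token that precedes it."""
    phrases = []
    for token in tokens:
        if phrases and token in ATTACHING:
            # Glue the particle (or copula) onto the previous segment.
            phrases[-1] += token
        else:
            # Start a new segment.
            phrases.append(token)
    return phrases


# Word-level segmentation, as a dictionary or morphological analyzer gives.
word_level = ["今日", "は", "いい", "天気", "です。"]

# Phrase-level segmentation, matching the BudouX result quoted above.
phrase_level = merge_into_phrases(word_level)
print(" / ".join(phrase_level))
```

The point is only that the two outputs differ in where particles end up, not in which characters are kept, so a consumer that needs strict word boundaries cannot recover them from the phrase-level result alone.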

Also, BudouX has zh-Hans data, but we need zh-Hant as well.

I guess it would be worth adding this behind feature=budoux, since the CJ dictionary is too big?

makotokato · 2022-06-16T07:25:51Z

Of course, on 128 bytes of UTF-8 text, BudouX is much slower than the dictionary segmenter in ICU4C and ICU4X (480,071 ns/iter vs. 946 ns/iter).

makotokato · 2022-06-22T00:21:26Z

https://github.com/makotokato/budoux-rs

sffc · 2022-06-22T00:55:37Z

Thanks! It looks like there may be low-hanging fruit to increase the performance: makotokato/budoux-rs#1

makotokato · 2024-09-17T09:10:18Z

Also, mozilla/standards-positions#877

sffc added the "question" (Unresolved questions; type unclear) and "C-segmentation" (Component: Segmentation) labels Apr 14, 2022

nciric mentioned this issue Apr 19, 2022

Transliteration/Segmentation bindings for external implementations #1809

Open

sffc added this to the ICU4X 1.1 milestone May 26, 2022

sffc added the "help wanted" (Issue needs an assignee) label May 26, 2022
