Add KiwiTokenizer for Korean text tokenization #70

e7217 · 2025-02-20T08:06:05Z

- Implement KiwiTokenizer in sparse/bm25/tokenizers.py
- Add import_kiwi() utility function in utils/__init__.py
- Supports tokenization of Korean text using the kiwipiepy library
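
As a rough illustration of the shape described above, the sketch below pairs a lazy-import helper with a thin tokenizer wrapper. It is not the code from this PR: `import_optional` and `KiwiTokenizerSketch` are hypothetical names, and the only kiwipiepy behavior assumed is that `Kiwi().tokenize(text)` returns tokens exposing a `form` attribute.

```python
import importlib


def import_optional(module_name, pip_name=None):
    """Import a module on demand and give an actionable error if it is missing.

    Hypothetical stand-in for a helper such as import_kiwi().
    """
    try:
        return importlib.import_module(module_name)
    except ImportError as exc:
        hint = pip_name or module_name
        raise ImportError(
            f"{module_name} is required for this tokenizer; try `pip install {hint}`"
        ) from exc


class KiwiTokenizerSketch:
    """Wraps a Kiwi-like analyzer and returns surface forms as plain strings."""

    def __init__(self, kiwi=None):
        # kiwipiepy is only imported when no analyzer object is injected,
        # so the dependency stays optional for users who never need Korean.
        if kiwi is None:
            kiwi = import_optional("kiwipiepy").Kiwi()
        self._kiwi = kiwi

    def tokenize(self, text):
        # Assumption: each token object has a `form` attribute (its surface text).
        return [token.form for token in self._kiwi.tokenize(text)]
```

With kiwipiepy installed, `KiwiTokenizerSketch().tokenize(text)` would return a list of morpheme strings; injecting an analyzer object keeps the wrapper testable without the dependency.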

e7217 · 2025-03-05T07:06:58Z

@zc277584121 @codingjaguar
I created a PR two weeks ago, but there has been no progress. Could you kindly review the changes and approve it if possible? Thank you!

xiaofan-luan · 2025-03-06T07:32:45Z

Implement KiwiTokenizer in sparse/bm25/tokenizers.py
Add import_kiwi() utility function in utils/__init__.py
Supports tokenization of Korean text using kiwipiepy library

Thanks for the contribution.

@e7217
Did you by chance check whether tantivy supports the kiwi tokenizer?
For the latest Milvus (2.5), this is how we do BM25: https://milvus.io/docs/full_text_search_with_langchain.md#Initialization-with-BM25-Function

We already support multiple tokenizers and are working on lindera, which can be used for Korean. Can kiwi be integrated with tantivy?

e7217 · 2025-03-07T01:29:20Z

@xiaofan-luan
Thank you for your detailed response.
I have created this PR because I would like to use the BM25BuiltInFunction, which was introduced in version 2.5 as mentioned in your comments.

I noticed that the current setup uses konlpy as the tokenizer for Korean. However, after reviewing the evaluation metrics reported for kiwi, which shows better performance and is actively maintained, I considered applying it. Additionally, it has been gaining significant attention in Korea recently.

I tried to update the ko tokenizer in lang.yaml, but as you mentioned, there are already other tokenizers in place, so I think it would be safer to take a more cautious approach.

It would be great to integrate it with tantivy, but after reviewing it, I realize that I need to learn more. If the opportunity arises, I will make an effort to apply it.

If you believe this PR is not suitable, I am happy to hold off on it. I'm not entirely sure if I understood your intention correctly, so please feel free to provide any additional feedback.
Thank you again for your insightful comments.

xiaofan-luan · 2025-03-07T08:23:28Z

@zc277584121

zc277584121 · 2025-03-07T10:16:23Z

Hello @e7217. We do not recommend using milvus.model for tokenization. In fact, both langchain_milvus and pymilvus use Milvus' built-in tokenizer. milvus.model is a user-side tokenizer, and it will not be recommended going forward.

We will later support Korean in milvus' built-in tokenizer, which is supported by https://github.com/lindera/lindera-tantivy

At that time, you can pass in the specified language and corresponding parameters through analyzer_params
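
For illustration only: analyzer_params is a plain dictionary attached to a text field. The sketch below shows the standard-tokenizer form next to an assumed Lindera form. The "lindera" type and the "ko-dic" dictionary name are assumptions about how built-in Korean support would be exposed, not something confirmed in this thread, and the helper name is hypothetical.

```python
# Standard tokenizer: the fallback available before Korean support lands.
STANDARD_ANALYZER_PARAMS = {"tokenizer": "standard"}


def korean_analyzer_params(dict_kind="ko-dic"):
    """Build analyzer_params selecting a Lindera tokenizer (assumed shape)."""
    # Assumption: the built-in tokenizer is chosen by type, and the
    # language-specific dictionary is passed as a parameter next to it.
    return {"tokenizer": {"type": "lindera", "dict_kind": dict_kind}}
```

Either dictionary would be passed as the analyzer_params of the text field that backs full-text search.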

e7217 · 2025-03-10T00:20:08Z

@zc277584121
Thank you for your reply.
I understand that Milvus does not currently support a Korean tokenizer in its built-in functions, but it may be supported in the future. So, if I want to use a Korean tokenizer, I should use the standard tokenizer for now.

Also, is there a guide or any way to contribute to adding a Korean tokenizer to Lindera? I would be happy to try contributing if I have the chance.

zc277584121 · 2025-03-10T08:30:29Z

@e7217 You can refer to and participate in these PRs: milvus-io/milvus#39660 and milvus-io/milvus#40416. We do not recommend using milvus.model for tokenization and BM25 going forward. However, if this is an urgent workaround for you, we can merge this PR. Otherwise, we recommend using or contributing to the Milvus built-in BM25 analyzer.

zc277584121 · 2025-03-10T08:33:29Z

@e7217 By the way, do you specifically want to use kiwi, or just any Korean tokenizer?

e7217 · 2025-03-10T08:49:30Z

@zc277584121

Using a Korean tokenizer is my top priority, and I would like to use kiwi if possible.

According to the information on the page, it seems that languages other than English and Chinese are not supported, so I am looking for a solution.
https://milvus.io/docs/analyzer-overview.md

zc277584121 · 2025-03-10T09:23:37Z

@e7217 Since you want to use kiwi, we can consider merging this PR. Keep in mind, though, that it lives in milvus.model, which is an external BM25 tokenizer, so there will be many inconveniences. To learn how to use it, refer to our old 2.4 documentation: https://milvus.io/docs/v2.4.x/embed-with-bm25.md (this page is no longer in the latest 2.5 documentation). You can test your PR end-to-end based on that documentation.
As for other Korean tokenizers and the built-in tokenizers, you can discuss or participate in them through the PR links above.
Does this work for you?

e7217 · 2025-03-10T10:08:12Z

@zc277584121
Ah, I think I understand now. This repository has nothing to do with the built-in function. From the documentation in version 2.4, it seems that the external tokenizer is integrated into milvus.model as an abstracted embedding object, used as a utility function.

If I were to add kiwi, the only meaningful benefit would be performance improvement and the fact that it is not Java-based.

successfully installed package: konlpy
[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data] Unzipping corpora/stopwords.zip.
Traceback (most recent call last):
  File "/neo-llm-universal/example.py", line 12, in <module>
    tokens = analyzer(corpus[0])
  File "/neo-llm-universal/.venv/lib/python3.10/site-packages/pymilvus/model/sparse/bm25/tokenizers.py", line 187, in __call__
    tokens = self.tokenizer.tokenize(text)
  File "/neo-llm-universal/.venv/lib/python3.10/site-packages/pymilvus/model/sparse/bm25/tokenizers.py", line 168, in tokenize
    return Kkma().nouns(text)
  File "/neo-llm-universal/.venv/lib/python3.10/site-packages/konlpy/tag/_kkma.py", line 44, in __init__
    jvm.init_jvm(jvmpath, max_heap_size)
  File "/neo-llm-universal/.venv/lib/python3.10/site-packages/konlpy/jvm.py", line 55, in init_jvm
    jvmpath = jvmpath or jpype.getDefaultJVMPath()
  File "/neo-llm-universal/.venv/lib/python3.10/site-packages/jpype/_jvmfinder.py", line 70, in getDefaultJVMPath
    return finder.get_jvm_path()
  File "/neo-llm-universal/.venv/lib/python3.10/site-packages/jpype/_jvmfinder.py", line 204, in get_jvm_path
    raise JVMNotFoundException("No JVM shared library file ({0}) "
jpype._jvmfinder.JVMNotFoundException: No JVM shared library file (libjvm.so) found. Try setting up the JAVA_HOME environment variable properly.

While konlpy will be the default, it would be great if you could also allow using the kiwi tokenizer for cases where setting up the Java environment is burdensome.
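
The fallback requested here could be sketched roughly as follows: try each backend in order and keep the first one that initializes. `pick_tokenizer`, `make_kkma`, and `make_kiwi` are hypothetical names for illustration, not part of pymilvus; the Kkma call mirrors the traceback above, and the kiwipiepy tokens are assumed to expose a `form` attribute.

```python
def pick_tokenizer(candidates):
    """Return (name, tokenize_fn) for the first backend whose factory succeeds.

    `candidates` is a sequence of (name, factory) pairs; each factory returns
    a callable mapping a string to a list of token strings.
    """
    errors = []
    for name, factory in candidates:
        try:
            return name, factory()
        except Exception as exc:  # e.g. ImportError or a missing-JVM error
            errors.append(f"{name}: {exc!r}")
    raise RuntimeError("no tokenizer backend could be initialized: " + "; ".join(errors))


def make_kkma():
    # Constructing Kkma starts the JVM, so a missing Java install fails here.
    from konlpy.tag import Kkma
    return Kkma().nouns


def make_kiwi():
    # Pure-Python/C++ backend: no Java environment needed.
    from kiwipiepy import Kiwi
    kiwi = Kiwi()
    return lambda text: [token.form for token in kiwi.tokenize(text)]
```

A call such as `pick_tokenizer([("kkma", make_kkma), ("kiwi", make_kiwi)])` would keep konlpy as the default and only fall back to kiwi when the JVM cannot be found.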

That said, this is slightly different from the direction I initially intended. I hope it will be included in the built-in functions in the future.

Thank you for your kind and detailed response!

zc277584121 · 2025-03-11T02:11:29Z

@e7217 Thank you. I will merge this PR to make kiwi an optional alternative to the default tokenizer.

Add KiwiTokenizer for Korean text tokenization

1564a18

- Implement KiwiTokenizer in sparse/bm25/tokenizers.py
- Add import_kiwi() utility function in utils/__init__.py
- Supports tokenization of Korean text using kiwipiepy library

zc277584121 merged commit 53cba1e into milvus-io:main Mar 11, 2025

e7217 deleted the feat/add-KiwiTokenizer branch March 12, 2025 00:02
