Add KiwiTokenizer for Korean text tokenization #70
e7217 commented on Feb 20, 2025
- Implement KiwiTokenizer in sparse/bm25/tokenizers.py
- Add import_kiwi() utility function in utils/__init__.py
- Supports tokenization of Korean text using kiwipiepy library
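For readers who have not opened the diff, the change described above might look roughly like the sketch below. It is not the code from this PR: the class body, the return value of `import_kiwi()`, and the optional `kiwi` constructor argument are assumptions made for illustration. The only kiwipiepy calls relied on are `Kiwi()` and `Kiwi.tokenize()`, whose result items carry a `.form` attribute.

```python
from typing import Any, List, Optional


def import_kiwi() -> Any:
    """Import kiwipiepy lazily so it stays an optional dependency.

    Returning the Kiwi class is an assumption of this sketch.
    """
    try:
        from kiwipiepy import Kiwi
    except ImportError as exc:
        raise ImportError(
            "kiwipiepy is required for KiwiTokenizer; "
            "install it with `pip install kiwipiepy`."
        ) from exc
    return Kiwi


class KiwiTokenizer:
    """Split Korean text into morpheme surface forms (illustrative only)."""

    def __init__(self, kiwi: Optional[Any] = None):
        # Hypothetical convenience: accept a ready-made analyzer so the
        # class can be exercised without kiwipiepy installed.
        self.kiwi = kiwi if kiwi is not None else import_kiwi()()

    def tokenize(self, text: str) -> List[str]:
        # Kiwi.tokenize() yields Token objects; `.form` is the surface form.
        return [token.form for token in self.kiwi.tokenize(text)]
```

Because the import happens inside `import_kiwi()`, users who never construct a `KiwiTokenizer` do not need kiwipiepy installed.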
@zc277584121 @codingjaguar
@e7217 We already support several tokenizers and are working on Lindera, which can be used for Korean. Can kiwi be integrated with tantivy?
@xiaofan-luan I noticed that the current setup uses konlpy as the tokenizer for Korean. However, after reviewing the evaluation metrics published for kiwi, which show better performance, and seeing that it is actively maintained, I considered applying it. It has also been gaining significant attention in Korea recently. I tried to update the

It would be great to integrate it with tantivy, but after looking into it, I realize that I need to learn more first. If the opportunity arises, I will make an effort to do so. If you believe this PR is not suitable, I am happy to hold off on it. I'm not entirely sure I understood your intention correctly, so please feel free to provide any additional feedback.
Hello @e7217, we do not recommend using milvus.model as the tokenizer. Both langchain_milvus and pymilvus use Milvus' built-in tokenizer. milvus.model is a user-side tokenizer, and it will not be recommended going forward. We will later support Korean in Milvus' built-in tokenizer, backed by https://github.com/lindera/lindera-tantivy. At that point, you will be able to pass the language and its corresponding parameters through analyzer_params.
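As a rough illustration of what passing a language through analyzer_params could look like once Korean support lands, here is a sketch of such a dictionary. Korean was not yet supported by the built-in tokenizer at the time of this comment, so the key names and values (`"type": "lindera"`, `"dict_kind": "ko-dic"`) are assumptions and should be checked against the Milvus analyzer documentation for the version in use.

```python
# Hypothetical analyzer configuration for Korean text.
# Key names and values are assumptions, not confirmed by this thread.
analyzer_params = {
    "tokenizer": {
        "type": "lindera",      # assumed name of the Lindera-backed tokenizer
        "dict_kind": "ko-dic",  # assumed identifier of the Korean dictionary
    }
}
```

A dictionary like this would be supplied as the analyzer_params of a text field in the collection schema, so tokenization runs inside Milvus rather than on the client.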
@zc277584121 Also, is there a guide or some other way to contribute a Korean tokenizer to Lindera? I would be happy to try contributing if I have the chance.
@e7217 milvus-io/milvus#39660 milvus-io/milvus#40416 You can refer to and participate in these PRs. We do not recommend using milvus.model for tokenization and BM25 going forward. But if this is an urgent workaround for you, we can merge this PR. If not, we recommend that you use or contribute to the Milvus built-in BM25 analyzer.
@e7217 btw, do you specifically want to use kiwi, or would any Korean tokenizer do?
Using a Korean tokenizer is my top priority, and I would like to use kiwi if possible. According to the information on the page, it seems that languages other than
@e7217 Well, so that you can use kiwi, we can consider merging this PR. Keep in mind, though, that it lives in milvus.model, which is an external BM25 tokenizer, so there will be many inconveniences. You can refer to our old 2.4 documentation, https://milvus.io/docs/v2.4.x/embed-with-bm25.md, to learn how to use it; that page is no longer part of the latest 2.5 documentation. You can test your PR end-to-end based on it.
@zc277584121 If I were to add kiwi, the only meaningful benefits would be the performance improvement and the fact that it is not Java-based.

successfully installed package: konlpy
[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data] Unzipping corpora/stopwords.zip.
Traceback (most recent call last):
  File "/neo-llm-universal/example.py", line 12, in <module>
    tokens = analyzer(corpus[0])
  File "/neo-llm-universal/.venv/lib/python3.10/site-packages/pymilvus/model/sparse/bm25/tokenizers.py", line 187, in __call__
    tokens = self.tokenizer.tokenize(text)
  File "/neo-llm-universal/.venv/lib/python3.10/site-packages/pymilvus/model/sparse/bm25/tokenizers.py", line 168, in tokenize
    return Kkma().nouns(text)
  File "/neo-llm-universal/.venv/lib/python3.10/site-packages/konlpy/tag/_kkma.py", line 44, in __init__
    jvm.init_jvm(jvmpath, max_heap_size)
  File "/neo-llm-universal/.venv/lib/python3.10/site-packages/konlpy/jvm.py", line 55, in init_jvm
    jvmpath = jvmpath or jpype.getDefaultJVMPath()
  File "/neo-llm-universal/.venv/lib/python3.10/site-packages/jpype/_jvmfinder.py", line 70, in getDefaultJVMPath
    return finder.get_jvm_path()
  File "/neo-llm-universal/.venv/lib/python3.10/site-packages/jpype/_jvmfinder.py", line 204, in get_jvm_path
    raise JVMNotFoundException("No JVM shared library file ({0}) "
jpype._jvmfinder.JVMNotFoundException: No JVM shared library file (libjvm.so) found. Try setting up the JAVA_HOME environment variable properly.

While konlpy will remain the default, it would be great if you could also allow the kiwi tokenizer for cases where setting up a Java environment is burdensome. That said, this is slightly different from the direction I initially intended. I hope Korean support will be included in the built-in functions in the future. Thank you for your kind and detailed response!
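To make the "konlpy by default, kiwi when Java is a burden" idea concrete, a caller could pick the backend with a small check like the one below. This is only a sketch: `jvm_available` and `pick_korean_backend` are hypothetical helpers, not part of pymilvus or this PR, and the check is a heuristic (a set JAVA_HOME or a `java` binary on PATH does not guarantee that jpype can locate libjvm.so).

```python
import os
import shutil
from typing import Callable, Mapping, Optional


def jvm_available(
    env: Optional[Mapping[str, str]] = None,
    which: Callable[[str], Optional[str]] = shutil.which,
) -> bool:
    """Heuristic: treat a JVM as present if JAVA_HOME is set or `java` is on PATH."""
    env = os.environ if env is None else env
    return bool(env.get("JAVA_HOME")) or which("java") is not None


def pick_korean_backend(
    env: Optional[Mapping[str, str]] = None,
    which: Callable[[str], Optional[str]] = shutil.which,
) -> str:
    """Return "konlpy" when a JVM looks available, otherwise "kiwi"."""
    return "konlpy" if jvm_available(env, which) else "kiwi"
```

The `env` and `which` parameters exist only so the choice can be tested without touching the real environment; in normal use both helpers are called with no arguments.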
@e7217 Thank you. I will merge this PR to make kiwi an optional alternative to the default tokenizer.