Description
Hello @ggerganov,
Thanks for the great project. I have been trying to include the Jina embedding models in llama.cpp, as you can see in #6826.
I have managed to get most of the models that Jina offers running (English, Spanish, and German), but I cannot get it working for Chinese.
I have seen that the issue comes from the tokenization part of the model, and I have been digging into the code of llama.cpp as well as that of `tokenizers` from Hugging Face.
I have some questions that I will try to lay out here.
1st. - How do we know which tokenizer needs to be used for each model? For instance, I see that the SPM and BPE tokenizers here seem to work quite similarly, but there are some discrepancies.
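For context on what I mean: on the Hugging Face side, the tokenizer family is declared in the model's `tokenizer.json` (`"model.type"` is e.g. `"BPE"` or `"Unigram"`, the SPM-style one), while llama.cpp has to infer it at conversion time. A minimal sketch of inspecting that file (the JSON excerpt below is hypothetical, just to show the fields):

```python
import json

# Hypothetical excerpt of a tokenizer.json; real files come from the
# model repository on the Hugging Face Hub.
tokenizer_json = """
{
  "model": {"type": "BPE", "vocab": {}, "merges": []},
  "normalizer": {"type": "Lowercase"},
  "pre_tokenizer": {"type": "ByteLevel"}
}
"""

config = json.loads(tokenizer_json)
# "BPE" vs "Unigram" (SPM-style) is the key distinction for llama.cpp.
print(config["model"]["type"])
# The normalizer / pre_tokenizer sections are where the discrepancies
# between otherwise similar tokenizers tend to hide.
print(config.get("normalizer"))
print(config.get("pre_tokenizer"))
```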
2nd. - I have seen that the difference in the Chinese model's output compared to `transformers` comes from the fact that the model uses some Normalizers and PreTokenizers that are very hard to configure in llama.cpp.
I wonder if some refactoring of the tokenizer would be needed to decouple the tokenizing logic from the surrounding normalization code, plus some options for a richer mapping between the tokenizer options in `transformers` and those in llama.cpp.
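To make the decoupling idea concrete, here is a rough stdlib-only sketch of the two stages as `tokenizers` models them: a Normalizer that rewrites the text, then a PreTokenizer that splits it before the actual BPE/Unigram model ever runs. The specific NFKC + lowercase + CJK-splitting choices below are illustrative assumptions, loosely in the spirit of BERT-style Chinese configurations, not the actual pipeline of the Jina model:

```python
import unicodedata

def normalize(text: str) -> str:
    # Stand-in for a Normalizer chain: NFKC unicode normalization
    # followed by lowercasing (illustrative choice, not Jina's config).
    return unicodedata.normalize("NFKC", text).lower()

def pre_tokenize(text: str) -> list[str]:
    # Stand-in for a PreTokenizer: split on whitespace and isolate
    # each CJK character, as BertPreTokenizer-like setups do.
    pieces = []
    for word in text.split():
        buf = ""
        for ch in word:
            if "\u4e00" <= ch <= "\u9fff":  # CJK Unified Ideographs block
                if buf:
                    pieces.append(buf)
                    buf = ""
                pieces.append(ch)
            else:
                buf += ch
        if buf:
            pieces.append(buf)
    return pieces

# The subword model (BPE/SPM) would only see the output of these stages;
# if llama.cpp skips or approximates them, the token ids diverge.
print(pre_tokenize(normalize("Héllo 你好 World")))  # → ['héllo', '你', '好', 'world']
```

If these stages lived behind their own interface in llama.cpp, each model could plug in its own normalization without touching the core tokenizer.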
I am not sure if my observations here make any sense, or if I am just misusing the project or misunderstanding some of the concepts.
Thank you for the great work, and I am happy to help.