Description
Hello @ggerganov,
Thanks for the great project. I have been trying to include the Jina embedding models in llama.cpp, as you can see in #6826.
I have managed to get most of the models that Jina offers running (English, Spanish, and German), but I cannot get it working for Chinese.
I have seen that the issue comes from the tokenization part of the model, and I have been digging into the code of llama.cpp as well as that of `tokenizers` from Hugging Face.
I have some questions that I will try to lay out here.
1st. - How do we know which tokenizer needs to be used for each model? For instance, I see that the SPM and BPE tokenizers here seem to work quite similarly, but there are some discrepancies.
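For context on what I mean: on the Hugging Face side, the tokenizer family is declared in the model's `tokenizer.json` (`"model.type"` is e.g. `"BPE"` or `"Unigram"`, the SPM-style one), while llama.cpp has to infer it at conversion time. A minimal sketch of inspecting that file (the JSON excerpt below is hypothetical, just to show the fields):

```python
import json

# Hypothetical excerpt of a tokenizer.json; real files come from the
# model repository on the Hugging Face Hub.
tokenizer_json = """
{
  "model": {"type": "BPE", "vocab": {}, "merges": []},
  "normalizer": {"type": "Lowercase"},
  "pre_tokenizer": {"type": "ByteLevel"}
}
"""

config = json.loads(tokenizer_json)
# "BPE" vs "Unigram" (SPM-style) is the key distinction for llama.cpp.
print(config["model"]["type"])
# The normalizer / pre_tokenizer sections are where the discrepancies
# between otherwise similar tokenizers tend to hide.
print(config.get("normalizer"))
print(config.get("pre_tokenizer"))
```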
2nd. - I have seen that the difference in the Chinese model's output compared to `transformers` comes from the fact that the model uses some Normalizers and PreTokenizers that are very hard to configure in llama.cpp.
I wonder if some refactoring of the tokenizer would be needed to decouple the tokenizing logic from the surrounding normalization code, plus some options for a richer mapping between the tokenizer options in `transformers` and those in llama.cpp.
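To make the decoupling idea concrete, here is a rough stdlib-only sketch of the two stages as `tokenizers` models them: a Normalizer that rewrites the text, then a PreTokenizer that splits it before the actual BPE/Unigram model ever runs. The specific NFKC + lowercase + CJK-splitting choices below are illustrative assumptions, loosely in the spirit of BERT-style Chinese configurations, not the actual pipeline of the Jina model:

```python
import unicodedata

def normalize(text: str) -> str:
    # Stand-in for a Normalizer chain: NFKC unicode normalization
    # followed by lowercasing (illustrative choice, not Jina's config).
    return unicodedata.normalize("NFKC", text).lower()

def pre_tokenize(text: str) -> list[str]:
    # Stand-in for a PreTokenizer: split on whitespace and isolate
    # each CJK character, as BertPreTokenizer-like setups do.
    pieces = []
    for word in text.split():
        buf = ""
        for ch in word:
            if "\u4e00" <= ch <= "\u9fff":  # CJK Unified Ideographs block
                if buf:
                    pieces.append(buf)
                    buf = ""
                pieces.append(ch)
            else:
                buf += ch
        if buf:
            pieces.append(buf)
    return pieces

# The subword model (BPE/SPM) would only see the output of these stages;
# if llama.cpp skips or approximates them, the token ids diverge.
print(pre_tokenize(normalize("Héllo 你好 World")))  # → ['héllo', '你', '好', 'world']
```

If these stages lived behind their own interface in llama.cpp, each model could plug in its own normalization without touching the core tokenizer.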
I am not sure if my observations here make any sense, or if I am just misusing the project or misunderstanding some of the concepts.
Thank you for the great work, and I am happy to help.