model : jina-embeddings-v3 support #13693

Draft: wants to merge 6 commits into master

Conversation

@CISC (Collaborator) commented May 21, 2025

WIP support for jina-embeddings-v3

Work checklist

  • Model conversion
  • Model inference
  • Implement correct vocab
  • Task LoRAs conversion
  • LoRA task selection
  • LoRA prompt prefix depending on task

Fixes #12327
Fixes #9585

github-actions bot added the python (python script changes) label May 21, 2025
@CISC (Collaborator, Author) commented May 22, 2025

Apart from a few minor differences (unsure why) in the tokenizer test, the tokenizer.json parsing seems to work perfectly:

  1135 '`', 164721 '``', 164721 '``', 164721 '``'

vs

164721 '``', 164721 '``', 164721 '``',   1135 '`'

and

   32 '?',  85908 '?????'

vs

85908 '?????',     32 '?'
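For anyone reproducing this, a quick way to dump the reference tokenization from the upstream tokenizer for comparison with llama.cpp's UGM output; the two inputs below are stand-ins reconstructed from the token dumps above (a run of 7 backticks and 6 question marks), not the exact strings from the tokenizer test set:

```python
from transformers import AutoTokenizer

# Reference tokenizer from the Hugging Face hub
tok = AutoTokenizer.from_pretrained("jinaai/jina-embeddings-v3")

for text in ["`" * 7, "?" * 6]:  # illustrative inputs, see note above
    ids = tok.encode(text, add_special_tokens=False)
    print([(i, tok.convert_ids_to_tokens(i)) for i in ids])
```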

@CISC (Collaborator, Author) commented May 23, 2025

@ngxson @slaren When you have the time I would appreciate some feedback on how best to tackle the task LoRAs of this model.

I think the best user experience would probably be to keep them embedded and add extra metadata for their names so they can be easily chosen via an option. However, that increases the scope of this PR quite a bit, as new mechanisms would need to be added to load and apply the right LoRA tensors at runtime. This seems a little excessive for just one model, but maybe it can be useful for others as well, I don't know?

The less intrusive route would be to extract each LoRA into its own separate GGUF (albeit a more complicated conversion process) and make the user responsible for applying the correct one (and using the correct prompt), but that seems like a fairly bad UX.

The PR as-is now works great and produces embeddings identical to the original using transformers with no task specified, but the main selling point of the model is using the tasks, so I think it's important to get this right.
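For reference, a minimal sketch (using the gguf Python package) of what the "embedded adapters + metadata" route could look like on the conversion side; the metadata key and the arch string are invented for illustration and are not an agreed-upon convention, and the task list is taken from the upstream model card:

```python
import gguf

# Task names as documented for jina-embeddings-v3 upstream
TASK_NAMES = ["retrieval.query", "retrieval.passage", "separation",
              "classification", "text-matching"]

writer = gguf.GGUFWriter("jina-embeddings-v3.gguf", arch="jina-embeddings-v3")
writer.add_array("adapter.lora.task_names", TASK_NAMES)  # hypothetical key
# ... the usual model metadata and tensors (including the stacked LoRA tensors) go here ...
writer.write_header_to_file()
writer.write_kv_data_to_file()
writer.write_tensors_to_file()
writer.close()
```

A matching array of per-task prompt prefixes could be stored the same way.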

@ngxson (Collaborator) commented May 23, 2025

I was thinking about supporting built-in lora lately, as this is required for phi-4-multimodal. We can extend the current lora API to support this case, but eventually end users need a way to select this (for example via llama-server). For multimodal, it can be done easily via libmtmd.

Another approach could be to add an enum of pre-defined lora types, and user code can switch it at runtime. This is based on an earlier suggestion from @ggerganov about having multiple models in the same gguf.

If I have time this weekend, I can push a draft PR on how this can be done.

@ggerganov (Member) commented May 23, 2025

Why can't we use the existing LoRA mechanism that is supported by llama-server?

Btw, did you resolve the tokenization differences?

@CISC (Collaborator, Author) commented May 23, 2025

> Btw, did you resolve the tokenization differences?

No, it seems like a bug/difference in the UGM tokenizer...

@ngxson (Collaborator) commented May 23, 2025

The lora API on the server is quite low-level; also, downstream apps will have to explicitly set the lora according to the use case, which may not be a good UX overall, especially when the lora provides commonly known tasks like embeddings or reranking.

For tokenization, I think built-in lora is not affected by this, as it uses the same tokenizer as the base model.

@ngxson (Collaborator) commented May 23, 2025

Another option could be to consider it as an extension to the embedding pooling selection

@ggerganov (Member) commented:

> Why can't we use the existing LoRA mechanism that is supported by llama-server?

Ok I understand - the adapters are embedded inside the GGUF file together with the model and we don't have a mechanism to load them.

@ggerganov (Member) commented:

> The less intrusive route would be to extract each LoRA into its own separate GGUF (albeit a more complicated conversion process) and make the user responsible for applying the correct one (and using the correct prompt), but that seems like a fairly bad UX.

Even if the adapters were embedded, the user still has to use the correct prompt. So the UX seems to be the same regardless of how the LoRAs are stored?

@CISC (Collaborator, Author) commented May 23, 2025

> Even if the adapters were embedded, the user still has to use the correct prompt. So the UX seems to be the same regardless of how the LoRAs are stored?

The thinking was that the prompt could be prefixed depending on task selection (easily stored as metadata).
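A small sketch of that idea: if the per-task prompt prefix is stored next to the task name, the caller only selects a task and the prefix is applied automatically. The prefix strings below are placeholders; the real ones would come from the upstream jina-embeddings-v3 configuration:

```python
# Stand-in for prefixes that would be read from model metadata
TASK_PREFIXES = {
    "retrieval.query": "<query prefix from the model config> ",
    "retrieval.passage": "<passage prefix from the model config> ",
    "separation": "",
    "classification": "",
    "text-matching": "",
}

def build_prompt(text: str, task: str) -> str:
    """Prepend the task-specific prefix (empty for tasks without one)."""
    return TASK_PREFIXES.get(task, "") + text

print(build_prompt("what is gguf?", "retrieval.query"))
```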

@CISC (Collaborator, Author) commented May 23, 2025

> > Btw, did you resolve the tokenization differences?
>
> No, it seems like a bug/difference in the UGM tokenizer...

@ggerganov I can confirm that it's an issue with the UGM tokenizer; the same thing happens with nomic-embed-text-v2-moe, for example.

@ngxson (Collaborator) commented May 23, 2025

I looked deeper into the jina model. It is a bit confusing to me though:

  1. There is only one single lora, not one lora per task as I initially thought
  2. It's unclear in which use case we don't want to use LoRA.

If the non-LoRA use case is not practical, maybe it's simpler to just merge the LoRA into the weights.
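For context, "merging the LoRA into the weight" means folding the low-rank update into the base matrix offline, so no adapter handling is needed at runtime. A generic sketch with made-up shapes and names:

```python
import numpy as np

rank, n_in, n_out = 4, 1024, 1024
W = np.random.rand(n_out, n_in)   # base weight (stand-in)
A = np.random.rand(rank, n_in)    # lora_A
B = np.random.rand(n_out, rank)   # lora_B
alpha = 4.0                       # lora_alpha from the adapter config

W_merged = W + (alpha / rank) * (B @ A)  # fold the update into the base weight
```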

@CISC (Collaborator, Author) commented May 23, 2025

> 1. There is only one single lora, not one lora per task as I initially thought

No, there are 5 LoRAs, but they are all in the same (3D) tensor.

> 2. It's unclear in which use case we don't want to use LoRA.

Not sure, just for reference I guess?

@CISC (Collaborator, Author) commented May 23, 2025

See here for how the correct task LoRA is loaded in transformers:
https://huggingface.co/jinaai/jina-embeddings-v3/blob/main/custom_st.py#L130-L141
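Paraphrasing the linked code: the task name is mapped to an adapter index and passed along with the batch, so the forward pass picks the LoRA slice for that task. Roughly (names abbreviated, task-to-index order assumed from the model config, not a verbatim copy):

```python
import torch

ADAPTATION_MAP = {  # task -> adapter index (order assumed)
    "retrieval.query": 0,
    "retrieval.passage": 1,
    "separation": 2,
    "classification": 3,
    "text-matching": 4,
}

def forward_with_task(model, features: dict, task: str):
    task_id = ADAPTATION_MAP[task]
    num_examples = features["input_ids"].shape[0]
    # one adapter id per example; the custom forward uses it to select the LoRA slice
    features["adapter_mask"] = torch.full(
        (num_examples,), task_id, dtype=torch.int32,
        device=features["input_ids"].device,
    )
    return model(**features)
```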

@ngxson (Collaborator) commented May 23, 2025

Ok I see, haven't looked at the tensor shapes. So if I understand correctly, it seems like the first 2 tasks retrieval.query and retrieval.passage are merged into 1 adapter, and the 2 tasks are switched using prompt. That's why we have 5 tasks but only 4 adapters.

@CISC (Collaborator, Author) commented May 23, 2025

> Ok I see, haven't looked at the tensor shapes. So if I understand correctly, it seems like the first 2 tasks retrieval.query and retrieval.passage are merged into 1 adapter, and the 2 tasks are switched using prompt. That's why we have 5 tasks but only 4 adapters.

No, there are 5 adapters, the tensors are shaped like this: [tasks (5), rank (4), N]
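Given that layout, extracting one task's adapter (e.g. for the separate-GGUF route discussed earlier) is just a slice along the first dimension. An illustrative sketch with dummy data and hypothetical names:

```python
import numpy as np

tasks, rank, n = 5, 4, 1024
lora_a = np.random.rand(tasks, rank, n)  # stand-in for a stacked lora_A tensor
lora_b = np.random.rand(tasks, n, rank)  # stand-in for the matching lora_B

task_id = 3                   # index of the selected task
a_task = lora_a[task_id]      # (rank, N)
b_task = lora_b[task_id]      # (N, rank)
delta_w = b_task @ a_task     # low-rank update for that task (scaling omitted)
```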

Labels: python (python script changes)

Successfully merging this pull request may close these issues.

  • Feature Request: add jina embeddings model availible convert to gguf
  • Feature Request: Support Jina V3 arch