Inconsistent Bert Embedding output from embedding.cpp vs llama.cpp server #5801
Comments
Maybe related to #5796
I think so; hopefully it will be fixed by that.
#5796 did NOT fix this issue.
Can you check the cosine distance between the vectors produced by embedding.cpp and server.cpp? Also, maybe try without GPU offloading?
I tried without GPU offloading and got the same output. As for the cosine distance, I calculated it between the vector from the embedding.cpp output and the one from server.cpp:
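For reference, a minimal Python sketch of that cosine check (the vectors are placeholders to be filled in from the two outputs):

```python
import numpy as np

def cosine_similarity(a, b):
    """Cosine similarity between two embedding vectors (1.0 means same direction)."""
    a = np.asarray(a, dtype=np.float64)
    b = np.asarray(b, dtype=np.float64)
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Paste the actual vectors printed by embedding.cpp and by the server here:
# emb_cli, emb_server = [...], [...]
# print(cosine_similarity(emb_cli, emb_server))
```

Note that cosine similarity ignores magnitude, so it shows whether the two outputs differ only by normalization or actually point in different directions.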
There's currently a refactoring of the server code in progress; maybe this will be fixed by it: #5882
It looks like this is actually a tokenization issue: the tokenizer output I'm seeing is wrong. Second, it looks like something is up with the special token handling. Edit: if you force it to add a BOS token and turn off special token processing, the tokenization comes out correct. And in that case the embedding numbers are correct too, though they're not normalized, so they won't look the same as the output from sentence_transformers.
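One way to check this is to compare the server's /tokenize endpoint against the original Hugging Face tokenizer. A sketch, assuming the server is running on port 8019 as in the repro steps; the test sentence is a hypothetical stand-in:

```python
import requests
from transformers import AutoTokenizer

TEXT = "hello world"  # hypothetical test sentence

# Reference tokenization from the original HF model (adds [CLS] ... [SEP]).
hf = AutoTokenizer.from_pretrained("sentence-transformers/all-MiniLM-L6-v2")
print("hf    :", hf.encode(TEXT))

# Tokenization as performed by the running llama.cpp server.
r = requests.post("http://localhost:8019/tokenize", json={"content": TEXT})
print("server:", r.json()["tokens"])
```

If the server's list is missing the leading [CLS]/BOS token, that matches the symptom described above.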
Yes, the tokenization is done here: llama.cpp/examples/server/server.cpp, lines 471 to 477 at commit e04e04f.
And this seems to tokenize incorrectly. Not sure if this is somehow a problem with the vocab or if we simply need to turn off special token processing. We should fix this and the normalization after we merge #5882.
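Until that fix lands, the raw server output can be unit-normalized client-side for comparison; a minimal sketch:

```python
import numpy as np

def l2_normalize(v):
    """Rescale an embedding to unit length, which is what sentence_transformers outputs."""
    v = np.asarray(v, dtype=np.float64)
    norm = np.linalg.norm(v)
    return v / norm if norm > 0 else v
```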
Trying to figure out what's up with the tokenization.
As I posted above, the embedding I got from embedding.cpp matches the sentence_transformers output.
This issue was closed because it has been inactive for 14 days since being marked as stale. |
Please include information about your system, the steps to reproduce the bug, and the version of llama.cpp that you are using. If possible, please provide a minimal code example that reproduces the bug.
System: Mac M2 Max, OS version Sonoma 14.2.1
llama.cpp version: latest main branch as of Feb 29, 2024
Steps to reproduce:
Convert the model:
python convert-hf-to-gguf.py --outfile minilm.gguf --outtype f16 all-MiniLM-L6-v2
Output:
Start the server in embedding mode:
./server -ngl 99 -m minilm.gguf --port 8019 --host 0.0.0.0 --embedding
Output:
Expected Behavior: the embedding from these two approaches should be the same.
Actual Behavior: the output embedding looks completely different from the one from step 3; not only are the values different, the scales differ too.
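For anyone reproducing the comparison, a sketch of the server request, assuming the /embedding endpoint with a {"content": ...} payload and a hypothetical test sentence:

```python
import requests

# Request an embedding from the server started in the step above.
resp = requests.post(
    "http://localhost:8019/embedding",
    json={"content": "hello world"},  # hypothetical test sentence
)
resp.raise_for_status()
print(resp.json()["embedding"][:8])  # first few components of the vector
```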
=============================================================
And by the way, the embedding output I get from step 3 is almost the same as the one I get from the sentence_transformers Python library.
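A minimal sketch of that reference computation (the test sentence is a hypothetical stand-in for the original example):

```python
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("sentence-transformers/all-MiniLM-L6-v2")
emb = model.encode("hello world")  # hypothetical test sentence
print(emb[:8])  # this model's pipeline L2-normalizes its output
```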
This indicates that the model conversion works correctly.
I think there's something wrong with the BERT embedding path in server mode.