Add support for XLMRoberta embedding models #8658
Conversation
Sounds good, will take a look and get back to you!
Yeah, I definitely misconfigured the tokenizer in the conversion: I was largely following along with the BERT implementation without realizing the tokenizer had changed. The T5 Unigram tokenizer fixed the tokenizer bug I was previously dealing with. This might be a simple question, but I keep seeing Unigram, BPE, SPM, and WPM everywhere, and I was under the impression that Unigram and BPE were the underlying algorithms that SPM and WPM use. What exactly are the distinctions here?

For the position encodings, I added the handling to the llama.cpp inference code under the llama_set_inputs function. Would this still be needed if modify_tensors in the conversion script accounts for it? I did a preliminary test with this new conversion script on my local version but couldn't get the embeddings out of llama-embeddings to match up with the HF implementation. How did it fare with the bge-m3 model?
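As a quick aside (not from this thread): the tokenizer.json of an HF "fast" tokenizer records which algorithm it uses, so one way to check a specific model is to read that field. The snippet below is just a sketch and assumes the BAAI/bge-m3 repo ships a tokenizer.json.

```python
# Quick check of which tokenizer algorithm an HF model uses. SentencePiece (SPM)
# is a library/file format that can contain either a Unigram or a BPE model,
# while WPM (WordPiece) is the separate algorithm used by classic BERT.
import json
from huggingface_hub import hf_hub_download

path = hf_hub_download("BAAI/bge-m3", "tokenizer.json")
with open(path, encoding="utf-8") as f:
    tok = json.load(f)
print(tok["model"]["type"])  # prints "Unigram" for XLMRoberta-family tokenizers
```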
convert_hf_to_gguf.py
Outdated
```python
# realign tokens (see HF tokenizer code)
tokens = [b'<s>', b'<pad>', b'</s>', b'<unk>'] + tokens[3:-1]
scores = [0.0, -10000.0, 0.0, -10000.0] + scores[3:-1]
toktypes = [
    SentencePieceTokenTypes.CONTROL,
    SentencePieceTokenTypes.CONTROL,
    SentencePieceTokenTypes.CONTROL,
    SentencePieceTokenTypes.UNKNOWN,
] + toktypes[3:-1]
```
Could you link the specific "HF tokenizer code"? All I could find was https://github.com/huggingface/tokenizers/blob/4ea2f235b0430f5db09f867b65306d6c0a5ec7ed/bindings/python/scripts/convert.py#L227-L236 which seems to use a score of 0.0 for all 4 of these tokens.
The order switch and shift come from the comment here: https://github.com/huggingface/transformers/blob/c85510f958e6955d88ea1bafb4f320074bfbd0c1/src/transformers/models/xlm_roberta/tokenization_xlm_roberta.py#L126.
And actually, I'm not sure where those scores come from. I think the scores in the original model are 0.0 there, so maybe that's the best way to go.
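For reference, here is a small sketch of the alignment that comment describes (illustrative only, not the conversion code): the fairseq vocab pins the first four ids and shifts every real SentencePiece piece up by one.

```python
# How the HF XLMRobertaTokenizer realigns the SentencePiece vocab, per the
# comment linked above. '<pad>' has no SentencePiece counterpart, so it is
# pinned at fairseq id 1 and everything real shifts up by fairseq_offset.
fairseq_tokens_to_ids = {"<s>": 0, "<pad>": 1, "</s>": 2, "<unk>": 3}
fairseq_offset = 1  # first real piece sits at spm id 3 but fairseq id 4

def spm_id_to_hf_id(spm_id: int) -> int:
    """Map a raw SentencePiece id to the id the HF tokenizer exposes."""
    if spm_id == 0:   # spm '<unk>'
        return fairseq_tokens_to_ids["<unk>"]
    if spm_id == 1:   # spm '<s>'
        return fairseq_tokens_to_ids["<s>"]
    if spm_id == 2:   # spm '</s>'
        return fairseq_tokens_to_ids["</s>"]
    return spm_id + fairseq_offset  # ordinary pieces shift up by one
```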
@Oliver-Y Yeah, it's super confusing. They're all kind of similar, and some of the differences are only relevant at training time anyway. It looks like they really hashed it out in this Reddit thread: https://www.reddit.com/r/MachineLearning/comments/rprmq3/d_sentencepiece_wordpiece_bpe_which_tokenizer_is/

If you modify the tensors at conversion time, there's no need to change llama_set_inputs. I'm getting the numbers to match up between HF and here for both models.
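For anyone wanting to reproduce that kind of comparison, here's a rough sketch; the GGUF path is made up, it uses llama-cpp-python's embedding mode rather than the llama-embedding binary, and it uses CLS pooling on the HF side, which is what bge-m3 uses for dense retrieval.

```python
# Compare an embedding from transformers against one from a converted GGUF.
import numpy as np
from transformers import AutoModel, AutoTokenizer
from llama_cpp import Llama

text = "What is the capital of France?"

# Reference embedding from transformers (CLS pooling)
tok = AutoTokenizer.from_pretrained("BAAI/bge-m3")
model = AutoModel.from_pretrained("BAAI/bge-m3")
inputs = tok(text, return_tensors="pt")
hf_emb = model(**inputs).last_hidden_state[0, 0].detach().numpy()

# Embedding from the converted GGUF (path is hypothetical)
llm = Llama(model_path="bge-m3-f16.gguf", embedding=True)
gg_emb = np.array(llm.embed(text))

cos = np.dot(hf_emb, gg_emb) / (np.linalg.norm(hf_emb) * np.linalg.norm(gg_emb))
print(f"cosine similarity: {cos:.4f}")
```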
Probably for another PR, but performance right now is not ideal due to tokenization speed. Specifically, recomputing
What is the performance when comparing Python transformers to a llama.cpp run?
@ExtReMLapin Doing 128 sequences with an average length of 75 tokens, I'm getting 180ms with
Thanks for the answer. Unless there is a typo somewhere, I expected it to be faster on llama.cpp.
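Purely as an illustration of how such a timing comparison could be set up (the text lengths, batch handling, and model paths here are invented, not taken from the measurement above):

```python
# Rough timing harness: embed 128 short texts with transformers and with a
# converted GGUF via llama-cpp-python, and report wall-clock time for each.
import time
import torch
from transformers import AutoModel, AutoTokenizer
from llama_cpp import Llama

texts = ["some example sentence " * 15] * 128  # roughly 75 tokens each

tok = AutoTokenizer.from_pretrained("BAAI/bge-m3")
model = AutoModel.from_pretrained("BAAI/bge-m3")

t0 = time.perf_counter()
with torch.no_grad():
    batch = tok(texts, padding=True, return_tensors="pt")
    _ = model(**batch).last_hidden_state[:, 0]  # CLS pooling
t_hf = time.perf_counter() - t0

llm = Llama(model_path="bge-m3-f16.gguf", embedding=True)  # hypothetical path
t0 = time.perf_counter()
_ = [llm.embed(t) for t in texts]
t_gg = time.perf_counter() - t0

print(f"transformers: {t_hf * 1000:.0f} ms, llama.cpp: {t_gg * 1000:.0f} ms")
```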
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Great!
@iamlemec I agree, this is definitely one of the bottlenecks that should be eliminated. Another is the naive trie implementation, which should at some point be replaced with something more efficient.
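Not the llama.cpp code, just a toy illustration of the structure being discussed: the Unigram tokenizer needs to enumerate every vocab piece that matches at a given position, which is what the trie (naive or otherwise) is doing there. One common upgrade over a pointer-chasing trie is a flat map keyed by (node id, character).

```python
# Toy flat trie for longest/all-prefix matching over a tokenizer vocabulary.
class FlatTrie:
    def __init__(self) -> None:
        self.children: dict[tuple[int, str], int] = {}  # (node id, char) -> child id
        self.piece_at: dict[int, str] = {}              # node id -> vocab piece ending there
        self.next_id = 1                                # node 0 is the root

    def insert(self, piece: str) -> None:
        node = 0
        for ch in piece:
            key = (node, ch)
            if key not in self.children:
                self.children[key] = self.next_id
                self.next_id += 1
            node = self.children[key]
        self.piece_at[node] = piece

    def prefixes(self, text: str, start: int) -> list[str]:
        """All vocab pieces that match text starting at `start`."""
        out, node = [], 0
        for i in range(start, len(text)):
            node = self.children.get((node, text[i]))
            if node is None:
                break
            if node in self.piece_at:
                out.append(self.piece_at[node])
        return out
```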
Is there anything preventing this from being merged?
* add conversion for bge-m3; small fix in unigram tokenizer
* clean up and simplify XLMRoberta conversion
This adds support for XLMRoberta embedding models, such as BAAI/bge-m3. The inference is done entirely through regular BERT, and tokenization uses the new T5 Unigram work. There is some modification necessary at conversion time: the position embeddings are shifted down by 1+pad_token_id slots.

The position embedding shift has the added bonus that it usually makes the embedding matrix a nice even power of 2 again. It also means we don't introduce added complexity to the inference code due to a pretty unusual design choice.
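A minimal sketch of that shift, assuming the HF XLMRobertaModel tensor layout and bge-m3's pad_token_id of 1 (the real handling lives in convert_hf_to_gguf.py and may differ in detail):

```python
# Drop the unused offset rows from the position embedding table.
from transformers import AutoModel

model = AutoModel.from_pretrained("BAAI/bge-m3")
pos = model.embeddings.position_embeddings.weight  # shape (8194, hidden)

pad_token_id = 1
shifted = pos[1 + pad_token_id:]  # (8192, hidden): back to a power of 2
print(pos.shape, "->", shifted.shape)
```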
Also needed to tweak the Unigram tokenizer very slightly. Now the Unigram tokenize method only reverses the tokens it has just added, rather than the entire output list. This didn't matter before, but it's needed if you're adding a BOS token first.

@Oliver-Y I just saw that you're working on this as well. If you have the time/interest, please take a look!
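To illustrate the tokenize tweak, here is a toy Python version (not the actual C++ code) of why only the newly added tokens should be reversed once a BOS token is already in the output.

```python
# The Unigram backtracking step yields the pieces of the current text in
# reverse order, so only that slice should be flipped, not tokens (like BOS)
# that were already in the output list.
def append_unigram_tokens(output: list[int], backtracked: list[int]) -> None:
    start = len(output)
    output.extend(backtracked)             # pieces arrive in reverse order
    output[start:] = output[start:][::-1]  # reverse only what we just added

out = [0]                                  # BOS already pushed (e.g. '<s>' = 0)
append_unigram_tokens(out, [40, 30, 20, 10])
print(out)  # [0, 10, 20, 30, 40]; reversing the whole list would move BOS
```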