Add support for XLMRoberta embedding models #8658

Merged 2 commits on Aug 6, 2024

Conversation

iamlemec
Collaborator

This adds support for XLMRoberta embedding models, such as BAAI/bge-m3. The inference is done entirely through regular BERT, and tokenization uses the new T5 Unigram work. There is some modification necessary at conversion time:

  • Slightly reshuffle the special token order (as is done in HF)
  • Shift the position embedding matrix down by 1+pad_token_id slots

The position embedding shift has the added bonus that it usually makes the position embedding matrix a nice even power of 2 again. It also means we avoid adding complexity to the inference code just to handle a pretty unusual design choice.
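For illustration, the shift amounts to dropping the first `1 + pad_token_id` rows of the position embedding tensor at conversion time. A minimal sketch, assuming PyTorch tensors; the function name is illustrative and not the exact code in convert_hf_to_gguf.py:

```python
import torch

def shift_position_embeddings(pos_emb: torch.Tensor, pad_token_id: int) -> torch.Tensor:
    """Drop the first (1 + pad_token_id) rows of the position embedding matrix.

    XLM-R reserves positions 0..pad_token_id, so real positions start at
    pad_token_id + 1. Removing those rows lets inference index positions from
    zero like plain BERT, and e.g. turns a 514-row matrix into a 512-row one.
    """
    return pos_emb[pad_token_id + 1:, :].contiguous()
```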

Also needed to tweak the Unigram tokenizer very slightly. Now the Unigram tokenize method only reverses the tokens it's just added, rather than the entire output list. This didn't matter before, but it's needed if you're adding a BOS token first.
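Conceptually the tweak looks like this (a Python sketch of the logic only; the actual change is in the C++ Unigram tokenizer, and the names below are illustrative):

```python
def append_backtraced_tokens(output: list[int], backtraced: list[int]) -> None:
    """The Viterbi backtrace yields tokens in reverse order. Reverse only the
    segment appended by this call, not the whole output list, so that a BOS
    token pushed earlier stays at the front."""
    start = len(output)                     # where the new segment begins
    output.extend(backtraced)               # still in reverse (backtrace) order
    output[start:] = output[start:][::-1]   # flip just the new segment
```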

@Oliver-Y I just saw that you're working on this as well. If you have the time/interest, please take a look!

github-actions bot added the python (python script changes) label on Jul 23, 2024
@Oliver-Y

Sounds good, will take a look and get back to you!

@Oliver-Y commented Jul 24, 2024

Yea, I definitely misconfigured the tokenizer in the conversion, since I was largely following along with the BERT implementation without realizing the tokenizer changed. The T5 Unigram tokenizer fixed the tokenizer bug I was previously dealing with. This might be a simple question, but I keep seeing Unigram, BPE, SPM, and WPM everywhere, and I was under the impression that Unigram and BPE were the underlying algorithms that SPM and WPM use. What exactly are the distinctions here?

For the position encodings, I added the offset into the llama.cpp inference code under the llama_set_inputs function. Would that still be needed if modify_tensors in the conversion script accounts for it?

I did a preliminary test with this new conversion script on my local version but couldn't get the embeddings out of llama-embeddings to match up with the HF implementation. How did it fare with the bge-m3 model?

convert_hf_to_gguf.py (outdated, resolved)
Comment on lines 2550 to 2558
```python
# realign tokens (see HF tokenizer code)
tokens = [b'<s>', b'<pad>', b'</s>', b'<unk>'] + tokens[3:-1]
scores = [0.0, -10000.0, 0.0, -10000.0] + scores[3:-1]
toktypes = [
    SentencePieceTokenTypes.CONTROL,
    SentencePieceTokenTypes.CONTROL,
    SentencePieceTokenTypes.CONTROL,
    SentencePieceTokenTypes.UNKNOWN,
] + toktypes[3:-1]
```
Collaborator

Could you link the specific "HF tokenizer code"? All I could find was https://github.com/huggingface/tokenizers/blob/4ea2f235b0430f5db09f867b65306d6c0a5ec7ed/bindings/python/scripts/convert.py#L227-L236 which seems to use a score of 0.0 for all 4 of these tokens.

Collaborator Author

The order switch and shift come from the comment here: https://github.com/huggingface/transformers/blob/c85510f958e6955d88ea1bafb4f320074bfbd0c1/src/transformers/models/xlm_roberta/tokenization_xlm_roberta.py#L126.

And actually, I'm not sure where those scores come from. I think the scores in the original model are 0.0 there, so maybe that's the best way to go.
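For reference, the realignment that HF comment describes works out to roughly the following (a sketch from my reading of the HF tokenizer; the names are illustrative):

```python
# XLM-R / fairseq ids:      <s>=0, <pad>=1, </s>=2, <unk>=3, then regular pieces
# SentencePiece model ids:  <unk>=0, <s>=1, </s>=2, then regular pieces
FAIRSEQ_SPECIALS = {"<s>": 0, "<pad>": 1, "</s>": 2, "<unk>": 3}
FAIRSEQ_OFFSET = 1  # every regular SentencePiece id shifts up by one

def spm_id_to_model_id(piece: str, spm_id: int) -> int:
    """Map a SentencePiece vocab id to the id the XLM-R model actually uses."""
    if piece in FAIRSEQ_SPECIALS:
        return FAIRSEQ_SPECIALS[piece]
    return spm_id + FAIRSEQ_OFFSET
```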

convert_hf_to_gguf.py (outdated, resolved)
@iamlemec
Collaborator Author

@Oliver-Y Yeah, it's super confusing. They're all kind of similar, and some of the differences are only relevant at training time anyway. It looks like they really hashed it out in this Reddit thread: https://www.reddit.com/r/MachineLearning/comments/rprmq3/d_sentencepiece_wordpiece_bpe_which_tokenizer_is/

If you modify the tensors at conversion time, there's no need to change llama_set_inputs since the position IDs actually start from zero, just like usual. It also means you don't need to make a new architecture and can just use plain BERT.

I'm getting the numbers to match up between HF and here for both bge-m3 and multilingual-e5-large. In either case, make sure you're running on the same device and precision, since the numbers can diverge a good bit. Also, make sure you're doing mean pooling and normalizing for multilingual-e5-large.
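If it helps, this is roughly how I'd produce the reference embeddings on the HF side with mean pooling and L2 normalization (an illustrative check only, not part of this PR; the model id assumes the intfloat upload of multilingual-e5-large):

```python
import torch
from transformers import AutoModel, AutoTokenizer

model_id = "intfloat/multilingual-e5-large"
tok = AutoTokenizer.from_pretrained(model_id)
model = AutoModel.from_pretrained(model_id).eval()

batch = tok(["query: hello world"], padding=True, truncation=True, return_tensors="pt")
with torch.no_grad():
    hidden = model(**batch).last_hidden_state          # [batch, tokens, dim]

mask = batch["attention_mask"].unsqueeze(-1).float()    # [batch, tokens, 1]
emb = (hidden * mask).sum(dim=1) / mask.sum(dim=1)      # mean pooling over real tokens
emb = torch.nn.functional.normalize(emb, p=2, dim=1)    # L2 normalize
print(emb[0, :8])                                       # compare against the llama.cpp output
```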

@iamlemec
Collaborator Author

Probably for another PR, but performance right now is not ideal due to tokenization speed. Specifically, recomputing token_matcher and user_defined_token_matcher on each call is quite costly. It might be a good idea to pre-compute these in llama_load_vocab. Does that seem reasonable @fairydreaming?

Oliver-Y mentioned this pull request on Jul 24, 2024
@ExtReMLapin
Contributor

What is the performance when comparing a Python transformers run to a llama.cpp run?

@iamlemec
Collaborator Author

@ExtReMLapin Doing 128 sequences with an average length of 75 tokens, I'm getting 180ms with transformers and 25s with llama.cpp. If you do all the tokens in one big sequence (with llama-tokenize) it only takes 500ms with llama.cpp.

@ExtReMLapin
Contributor

Thanks for the answer. Unless there is a typo somewhere, I expected it to be faster on llama.cpp.

@ggerganov (Owner) left a comment

Great!

ggerganov requested a review from compilade on July 25, 2024 at 06:44
@fairydreaming
Collaborator

> Probably for another PR, but performance right now is not ideal due to tokenization speed. Specifically, recomputing token_matcher and user_defined_token_matcher on each call is quite costly. It might be a good idea to pre-compute these in llama_load_vocab. Does that seem reasonable @fairydreaming?

@iamlemec I agree, this is definitely one of the bottlenecks that should be eliminated. Another is the naive trie implementation, which should at some point be replaced with something more efficient.

mofosyne added the Review Complexity : Low label on Aug 1, 2024
@fairydreaming
Collaborator

Is there anything preventing this from being merged?

ggerganov merged commit cdd1889 into ggerganov:master on Aug 6, 2024 (55 checks passed)
arthw pushed a commit to arthw/llama.cpp that referenced this pull request Aug 7, 2024
* add conversion for bge-m3; small fix in unigram tokenizer

* clean up and simplify XLMRoberta conversion