
🚨🚨🚨 [SPM] Finish fix spm models 🚨🚨🚨 #25224

Merged
ArthurZucker merged 36 commits into huggingface:main on Aug 17, 2023

Conversation

ArthurZucker
Collaborator

@ArthurZucker ArthurZucker commented Aug 1, 2023

What does this PR do?

Modifies Llama and T5; other sentencepiece-based tokenizers will follow.

The previous behaviour is still available with tokenizer = AutoTokenizer.from_pretrained("huggyllama/llama-7b", legacy = True)

The goal of transformers' wrapper around sentencepiece

To clarify, we want to:

  • be able to choose the behaviour of the special/added tokens. This means handling the stripping, encoding and decoding of such tokens (see the short sketch just below this list)
  • allow users to easily add new tokens with tokenizer.add_tokens(...) instead of having to load the protobuf file, modify the vocab, save it and reload the sentencepiece processor.
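
For instance, the stripping behaviour can already be controlled when registering a token (a minimal sketch; "my_new_token" is a made-up token used purely for illustration):

>>> from transformers import AutoTokenizer, AddedToken
>>> tokenizer = AutoTokenizer.from_pretrained("huggyllama/llama-7b", use_fast = False)
>>> # lstrip/rstrip control whether the spaces around the token are stripped when it is matched
>>> tokenizer.add_tokens([AddedToken("my_new_token", lstrip=False, rstrip=False)])
1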

The current and past problems with our wrappers

Let's use both T5 and Llama as reference models. Currently, we do not mimic the behaviour of adding words to the actual sentencepiece vocabulary. This is an issue for anyone who (rightfully) expects that adding tokens does not modify the behaviour of the model.

Adding a word to sentencepiece's vocab

This can be done using: (source)

>>> # wget https://huggingface.co/huggyllama/llama-7b/resolve/main/tokenizer.model
>>> from sentencepiece import sentencepiece_model_pb2 as model
>>> import sentencepiece as spm
>>> sp_model = model.ModelProto()
>>> sp_model.ParseFromString(open('tokenizer.model', 'rb').read())
>>> token = "your_token"
>>> sp_model.pieces.add(piece=f"{token}", score=0.0, type=model.ModelProto.SentencePiece.USER_DEFINED)
>>> with open('new.model', 'wb') as f:
...     f.write(sp_model.SerializeToString())

then load the sp_model:

>>> sp_model = spm.SentencePieceProcessor()
>>> sp_model.Load('new.model')

Then, try the following:

>>> sp_model.encode("your_tokenHello", out_type=str)
["_", "your_token", "Hello"]

Adding a word to a PreTrainedTokenizer

This can be done with tokenizer.add_tokens(["your_token"]), which is indeed a lot simpler.
But the output you get is:

>>> from transformers import AutoTokenizer
>>> tokenizer = AutoTokenizer.from_pretrained("huggyllama/llama-7b", legacy = True, use_fast = False)
>>> tokenizer.add_tokens(["your_token"])
>>> tokenizer.tokenize("your_tokenHello")
["your_token", "_Hello"]

>>> tokenizer = AutoTokenizer.from_pretrained("huggyllama/llama-7b", legacy = False, use_fast = False)
>>> tokenizer.add_tokens(["your_token"])
>>> tokenizer.tokenize("your_tokenHello")
["your_token", "Hello"]

This is because we always split the text on the added tokens and give the text on the left and right to the sentencepiece model. But most sentencepiece models add a prefix space ▁ (the SPIECE_UNDERLINE character). Thus, when the transformers tokenizer splits "your_tokenHello", it encodes your_token with tokenizer.added_tokens_encoder, which does not add a prefix space, and then encodes Hello with the sentencepiece model, which adds a prefix space and thus outputs ▁Hello.
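
To make this mechanism concrete, here is a minimal sketch of the splitting logic described above (a simplification; the function name and the regex-based split are illustrative, not the actual implementation):

import re

def naive_tokenize(text, added_tokens, sp_model):
    # split the text on the added tokens, keeping the added tokens themselves
    pattern = "(" + "|".join(re.escape(t) for t in added_tokens) + ")"
    pieces = [p for p in re.split(pattern, text) if p]
    output = []
    for piece in pieces:
        if piece in added_tokens:
            output.append(piece)  # looked up directly, no prefix space added
        else:
            # sentencepiece adds a prefix space ("▁") to every chunk it receives
            output.extend(sp_model.encode(piece, out_type=str))
    return output

# naive_tokenize("your_tokenHello", ["your_token"], sp_model)
# -> ["your_token", "▁Hello"]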

Other mismatches:

# t5-base tokenizer
>>> tokenizer.encode("<extra_id_0>. Hello", add_special_tokens = False)
[32099, 3, 5, 8774] # ['<extra_id_0>', ' ▁', '.', '▁Hello']
# seqio.SentencePieceVocabulary(vocab_path, extra_ids = 300)
>>> processor.encode("<extra_id_0>. Hello")
[32099, 5, 8774] # ['<extra_id_0>', '.', '▁Hello']

TL;DR: this shows the only way we can actually and properly handle added tokens and sentencepiece. We have to disable the automatic prefix addition, and always prepend a token that is part of the vocab before encoding, so that the first real token is encoded correctly whether it has a prefix space or not. Yes, this is dirty and sad, but the previous fix, which removed the extra space, was cleaner yet had corner cases, see #25176.
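
In other words, the fix boils down to something like the following (a minimal sketch, assuming a SentencePieceProcessor sp_model with the dummy prefix disabled and a known unk_token string; the real logic lives in the slow tokenizers):

def encode_with_sentinel(sp_model, unk_token, text):
    # prepend a token that is part of the vocab so the first real token is
    # encoded the same way whether or not it starts with a prefix space
    unk_len = len(sp_model.encode(unk_token, out_type=str))
    tokens = sp_model.encode(unk_token + text, out_type=str)
    # drop the sentinel pieces and keep only the tokens of the original text
    return tokens[unk_len:]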

The same issue happens with fast tokenizers:

>>> from transformers import AutoTokenizer
>>> tokenizer = AutoTokenizer.from_pretrained("huggyllama/llama-7b", use_fast = True)
>>> tokenizer.add_tokens(["your_token"])
>>> tokenizer.tokenize("your_tokenHello")
["_your_token", "Hello"]

>>> tokenizer.add_tokens(["your_token_special"], True)
>>> tokenizer.tokenize("your_token_specialHello")
['your_token_special', '▁Hello']

Another issue 😈

Here, the issue is that, around special tokens, even when neither rstrip nor lstrip applies (both are set to False), we get very strange behaviours:

>>> tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-2-7b-hf", use_fast = True)

>>> tokenizer.tokenize("<s>inform<s>")
# prefix space is eaten
['<s>', '▁inform', '<s>']

>>> tokenizer.tokenize("<s>inform <s>")
# prefix space is not eaten for the second <s>
['<s>', '▁inform', '▁', '<s>']

>>> tokenizer.tokenize(" <s>inform <s>")
# prefix space is not eaten for the second <s>
['▁▁', '<s>', '▁inform', '▁', '<s>']

>>> tokenizer.tokenize(" <s>inform <s> ")
# prefix space is not eaten for the first <s>, extra space added (known)
['▁▁', '<s>', '▁inform', '▁', '<s>', '▁▁']

>>> tokenizer.tokenize("inform <s> ")
# prefix space is added to inform
['▁inform', '▁', '<s>', '▁▁']

Note that tokenizer.convert_tokens_to_ids("▁▁") = 259 while tokenizer.convert_tokens_to_ids("▁") = 29871.
Also, if we add a prefix space to special tokens at the beginning, we are probably going to break a lot of things.

Collaborator Author

@ArthurZucker ArthurZucker left a comment


Help for reviewers

Collaborator Author


The previous test values were not really good; with this update they make more sense.

@@ -534,15 +555,19 @@ def test_remove_extra_whitespaces(self):
        input_ids = self.tokenizer.encode(" . Hello")
        self.assertEqual(input_ids, [7, 4, 156, 86, 20])
        sp_encode = self.tokenizer.sp_model.encode(" . Hello")
-       self.assertEqual(input_ids, sp_encode)
+       self.assertEqual(input_ids, [7] + sp_encode)
Collaborator Author


Manually add the ▁ (SPIECE_UNDERLINE)

Comment on lines 369 to 371
text = self.unk_token + text
tokens = self.sp_model.encode(text, out_type=str)
return tokens[self.unk_token_length :]
Collaborator Author


That's the hack:

  • all spm models have an unk_token; whether or not its string is part of the sentencepiece vocab does not matter.
  • we need to do this because, with add_dummy_prefix = False, the sentencepiece model always ALWAYS strips any leading SPIECE_UNDERLINE. So sp_model.encode(SPIECE_UNDERLINE + "Hello", out_type=str) will give ["Hel", "lo"] instead of ["▁Hel", "lo"] (see the sketch after this list for where add_dummy_prefix lives).
  • previously, we removed the added extra space. This is okay, but fails for words that should be split, like inform. What happened before was that we would tokenize as ▁inform, then remove ▁, and end up with inform. But the actual tokenization of inform is in, form, and inform is not part of the vocab!
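
For reference, add_dummy_prefix lives in the normalizer spec of the serialized sentencepiece model, and it can be inspected or turned off with the same protobuf approach used earlier (a hedged sketch; the file names here are placeholders):

>>> from sentencepiece import sentencepiece_model_pb2 as model
>>> sp_proto = model.ModelProto()
>>> sp_proto.ParseFromString(open("tokenizer.model", "rb").read())
>>> sp_proto.normalizer_spec.add_dummy_prefix  # usually True for pretrained models
True
>>> # disable the automatic "▁" prefix and write out a new model
>>> sp_proto.normalizer_spec.add_dummy_prefix = False
>>> with open("no_dummy_prefix.model", "wb") as f:
...     f.write(sp_proto.SerializeToString())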

@HuggingFaceDocBuilderDev

HuggingFaceDocBuilderDev commented Aug 1, 2023

The documentation is not available anymore as the PR was closed or merged.

@ArthurZucker ArthurZucker marked this pull request as ready for review August 1, 2023 09:56
Collaborator

@sgugger sgugger left a comment


This makes sense to me. As usual with breaking changes, can you put three 🚨 in the title and show in the description how to enable the past behavior for users who want it?

@ArthurZucker ArthurZucker changed the title from "[SPM] Finish fix spm models" to "🚨🚨🚨 [SPM] Finish fix spm models 🚨🚨🚨" Aug 1, 2023
Collaborator

@amyeroberts amyeroberts left a comment


LGTM! Thanks for working on this v. tricky problem 🤗

ArthurZucker and others added 5 commits August 2, 2023 10:00
Co-authored-by: amyeroberts <22614925+amyeroberts@users.noreply.github.com>
Co-authored-by: amyeroberts <22614925+amyeroberts@users.noreply.github.com>
Co-authored-by: amyeroberts <22614925+amyeroberts@users.noreply.github.com>
@ArthurZucker
Collaborator Author

Will fix the prefixing of special tokens!

@faaany
Contributor

faaany commented Aug 15, 2023

@ArthurZucker any update to this PR?

@ArthurZucker
Collaborator Author

Hey @faaany, I am updating it right now!

@ArthurZucker
Collaborator Author

Reverted the changes, as adding proper support for add_prefix_space is actually questionable. The use case is already wrong, as you should be reverse-looking for ids, not strings. See #24846 (adding a prefix space was almost never done properly, as the decoders were not updated as well).

@ArthurZucker
Collaborator Author

Pinging @sgugger for a final review!

Collaborator

@sgugger sgugger left a comment


LGTM! Would be nice to make the two skipped tests smaller so that they pass.

@ArthurZucker
Collaborator Author

Will do so in a follow up PR!

@ArthurZucker ArthurZucker merged commit b4d5548 into huggingface:main Aug 17, 2023
3 checks passed
@ArthurZucker ArthurZucker deleted the finish-fix-spm-models branch August 18, 2023 10:44
@zhacmsra

I added legacy=True as "enc = AutoTokenizer.from_pretrained(model_path, legacy = True, use_fast=False)" but I still get a "Not a string" error. Can anyone give a hint about what is going on here?

[Screenshot 2023-08-20 21:40:35]

@ArthurZucker
Collaborator Author

@zhacmsra the issue is in loading the vocabulary file, not 100% sure it's related to this. Can you open a new Issue with a reproducer please?

@zhacmsra

Hi Arthur, you are correct. I figured out that it is not related to this PR. A networking problem broke the input model and resulted in bad inputs. Sorry for the trouble, and thank you for the kind and timely response.

blbadger pushed a commit to blbadger/transformers that referenced this pull request Nov 8, 2023
* fix EVERYTHING

* more fixes

* ⚗️⚗️ Tokenizer magic ⚗️⚗️

* wrong value but test passes for the TODO

* update

* updat

* safe protobuf import?

* style

* non gated repo

* update

* fixup

* Update src/transformers/models/llama/tokenization_llama.py

Co-authored-by: amyeroberts <22614925+amyeroberts@users.noreply.github.com>

* Update src/transformers/models/llama/tokenization_llama.py

Co-authored-by: amyeroberts <22614925+amyeroberts@users.noreply.github.com>

* Update tests/models/t5/test_tokenization_t5.py

Co-authored-by: amyeroberts <22614925+amyeroberts@users.noreply.github.com>

* nits

* fix t5 too

* use assert equal

* fix llama decoding

* nits on t5

* fixup

* only remove the prefix space, not other spaces

* more deconding tests and more todos

* fix CI as well

* fixup

* skip failing test on CI (its tf its ok)

* skip test_subword_regularization_tokenizer that is also crashing on the CI for TF

* update llama

* revert good fixes

* fixup

* empty

* explain why we need to encode with an additional token

* better warning?

* nits

---------

Co-authored-by: amyeroberts <22614925+amyeroberts@users.noreply.github.com>