Update added tokens #1335

Merged · 32 commits · Sep 7, 2023

Conversation

@ArthurZucker (Collaborator) commented Sep 1, 2023

What does this PR do?

Fixes #1334 and refactors the AddedVocabulary. Previously, when a token was already part of the vocabulary, it was not added to the added_tokens_map but was still added to added_tokens_map_r, which is inconsistent. Now it is added to both, even if it already existed.

Here is a small snippet of what is now possible in Python:

>>> from tokenizers import AddedToken, Tokenizer
>>> token = AddedToken("HEY")
>>> tokenizer = Tokenizer.from_pretrained("gpt2")
>>> tokenizer.get_added_tokens_decoder()
{50256: AddedToken("<|endoftext|>", rstrip=False, lstrip=False, single_word=False, normalized=True, special=True)}

>>> tokenizer.add_tokens([token])
1
>>> token.special = False
>>> tokenizer.add_tokens([token])
1 
>>> tokenizer.get_vocab_size()
50258
>>> content = tokenizer.decode([4])
>>> content
'%'
>>> tokenizer.add_tokens([AddedToken('%')])
1
>>> tokenizer.get_added_tokens_decoder()
{4: AddedToken("%", rstrip=False, lstrip=False, single_word=False, normalized=True, special=False),
 50256: AddedToken("<|endoftext|>", rstrip=False, lstrip=False, single_word=False, normalized=True, special=True),
 50257: AddedToken("HEY", rstrip=False, lstrip=False, single_word=False, normalized=True, special=False)}
>>> tokenizer.get_vocab_size()
50258
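
To make the bookkeeping concrete, here is a minimal standalone Rust sketch of the invariant described above, under simplified assumptions: the AddedVocabSketch struct, plain Strings instead of AddedToken, and the id-allocation rule are illustrative and not the actual AddedVocabulary implementation; only the added_tokens_map / added_tokens_map_r names come from the PR.

use std::collections::HashMap;

// Simplified stand-ins (assumptions, not the real tokenizers types) for the
// base vocab and the two added-token maps this PR keeps consistent.
struct AddedVocabSketch {
    vocab: HashMap<String, u32>,              // base model vocab: content -> id
    added_tokens_map: HashMap<String, u32>,   // added tokens: content -> id
    added_tokens_map_r: HashMap<u32, String>, // added tokens: id -> content
}

impl AddedVocabSketch {
    fn add_token(&mut self, content: &str) -> u32 {
        // Reuse the existing id if the token is already known, otherwise
        // allocate a fresh id one past the highest id currently in use.
        let id = self
            .vocab
            .get(content)
            .or_else(|| self.added_tokens_map.get(content))
            .copied()
            .unwrap_or_else(|| {
                self.vocab
                    .values()
                    .chain(self.added_tokens_map.values())
                    .copied()
                    .max()
                    .map_or(0, |m| m + 1)
            });
        // The behavior described above: record the token in *both* maps,
        // even when it already existed in the base vocab.
        self.added_tokens_map.insert(content.to_string(), id);
        self.added_tokens_map_r.insert(id, content.to_string());
        id
    }
}

fn main() {
    let mut av = AddedVocabSketch {
        vocab: HashMap::from([("%".to_string(), 4)]),
        added_tokens_map: HashMap::new(),
        added_tokens_map_r: HashMap::new(),
    };
    // "%" already has an id in the base vocab; it now lands in both maps.
    assert_eq!(av.add_token("%"), 4);
    assert_eq!(av.added_tokens_map_r.get(&4).map(String::as_str), Some("%"));
    // A brand-new token gets a fresh id past the existing ones.
    assert_eq!(av.add_token("HEY"), 5);
}

With this invariant, a token such as "%" that already has id 4 in the GPT-2 vocab still shows up in get_added_tokens_decoder(), as in the Python session above.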

@HuggingFaceDocBuilderDev commented Sep 1, 2023

The documentation is not available anymore as the PR was closed or merged.

@ArthurZucker ArthurZucker marked this pull request as ready for review September 5, 2023 17:39
@ArthurZucker ArthurZucker requested a review from Narsil September 5, 2023 17:39
Review thread on the following diff lines:

// TODO ArthurZ THIS IS WRONG! We need to measure the length of the `set` because
// now some tokens can be both in the added_tokens_encoder and in the vocab
if with_added_tokens {
    self.get_vocab(true).len()

Collaborator:
Why not use max(vocab_id) instead?

Collaborator Author (ArthurZucker):
To account for potential holes 😢

Collaborator Author (ArthurZucker):
And if we use max, we have to convert u32 to usize or return usize
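
To illustrate the trade-off discussed in this thread, here is a small standalone Rust sketch (illustrative ids, not the tokenizers code) comparing a set-based count, as in the diff above, with a max(id)-based count when added tokens overlap the base vocab and leave holes in the id space:

use std::collections::HashSet;

fn main() {
    // Hypothetical ids: base vocab 0..=2, plus added tokens at ids 2 and 5.
    let base_ids: Vec<u32> = vec![0, 1, 2];
    let added_ids: Vec<u32> = vec![2, 5]; // id 2 overlaps the base vocab

    // Counting the union set: overlapping ids are not double-counted and the
    // hole at ids 3..=4 does not inflate the result.
    let set: HashSet<u32> = base_ids.iter().chain(added_ids.iter()).copied().collect();
    assert_eq!(set.len(), 4); // {0, 1, 2, 5}

    // max(id) + 1 counts the holes as if they were real tokens, and the u32
    // result still has to be converted to usize to be used as a size.
    let by_max = base_ids
        .iter()
        .chain(added_ids.iter())
        .copied()
        .max()
        .map_or(0, |m| m as usize + 1);
    assert_eq!(by_max, 6);

    println!("set-based size = {}, max-based size = {}", set.len(), by_max);
}

The set-based count is already a usize and is unaffected by gaps in the id space, which is the point made in the replies above.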

@Narsil (Collaborator) left a comment:

LGTM.

As said internally, I didn't have time to fully review, but most of my initial worries are addressed:

  • is_special_token -> special makes things more uniform.
  • The breaking test change also fixes another subtle bug within transformers when playing with special tokens (and we're releasing a breaking version anyway).

The rest looks OK.

Successfully merging this pull request may close these issues: AddedTokens loophole

3 participants