Update added tokens #1335
Conversation
```rust
// TODO ArthurZ THIS IS WRONG! We need to measure the length of the `set` because
// now some tokens can be both in the added_tokens_encoder and in the vocab
if with_added_tokens {
    self.get_vocab(true).len()
```
Why not use max(vocab_id) instead?
To account for potential holes 😢
And if we use max, we would have to convert the u32 to usize, or return usize.
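The thread above argues for counting the set of entries rather than taking `max(vocab_id)`, because the id space can have holes. A minimal illustration of the discrepancy (plain Python dicts, not the tokenizers API):

```python
# A vocab whose id space has "holes": ids 2-4 are unassigned.
vocab = {"a": 0, "b": 1, "c": 5}

size_by_len = len(vocab)               # counts distinct entries -> 3
size_by_max = max(vocab.values()) + 1  # counts through the holes -> 6
```

With holes, `max(id) + 1` over-reports the vocab size, which is why the PR measures the length of the merged set instead.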
LGTM.
As said internally, I didn't have time to fully review, but most of my initial worries are addressed:
- `is_special_token` -> `special` makes things more uniform.
- The breaking test change also fixes another subtle bug in transformers when playing with special tokens (and we're releasing a breaking version anyway).

The rest looks OK.
What does this PR do?
Fixes #1334 and refactors the `AddedVocabulary`. Previously, when a token was already part of the vocabulary, it was not added to `added_tokens_map` but was still added to `added_tokens_map_r`, which is inconsistent. Now it is added to both even if it already existed. Here is a small snippet of what is now possible in Python:
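As an illustration of the behavior described above, here is a minimal sketch of the dual-map bookkeeping, using plain Python dicts. The field names mirror the Rust side (`added_tokens_map`, `added_tokens_map_r`); the `add_token` helper is hypothetical, not the actual tokenizers API:

```python
def add_token(token, vocab, added_tokens_map, added_tokens_map_r):
    """Register `token`, reusing its id when it already exists in `vocab`."""
    if token in vocab:
        token_id = vocab[token]
    else:
        # Next free id after both the base vocab and already-added tokens.
        used = set(vocab.values()) | set(added_tokens_map_r)
        token_id = max(used, default=-1) + 1
    # The fix: keep BOTH maps in sync, even for tokens already in the vocab.
    added_tokens_map[token] = token_id
    added_tokens_map_r[token_id] = token
    return token_id

vocab = {"hello": 0, "world": 1}
added, added_r = {}, {}
add_token("hello", vocab, added, added_r)  # already in vocab: reuses id 0
add_token("<bot>", vocab, added, added_r)  # new token: gets id 2
```

After the fix, `"hello"` appears in both `added` and `added_r` with its existing id, instead of only in the reverse map.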