[from_pretrained] Allow tokenizer_type ≠ model_type #6995
Conversation
Not sure I fully understand the use case, but nothing against the principle of it.
tokenizer_class_candidate = f"{config.tokenizer_class}Fast" | ||
else: | ||
tokenizer_class_candidate = config.tokenizer_class | ||
tokenizer_class = globals().get(tokenizer_class_candidate) |
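To unpack what this hunk does, here is a minimal illustrative sketch (the values are hypothetical; this is not the library's code): with use_fast set, the class name read from the config gets a Fast suffix before being looked up.

# Illustrative only: hypothetical values, not the library's actual code.
config_tokenizer_class = "BertTokenizer"  # e.g. read from config.json
use_fast = True

if use_fast:
    tokenizer_class_candidate = f"{config_tokenizer_class}Fast"
else:
    tokenizer_class_candidate = config_tokenizer_class

print(tokenizer_class_candidate)  # -> "BertTokenizerFast"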
Might be cleaner to use one of our internal lists/dicts containing all tokenizers, just in case there are weird things in the namespace of some users.
Yes, I was wondering about that. I was wondering whether, by using globals(), someone could even use a tokenizer that's not in the library, but I don't think so, as globals() is actually locals in this scope/file, if I understand correctly.
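For what it's worth, a registry-based lookup could look something like the sketch below; the registry name and contents are illustrative, not the library's actual internal tables.

from transformers import BertTokenizer, BertTokenizerFast

# Hypothetical explicit registry (illustrative; the real internal mapping
# in transformers may differ in name and shape).
TOKENIZER_REGISTRY = {
    "BertTokenizer": BertTokenizer,
    "BertTokenizerFast": BertTokenizerFast,
}

def resolve_tokenizer_class(candidate):
    # An explicit dict guarantees only deliberately registered classes can
    # be resolved, unlike globals(), which exposes the whole module namespace.
    tokenizer_class = TOKENIZER_REGISTRY.get(candidate)
    if tokenizer_class is None:
        raise ValueError(
            f"Tokenizer class {candidate} does not exist or is not currently imported."
        )
    return tokenizer_class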
Cool, this looks good to me! Thanks for working on it.
Very nice!
tokenizer_class_candidate = config.tokenizer_class
tokenizer_class = globals().get(tokenizer_class_candidate)
if tokenizer_class is None:
    raise ValueError("Tokenizer class {} does not exist or is not currently imported.")
The content of the {} is missing? Or is there some magic somewhere in ValueError that fills this in?
Oops, no, it's missing.
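Presumably the fix is just to interpolate the candidate name, along these lines:

raise ValueError(
    f"Tokenizer class {tokenizer_class_candidate} does not exist or is not currently imported."
)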
The idea is to prevent a combinatorial explosion of "model types" when only the tokenizer is different (e.g. Flaubert and CamemBERT, if we wanted to support them today). In the future we might even want to have a few model-agnostic tokenizer classes like ByteLevelBPETokenizer (basically RobertaTokenizer), as they can be initialized pretty exhaustively from the init args stored in
Great!
Nice! LGTM
For an example usage of this PR, see the tokenizer_class attribute in this config.json: https://s3.amazonaws.com/models.huggingface.co/bert/julien-c/dummy-diff-tokenizer/config.json

Instead of a class, we could have used a tokenizer_type belonging to the set of all model_types, like "bert", etc., but that feels more restrictive, especially in case we start having tokenizer classes that are not obviously linked to a "model", like a potential "TweetTokenizer".

Context: #6129
Update: documented by @sgugger in #8152