
[from_pretrained] Allow tokenizer_type ≠ model_type #6995


Merged: 1 commit merged into master on Sep 9, 2020

Conversation

@julien-c (Member) commented on Sep 7, 2020

For an example usage of this PR, see the tokenizer_class attribute in this config.json: https://s3.amazonaws.com/models.huggingface.co/bert/julien-c/dummy-diff-tokenizer/config.json

Instead of a class, we could have used a tokenizer_type belonging to the set of all model_types, like "bert", etc., but that feels more restrictive, especially if we start having tokenizer classes that are not obviously linked to a "model", like a potential "TweetTokenizer".
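
As a usage sketch (only the repo name comes from the link above; the config field values shown in the comments are assumptions for illustration):

from transformers import AutoTokenizer

# Hypothetical config.json contents on the hub:
# { "model_type": "roberta", "tokenizer_class": "BertTokenizer" }
#
# AutoTokenizer reads tokenizer_class from the config and instantiates that
# class instead of the tokenizer implied by model_type.
tokenizer = AutoTokenizer.from_pretrained("julien-c/dummy-diff-tokenizer")
print(type(tokenizer).__name__)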

Context: #6129

Update: documented by @sgugger in #8152

@sgugger (Collaborator) left a comment

Not sure I fully understand the use case, but nothing against the principle of it.

    tokenizer_class_candidate = f"{config.tokenizer_class}Fast"
else:
    tokenizer_class_candidate = config.tokenizer_class
tokenizer_class = globals().get(tokenizer_class_candidate)
Collaborator

Might be cleaner to use some of our internal list/dict containing all tokenizers, just in case there are weird things in the namespace of some users.
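
A minimal sketch of that idea, assuming a hypothetical module-level registry rather than the library's actual internal mapping:

from transformers import BertTokenizer, BertTokenizerFast, RobertaTokenizer

# Hypothetical registry; the library keeps its own tokenizer mappings,
# this dict only illustrates looking up a known class by name.
TOKENIZER_CLASSES = {
    "BertTokenizer": BertTokenizer,
    "BertTokenizerFast": BertTokenizerFast,
    "RobertaTokenizer": RobertaTokenizer,
}

tokenizer_class_candidate = "BertTokenizerFast"  # e.g. read from config.tokenizer_class
tokenizer_class = TOKENIZER_CLASSES.get(tokenizer_class_candidate)
if tokenizer_class is None:
    raise ValueError(
        f"Tokenizer class {tokenizer_class_candidate} does not exist or is not currently imported."
    )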

Member Author (@julien-c)

Yes, I was wondering about that. I was also wondering whether, by using globals(), someone could even use a tokenizer that's not in the library, but I don't think so, as globals() is effectively local to this scope/file if I understand correctly.
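
A small standalone illustration of that point (not code from the PR): globals() inside a module only resolves names defined or imported in that module, so a user's own classes are not reachable through it.

# my_module.py (hypothetical standalone file)
from transformers import BertTokenizer  # imported here, so visible in globals()

def lookup(name):
    # globals() resolves only names defined or imported in *this* module.
    return globals().get(name)

print(lookup("BertTokenizer"))       # the BertTokenizer class
print(lookup("MyCustomTokenizer"))   # None: a class defined elsewhere is not visible here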

@LysandreJik (Member) left a comment

Cool, this looks good to me! Thanks for working on it.

@thomwolf (Member) left a comment

Very nice!

tokenizer_class_candidate = config.tokenizer_class
tokenizer_class = globals().get(tokenizer_class_candidate)
if tokenizer_class is None:
    raise ValueError("Tokenizer class {} does not exist or is not currently imported.")
Member

The content of the {} is missing? Or is there some magic somewhere in ValueError that fills this in?

@julien-c (Member Author) · Sep 8, 2020

Oops, no, it's missing.
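
For reference, one way the message could be completed (a sketch continuing the snippet quoted above; the exact wording of the committed fix may differ):

# ...continuing from the quoted snippet, where tokenizer_class_candidate and
# tokenizer_class have just been computed:
if tokenizer_class is None:
    raise ValueError(
        f"Tokenizer class {tokenizer_class_candidate} does not exist or is not currently imported."
    )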

@julien-c (Member Author) commented on Sep 8, 2020

> Not sure I fully understand the use case, but nothing against the principle of it.

The idea is to prevent a combinatorial explosion of "model types" when only the tokenizer is different (e.g. Flaubert, CamemBERT, if we wanted to support them today).

In the future we might even want to have a few model-agnostic tokenizer classes like ByteLevelBPETokenizer (basically RobertaTokenizer), as they can be initialized pretty exhaustively from the init args stored in tokenizer_config.json.
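
As an illustration of that point, two hypothetical config.json files could share a model_type and differ only in tokenizer_class (the contents below are assumptions, not actual hub files):

# Hypothetical config.json contents for two checkpoints sharing one architecture.
config_a = {
    "model_type": "roberta",
    "tokenizer_class": "RobertaTokenizer",
}
config_b = {
    "model_type": "roberta",                   # same model implementation,
    "tokenizer_class": "CamembertTokenizer",   # different tokenizer, no new model_type
}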

@patrickvonplaten (Contributor) left a comment

Great!

@n1t0 (Contributor) left a comment

Nice! LGTM

@LysandreJik merged commit ed71c21 into master on Sep 9, 2020
@LysandreJik deleted the pretrained_override_tokenizer_class branch on September 9, 2020 at 08:23
@patrickvonplaten mentioned this pull request on Sep 16, 2020
Zigur pushed a commit to Zigur/transformers that referenced this pull request Oct 26, 2020
@julien-c (Member Author) commented on Nov 9, 2020

Documented by @sgugger in #8152
