[from_pretrained] Allow tokenizer_type ≠ model_type #6995
Conversation
Not sure I fully understand the use case, but nothing against the principle of it.
tokenizer_class_candidate = f"{config.tokenizer_class}Fast" | ||
else: | ||
tokenizer_class_candidate = config.tokenizer_class | ||
tokenizer_class = globals().get(tokenizer_class_candidate) |
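To unpack what this hunk does, here is a minimal illustrative sketch (the values are hypothetical; this is not the library's code): with use_fast set, the class name read from the config gets a Fast suffix before being looked up.

# Illustrative only: hypothetical values, not the library's actual code.
config_tokenizer_class = "BertTokenizer"  # e.g. read from config.json
use_fast = True

if use_fast:
    tokenizer_class_candidate = f"{config_tokenizer_class}Fast"
else:
    tokenizer_class_candidate = config_tokenizer_class

print(tokenizer_class_candidate)  # -> "BertTokenizerFast"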
Might be cleaner to use one of our internal lists/dicts containing all tokenizers, just in case there are weird things in the namespace of some users.
Yes, I was wondering about that. I was wondering whether, by using globals(), someone could even use a tokenizer that's not in the library, but I don't think so, as globals() is actually locals in this scope/file, if I understand correctly.
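For what it's worth, a registry-based lookup could look something like the sketch below; the registry name and contents are illustrative, not the library's actual internal tables.

from transformers import BertTokenizer, BertTokenizerFast

# Hypothetical explicit registry (illustrative; the real internal mapping
# in transformers may differ in name and shape).
TOKENIZER_REGISTRY = {
    "BertTokenizer": BertTokenizer,
    "BertTokenizerFast": BertTokenizerFast,
}

def resolve_tokenizer_class(candidate):
    # An explicit dict guarantees only deliberately registered classes can
    # be resolved, unlike globals(), which exposes the whole module namespace.
    tokenizer_class = TOKENIZER_REGISTRY.get(candidate)
    if tokenizer_class is None:
        raise ValueError(
            f"Tokenizer class {candidate} does not exist or is not currently imported."
        )
    return tokenizer_class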
Cool, this looks good to me! Thanks for working on it.
Very nice!
tokenizer_class_candidate = config.tokenizer_class
tokenizer_class = globals().get(tokenizer_class_candidate)
if tokenizer_class is None:
    raise ValueError("Tokenizer class {} does not exist or is not currently imported.")
The content of the {} is missing? Or is there some magic somewhere in ValueError that fills this in?
Oops, no, it's missing.
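Presumably the fix is just to interpolate the candidate name, along these lines:

raise ValueError(
    f"Tokenizer class {tokenizer_class_candidate} does not exist or is not currently imported."
)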
The idea is to prevent a combinatorial explosion of "model types" when only the tokenizer is different (e.g. Flaubert and CamemBERT, if we wanted to support them today). In the future we might even want to have a few model-agnostic tokenizer classes like ByteLevelBPETokenizer (basically RobertaTokenizer), as they can be initialized pretty exhaustively from the init args stored in
Great!
Nice! LGTM
For an example usage of this PR, see the tokenizer_class attribute in this config.json: https://s3.amazonaws.com/models.huggingface.co/bert/julien-c/dummy-diff-tokenizer/config.json

Instead of a class, we could have used a tokenizer_type belonging to the set of all model_types, like "bert", etc., but that feels more restrictive, especially in case we start having tokenizer classes that are not obviously linked to a "model", like a potential "TweetTokenizer".

Context: #6129
Update: documented by @sgugger in #8152