
AutoTokenizer - add from_model_name method #13623

Conversation

patrickvonplaten (Contributor)

This PR adds a new .from_model_name(...) class method to AutoTokenizer.

In some example scripts we would like to train or create the tokenizer from scratch - e.g. see #13620. Depending on which model is used, the corresponding tokenizer class should be loaded. I think the cleanest way to pick that tokenizer class is to go by the model names, such as "bert", "gpt2", etc.
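
For illustration, usage could look roughly like this (a sketch based on the description above; the exact signature of the proposed method may differ, and the file names are placeholders):

from transformers import AutoTokenizer

# "gpt2" is one of the short model names ("bert", "gpt2", ...) used to pick the matching
# tokenizer class; tokenizer files created in the script are passed along as kwargs.
tokenizer = AutoTokenizer.from_model_name("gpt2", vocab_file="vocab.json", merges_file="merges.txt")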

@sgugger (Collaborator) left a comment

This looks like a nice new addition! I prefer using this to relying on the auto mapping to find the classes from the names.

@LysandreJik (Member) left a comment

Why not leverage the configuration here and do from_config like in your docstring? I understand the tokenizer gets very limited useful information from the configuration (even if it could potentially leverage the pad token ID, vocab size, etc.), but I think this makes more sense:

from transformers import AutoModel, AutoTokenizer, BertConfig

config = BertConfig()
model = AutoModel.from_config(config)
tokenizer = AutoTokenizer.from_config(config)

while this is less user-friendly:

from transformers import AutoModel, AutoTokenizer, BertConfig

config = BertConfig()
model = AutoModel.from_config(config)
tokenizer = AutoTokenizer.from_model_name("bert")
# or
tokenizer = AutoTokenizer.from_model_name(config.model_type)

I also don't understand how the tokenizer is supposed to behave from the code example, especially relative to the configuration of the tokenizer: how is the vocab size determined? From the default of the BERT config, or from the vocabulary? How would a user customize it? I would add an example to your docstring.

src/transformers/models/auto/tokenization_auto.py (review comment, outdated and resolved)
@patrickvonplaten (Contributor, Author) commented Sep 17, 2021

tokenizer = AutoTokenizer.from_config(config)

Hmm, I think from_model_name(....) is the better option here because:

  • This method is mainly intended for example scripts where the tokenizer is trained/created in the script itself. In such a case, we can't leverage the model config via AutoTokenizer.from_config(...), since it's the creation of the tokenizer that defines vocab_size, pad_token_id, .... => e.g. I would like to enable creating a generic ASR tokenizer in such a script. Note that in that script the vocab file is created as part of the example.
  • We can't do AutoTokenizer.from_config(...) as we will also have to pass a vocab_file, merges_file, etc. -> so the function cannot have the same API as AutoModel.from_config(...).
  • I think we were planning on separating the model config and the tokenizer config instead of having a single config => I think it can then be confusing to do AutoTokenizer.from_config(...), as one doesn't know whether a tokenizer config or a model config should be passed here. Essentially I was aiming for an API like this:
config = AutoConfig.from_pretrained("facebook/wav2vec2-pretrained")  # <- pretrained speech models have no information about vocab size, etc.... yet

# ... create tokenizer vocab files

tokenizer = AutoTokenizer.from_model_name(config.model_type, vocab_file="path/to/vocab/file")

config["vocab_size"] = len(tokenizer)

model = AutoModelForCTC(config)  # <- now create model

I haven't thought too much about the API for the "NLP" tokenizers in this case, but couldn't one use tokenizers to train a tokenizer and then load it into a transformers tokenizer class via AutoTokenizer.from_model_name("roberta", tokenizer_file="path/to/tokenizer")? Or does the API not make much sense here?
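
(For reference, a rough sketch of that NLP workflow with the tokenizers library; the from_model_name call is the API proposed in this PR, not an existing method, and the file names are made up:)

from tokenizers import ByteLevelBPETokenizer
from transformers import AutoTokenizer

# train a byte-level BPE tokenizer from raw text files
bpe = ByteLevelBPETokenizer()
bpe.train(files=["train.txt"], vocab_size=30000, min_frequency=2)
bpe.save("tokenizer.json")

# load it into the matching transformers tokenizer class via the proposed method
tokenizer = AutoTokenizer.from_model_name("roberta", tokenizer_file="tokenizer.json")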

Keen to hear your thoughts here @LysandreJik @SaulLu

@LysandreJik (Member)

I understand your points. I agree that it is confusing to have AutoTokenizer.from_config(...) rely on the model configuration.

A last comment: from_model_name implies that a tokenizer is 1-to-1 linked with a model, but that's not the case. For example, what of BERTweet, which has a BertweetTokenizer but its model is a RoBERTa, or GPT-Neo/GPT-J/iBERT that use a GPT2Tokenizer under the hood? I'd opt for having from_name, or from_type if the goal is to always leverage config.model_type instead.

@SaulLu (Contributor) left a comment

Thanks for the addition @patrickvonplaten !

I like the idea of being able to get a tokenizer with AutoTokenizer of a certain type without having to find a model checkpoint of the type we want on the hub.

The only small question I have is: won't finding out how to spell the name be too complex for a user? I think this also ties in a bit with @LysandreJik's comment about the method name telling the user what syntax to use. For example, to get the XLMProphetNetTokenizer tokenizer, should they write XLMProphetNet, xlm-prophetnet or xlm_prophetnet (and how should they know which one is right)?

I think what could help users would be an error message, raised when model_name does not correspond to any key of TOKENIZER_MAPPING_NAMES, that proposes the closest names according to the Levenshtein distance (I made a little demo in this Google Colab notebook of what such an error message could look like).
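
(Something along these lines, using difflib from the standard library as a stand-in for a proper Levenshtein ranking; the function name is made up for illustration:)

import difflib

def lookup_tokenizer_names(model_name, mapping):
    # mapping would be TOKENIZER_MAPPING_NAMES, e.g. {"bert": (...), "gpt2": (...), ...}
    if model_name in mapping:
        return mapping[model_name]
    # suggest the closest known names when the lookup fails
    suggestions = difflib.get_close_matches(model_name, mapping.keys(), n=3, cutoff=0.5)
    hint = f" Did you mean one of: {', '.join(suggestions)}?" if suggestions else ""
    raise ValueError(f"Unrecognized model name '{model_name}'.{hint}")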

I wonder if this isn't a feature we'd like to have for AutoConfig too. That said, it would probably be worth waiting for a direct request from users.


tokenizer_class_name, tokenizer_class_name_fast = TOKENIZER_MAPPING_NAMES[model_name]

if use_fast and tokenizer_class_name_fast is not None:
Contributor

A small suggestion here: shouldn't we raise an error when the user explicitly requests a slow or fast version and it doesn't exist for the requested type?

Contributor Author

Think a warning might be a good idea! In AutoTokenizer.from_pretrained(..., use_fast=True) we always fall back to the slow version if the fast one doesn't exist, so I think we should keep the same design here. cc @sgugger @LysandreJik WDYT?

Collaborator

Yes, since use_fast=True is the default, the user usually did not explicitly request the fast tokenizer, so there should not be a warning (they scare users).

Another design could be to have the parameter default to None, and when it's None keep the behavior we have right now (try fast, then fall back to slow). When it's an explicit boolean, we could then either issue a warning or raise an error. An error would probably be nicer and would allow us to get rid of some assert tokenizer.is_fast statements in the examples.
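
(For concreteness, a minimal sketch of that tri-state design; the helper and variable names here are made up:)

def resolve_tokenizer_class(slow_class, fast_class, use_fast=None):
    # use_fast=None: current behavior, prefer fast but silently fall back to slow
    if use_fast is None:
        return fast_class if fast_class is not None else slow_class
    # explicit True/False: honor the request or fail loudly
    if use_fast:
        if fast_class is None:
            raise ValueError("No fast tokenizer is available for this tokenizer type.")
        return fast_class
    if slow_class is None:
        raise ValueError("No slow tokenizer is available for this tokenizer type.")
    return slow_class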

In all cases, we should have the same design in AutoTokenizer class methods obviously.

Wdyt?

Contributor Author

Think it'd be a good idea to default to None and raise a warning/error if the boolean is explicitly set to True and there is no fast tokenizer. I think I'd opt for a warning though (maybe saying that this behavior will result in an error in the future), as raising an error would be breaking for AutoTokenizer.from_pretrained(...).

@patrickvonplaten (Contributor, Author) commented Sep 20, 2021

After talking to Lysandre offline, a better solution might actually be to leverage the AutoTokenizer.from_pretrained(...) method directly.

The main reason for this PR is to make fine-tuning of pretrained speech recognition models easier. I want to be able to write a fine-tuning script just by using Auto... classes and no Wav2Vec2... classes.

To give a bit more context:

In speech recognition, a pretrained speech model (Wav2Vec2, HuBERT) has the following files defined after pretraining:

  1. the pretrained model, which can be loaded with AutoModelForCTC.from_pretrained(...)
  2. the feature extractor, which can be loaded with AutoFeatureExtractor.from_pretrained(...)
  3. the configuration file, which needs to be adapted before being used to load the model

Note that no tokenizer is needed for pretraining, so one has to be defined on the fly for fine-tuning.

In fine-tuning the workflow is then as follows:

import json

from transformers import AutoConfig

# 1. load the config
config = AutoConfig.from_pretrained("facebook/wav2vec2-base")

# 2. create the vocabulary
characters = training_data["unique_chars"]
vocab_dict = {v: k for k, v in enumerate(characters)}

# 3. save the vocabulary locally to be able to create the tokenizer
with open("vocab.json", "w") as f:
    json.dump(vocab_dict, f)

# 4. IMPORTANT: Now I want to instantiate a new tokenizer using the just created `"vocab.json"`, passing the model type.
# The tokenizer will be uploaded to the model repo after fine-tuning.

# <- so possible solutions to this are
# 1. AutoTokenizer.from_name(config.model_type)
# 2. AutoTokenizer.from_config(config)
# 3. AutoTokenizer.from_pretrained("./", config=config)
# 4. AutoTokenizer.from_pretrained("./", model_type=config.model_type)

Case #1 was the proposal of this PR.
I'm not a fan of Case #2 as it's not clear to me whether config is a tokenizer config or a model config.
Case #3 actually already works out of the box and wouldn't require any changes. Given that the API is already there, I think it's fine to use it. However, I'm also a bit worried that people will assume that the config will also overwrite attributes in the tokenizer config. E.g. I could see people setting pad_token_id and vocab_size in the config and then assuming the tokenizer will have the same vocab_size and pad_token_id after calling AutoTokenizer.from_pretrained("./", config=config).
Case #4 would require only minimal changes to AutoTokenizer (essentially just adding a kwargs.pop("model_type")); a sketch follows below.
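
(Roughly, the Case #4 change could look like the snippet below inside AutoTokenizer.from_pretrained; this is only an illustration of the idea, assuming a helper like tokenizer_class_from_name and the existing TOKENIZER_MAPPING_NAMES mapping, not the actual patch:)

# inside AutoTokenizer.from_pretrained(pretrained_model_name_or_path, *inputs, **kwargs)
model_type = kwargs.pop("model_type", None)  # or "tokenizer_type", see below
use_fast = kwargs.get("use_fast", True)
if model_type is not None:
    slow_name, fast_name = TOKENIZER_MAPPING_NAMES[model_type]
    class_name = fast_name if (use_fast and fast_name is not None) else slow_name
    tokenizer_class = tokenizer_class_from_name(class_name)
    return tokenizer_class.from_pretrained(pretrained_model_name_or_path, *inputs, **kwargs)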

=> Having talked to @LysandreJik a bit more about it, I think it does indeed not necessarily make sense to create a whole new API for the AutoTokenizer, and that it should be enough to just use the already existing .from_pretrained(...).

Also, it seems like in the case of training a tokenizer from scratch, one would simply always make use of PreTrainedTokenizer(...) directly instead of using the AutoTokenizer class.

@SaulLu - I very much agree that we should not assume that the user knows how to spell the model name. Rather, it should be extracted from the configuration file.

I would be in favor of Case #4 as I think it's the cleanest approach, but given the API for Case #3 already exists, I guess I would also be fine with using that one. Still, I think it's quite confusing to pass the model config to the AutoTokenizer, considering that we have a tokenizer config in tokenizer_config.json... I think AutoTokenizer.from_pretrained(..., config=config) is also not really used, no?

What do you guys think?

(Sorry for the long discussions here!)

@LysandreJik (Member)

Thanks for the write-up @patrickvonplaten. I think case 4 is the cleanest approach too, even if I still dislike model_type - how about tokenizer_type?

This is a nitpick though, happy to go with either.
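
(With that name, the speech fine-tuning call sketched above would presumably read something like the following; hypothetical until the follow-up change lands:)

tokenizer = AutoTokenizer.from_pretrained("./", tokenizer_type=config.model_type, vocab_file="vocab.json")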

@sgugger (Collaborator) commented Sep 20, 2021

I don't mind adding the API for case 4, but I share Lysandre's view on the name of the argument. tokenizer_type would be better indeed.

@patrickvonplaten (Contributor, Author)

Superseded by #13668
