
AutoTokenizer - add from_model_name method #13623

Conversation

patrickvonplaten (Contributor)

This PR adds a new .from_model_name(...) class method to AutoTokenizer.

In some example scripts we would like to train or create the tokenizer from scratch - e.g. see #13620. Depending on which model is used, the corresponding tokenizer class should be loaded. I think the cleanest way to pick that tokenizer class is to go by the model names, such as "bert", "gpt2", etc.
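
For illustration, usage could look roughly like this (a sketch based on the description above; the exact signature of the proposed method may differ, and the file names are placeholders):

from transformers import AutoTokenizer

# "gpt2" is one of the short model names ("bert", "gpt2", ...) used to pick the matching
# tokenizer class; tokenizer files created in the script are passed along as kwargs.
tokenizer = AutoTokenizer.from_model_name("gpt2", vocab_file="vocab.json", merges_file="merges.txt")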

@sgugger (Collaborator) left a comment

This looks like a nice new addition! I prefer using this to relying on the auto mapping to find the classes from the names.

@LysandreJik (Member) left a comment

Why not leverage the configuration here and do from_config like in your docstring? I understand the tokenizer gets very limited useful information from the configuration (even if it could potentially leverage the pad token ID, vocab size, etc.), but I think this makes more sense:

from transformers import AutoModel, AutoTokenizer, BertConfig

config = BertConfig()
model = AutoModel.from_config(config)
tokenizer = AutoTokenizer.from_config(config)

while this is less user-friendly:

from transformers import AutoModel, AutoTokenizer, BertConfig

config = BertConfig()
model = AutoModel.from_config(config)
tokenizer = AutoTokenizer.from_model_name("bert")
# or
tokenizer = AutoTokenizer.from_model_name(config.model_type)

I also don't understand how the tokenizer is supposed to behave from the code example, especially relative to the configuration of the tokenizer: how is the vocab size determined? From the default of the BERT config, or from the vocabulary? How would a user customize it? I would add an example to your docstring.

src/transformers/models/auto/tokenization_auto.py (review comment, outdated and resolved)
@patrickvonplaten (Contributor, Author) commented Sep 17, 2021

tokenizer = AutoTokenizer.from_config(config)

Hmm, I think from_model_name(....) is the better option here because:

  • This method is mainly intended for example scripts where the tokenizer is trained/created in the script itself. In such a case, we can't leverage the model config via AutoTokenizer.from_config(...), since it's the creation of the tokenizer that defines vocab_size, pad_token_id, .... => e.g. I would like to enable creating a generic ASR tokenizer in such a script. Note that in that script the vocab file is created as part of the example.
  • We can't do AutoTokenizer.from_config(...) as we will also have to pass a vocab_file, merges_file, etc. -> so the function cannot have the same API as AutoModel.from_config(...).
  • I think we were planning on separating the model config and the tokenizer config instead of having a single config => I think it can then be confusing to do AutoTokenizer.from_config(...), as one doesn't know whether a tokenizer config or a model config should be passed here. Essentially I was aiming for an API like this:
config = AutoConfig.from_pretrained("facebook/wav2vec2-pretrained")  # <- pretrained speech models have no information about vocab size, etc.... yet

# ... create tokenizer vocab files

tokenizer = AutoTokenizer.from_model_name(config.model_type, vocab_file="path/to/vocab/file")

config["vocab_size"] = len(tokenizer)

model = AutoModelForCTC(config)  # <- now create model

I haven't thought too much about the API for the "NLP" tokenizers in this case, but couldn't one use tokenizers to train a tokenizer and then load it into a transformers tokenizer class via AutoTokenizer.from_model_name("roberta", tokenizer_file="path/to/tokenizer")? Or does the API not make much sense here?
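
(For reference, a rough sketch of that NLP workflow with the tokenizers library; the from_model_name call is the API proposed in this PR, not an existing method, and the file names are made up:)

from tokenizers import ByteLevelBPETokenizer
from transformers import AutoTokenizer

# train a byte-level BPE tokenizer from raw text files
bpe = ByteLevelBPETokenizer()
bpe.train(files=["train.txt"], vocab_size=30000, min_frequency=2)
bpe.save("tokenizer.json")

# load it into the matching transformers tokenizer class via the proposed method
tokenizer = AutoTokenizer.from_model_name("roberta", tokenizer_file="tokenizer.json")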

Keen to hear your thoughts here @LysandreJik @SaulLu

@LysandreJik (Member)

I understand your points. I agree that it is confusing to have AutoTokenizer.from_config(...) rely on the model configuration.

A last comment: from_model_name implies that a tokenizer is 1-to-1 linked with a model, but that's not the case. For example, what of BERTweet, which has a BertweetTokenizer but its model is a RoBERTa, or GPT-Neo/GPT-J/iBERT that use a GPT2Tokenizer under the hood? I'd opt for having from_name, or from_type if the goal is to always leverage config.model_type instead.

@SaulLu (Contributor) left a comment

Thanks for the addition @patrickvonplaten !

I like the idea of being able to get a tokenizer with AutoTokenizer of a certain type without having to find a model checkpoint of the type we want on the hub.

The only small question I have is: won't finding out how to spell the name be too complex for a user? I think this also ties in a bit with @LysandreJik's comment about the method name telling the user what syntax to use. For example, to get the XLMProphetNetTokenizer tokenizer, should they write XLMProphetNet, xlm-prophetnet or xlm_prophetnet (and how should they know which one is right)?

I think what could help users would be an error message, raised when model_name does not correspond to any key of TOKENIZER_MAPPING_NAMES, that proposes the closest names according to the Levenshtein distance (I made a little demo in this Google Colab notebook of what such an error message could look like).
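
(Something along these lines, using difflib from the standard library as a stand-in for a proper Levenshtein ranking; the function name is made up for illustration:)

import difflib

def lookup_tokenizer_names(model_name, mapping):
    # mapping would be TOKENIZER_MAPPING_NAMES, e.g. {"bert": (...), "gpt2": (...), ...}
    if model_name in mapping:
        return mapping[model_name]
    # suggest the closest known names when the lookup fails
    suggestions = difflib.get_close_matches(model_name, mapping.keys(), n=3, cutoff=0.5)
    hint = f" Did you mean one of: {', '.join(suggestions)}?" if suggestions else ""
    raise ValueError(f"Unrecognized model name '{model_name}'.{hint}")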

I wonder if this isn't a feature we'd like to have for AutoConfig too. That said, it would probably be worth waiting for a direct request from users.


tokenizer_class_name, tokenizer_class_name_fast = TOKENIZER_MAPPING_NAMES[model_name]

if use_fast and tokenizer_class_name_fast is not None:
Contributor

A small suggestion here: shouldn't we raise an error when the user explicitly requests a slow or fast version and it doesn't exist for the requested type?

Contributor Author

Think a warning might be a good idea! In AutoTokenizer.from_pretrained(..., use_fast=True) we always fall back to the slow version if the fast one doesn't exist, so I think we should keep the same design here. cc @sgugger @LysandreJik WDYT?

Collaborator

Yes, since use_fast=True is the default, the user usually did not explicitly request the fast tokenizer, so there should not be a warning (they scare users).

Another design could be to have the parameter default to None, and when it's None keep the behavior we have right now (try fast, then fall back to slow). When it's an explicit boolean, we could then either issue a warning or raise an error. An error would probably be nicer and would allow us to get rid of some assert tokenizer.is_fast statements in the examples.
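
(For concreteness, a minimal sketch of that tri-state design; the helper and variable names here are made up:)

def resolve_tokenizer_class(slow_class, fast_class, use_fast=None):
    # use_fast=None: current behavior, prefer fast but silently fall back to slow
    if use_fast is None:
        return fast_class if fast_class is not None else slow_class
    # explicit True/False: honor the request or fail loudly
    if use_fast:
        if fast_class is None:
            raise ValueError("No fast tokenizer is available for this tokenizer type.")
        return fast_class
    if slow_class is None:
        raise ValueError("No slow tokenizer is available for this tokenizer type.")
    return slow_class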

In all cases, we should have the same design in AutoTokenizer class methods obviously.

Wdyt?

Contributor Author

Think it'd be a good idea to default to None and raise a warning/error if the boolean is explicitly set to True and there is no fast tokenizer. I think I'd opt for a warning though (maybe saying that this behavior will result in an error in the future), as raising an error would be breaking for AutoTokenizer.from_pretrained(...).

@patrickvonplaten (Contributor, Author) commented Sep 20, 2021

After talking to Lysandre offline, a better solution might actually be to leverage the AutoTokenizer.from_pretrained(...) method directly.

The main reason for this PR is to make fine-tuning of pretrained speech recognition models easier. I want to be able to write a fine-tuning script just by using Auto... classes and no Wav2Vec2... classes.

To give a bit more context:

In speech recognition, a pretrained speech model (Wav2Vec2, HuBERT) has the following files defined after pretraining:

  1. the pretrained model, which can be loaded with AutoModelForCTC.from_pretrained(...)
  2. the feature extractor, which can be loaded with AutoFeatureExtractor.from_pretrained(...)
  3. the configuration file, which needs to be adapted before being used to load the model

Note that no tokenizer is needed for pretraining, so one has to be defined on the fly for fine-tuning.

In fine-tuning the workflow is then as follows:

import json

from transformers import AutoConfig

# 1. load the config
config = AutoConfig.from_pretrained("facebook/wav2vec2-base")

# 2. create the vocabulary
characters = training_data["unique_chars"]
vocab_dict = {v: k for k, v in enumerate(characters)}

# 3. save the vocabulary locally to be able to create the tokenizer
with open("vocab.json", "w") as f:
    json.dump(vocab_dict, f)

# 4. IMPORTANT: Now I want to instantiate a new tokenizer using the just created `"vocab.json"`, passing the model type.
# The tokenizer will be uploaded to the model repo after fine-tuning.

# <- so possible solutions to this are
# 1. AutoTokenizer.from_name(config.model_type)
# 2. AutoTokenizer.from_config(config)
# 3. AutoTokenizer.from_pretrained("./", config=config)
# 4. AutoTokenizer.from_pretrained("./", model_type=config.model_type)

Case #1 was the proposal of this PR.
I'm not a fan of Case #2 as it's not clear to me whether config is a tokenizer config or a model config.
Case #3 actually already works out of the box and wouldn't require any changes. Given that the API is already there, I think it's fine to use it. However, I'm also a bit worried that people will assume that the config will also overwrite attributes in the tokenizer config. E.g. I could see people setting pad_token_id and vocab_size in the config and then assuming the tokenizer will have the same vocab_size and pad_token_id after calling AutoTokenizer.from_pretrained("./", config=config).
Case #4 would require only minimal changes to AutoTokenizer (essentially just adding a kwargs.pop("model_type")); a sketch follows below.
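
(Roughly, the Case #4 change could look like the snippet below inside AutoTokenizer.from_pretrained; this is only an illustration of the idea, assuming a helper like tokenizer_class_from_name and the existing TOKENIZER_MAPPING_NAMES mapping, not the actual patch:)

# inside AutoTokenizer.from_pretrained(pretrained_model_name_or_path, *inputs, **kwargs)
model_type = kwargs.pop("model_type", None)  # or "tokenizer_type", see below
use_fast = kwargs.get("use_fast", True)
if model_type is not None:
    slow_name, fast_name = TOKENIZER_MAPPING_NAMES[model_type]
    class_name = fast_name if (use_fast and fast_name is not None) else slow_name
    tokenizer_class = tokenizer_class_from_name(class_name)
    return tokenizer_class.from_pretrained(pretrained_model_name_or_path, *inputs, **kwargs)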

=> Having talked to @LysandreJik a bit more about it, I think it does indeed not necessarily make sense to create a whole new API for the AutoTokenizer, and that it should be enough to just use the already existing .from_pretrained(...).

Also, it seems like in the case of training a tokenizer from scratch, one would simply always make use of PreTrainedTokenizer(...) directly instead of using the AutoTokenizer class.

@SaulLu - I very much agree that we should not assume that the user knows how to spell the model name. Rather, it should be extracted from the configuration file.

I would be in favor of Case #4 as I think it's the cleanest approach, but given the API for Case #3 already exists, I guess I would also be fine with using that one. Still, I think it's quite confusing to pass the model config to the AutoTokenizer, considering that we have a tokenizer config in tokenizer_config.json... I think AutoTokenizer.from_pretrained(..., config=config) is also not really used, no?

What do you guys think?

(Sorry for the long discussions here!)

@LysandreJik (Member)

Thanks for the write-up @patrickvonplaten. I think case 4 is the cleanest approach too, even if I still dislike model_type - how about tokenizer_type?

This is a nitpick though, happy to go with either.
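
(With that name, the speech fine-tuning call sketched above would presumably read something like the following; hypothetical until the follow-up change lands:)

tokenizer = AutoTokenizer.from_pretrained("./", tokenizer_type=config.model_type, vocab_file="vocab.json")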

@sgugger (Collaborator) commented Sep 20, 2021

I don't mind adding the API for case 4, but I share Lysandre's view on the name of the argument. tokenizer_type would be better indeed.

@patrickvonplaten (Contributor, Author)

Superseded by #13668
