AutoTokenizer - add `from_model_name` method #13623
Conversation
This looks like a nice new addition! I prefer using this to relying on the auto mapping to find the classes from the names.
Why not leverage the configuration here and do `from_config`, like in your docstring? I understand the tokenizer gets very limited useful information from the configuration (even if it could potentially leverage the pad token ID, vocab size, etc.), but I think this makes more sense:
```python
from transformers import AutoModel, AutoTokenizer, BertConfig

config = BertConfig()
model = AutoModel.from_config(config)
tokenizer = AutoTokenizer.from_config(config)
```
while this is less user-friendly:
```python
from transformers import AutoModel, AutoTokenizer, BertConfig

config = BertConfig()
model = AutoModel.from_config(config)
tokenizer = AutoTokenizer.from_model_name("bert")
# or
tokenizer = AutoTokenizer.from_model_name(config.model_type)
```
I also don't understand how the tokenizer is supposed to behave from the code example, especially relative to the configuration of the tokenizer: how is the vocab size calculated? From the default of the `bert` model, from the vocab size? How would a user customize it? I would add an example in your docstring.
Hmm, I think

```python
config = AutoConfig.from_pretrained("facebook/wav2vec2-pretrained")  # <- pretrained speech models have no information about vocab size, etc. ... yet
# ... create tokenizer vocab files
tokenizer = AutoTokenizer.from_model_name(config.model_type, vocab_file="path/to/vocab/file")
config.vocab_size = len(tokenizer)
model = AutoModelForCTC.from_config(config)  # <- now create model
```

I haven't thought too much about the API for the "NLP" tokenizers in this case, but couldn't one use [...]? Keen to hear your thoughts here @LysandreJik @SaulLu
Co-authored-by: Lysandre Debut <lysandre@huggingface.co>
I understand your points. I agree that it is confusing to have [...]. A last comment: [...]
Thanks for the addition @patrickvonplaten!

I like the idea of being able to get a tokenizer with `AutoTokenizer` of a certain type without having to find a model checkpoint of that type on the Hub.

The only small question I have is: won't finding out how to spell the name be too complex for a user? I think this also ties in a bit with @LysandreJik's comment about the method name telling the user what syntax to use. For example, to get the `XLMProphetNetTokenizer` tokenizer, should they write `XLMProphetNet`, `xlm-prophetnet`, or `xlm_prophetnet` (and how should they know which one is right)?
I think that what could help users would be to have an error message when `model_name` does not correspond to any key of `TOKENIZER_MAPPING_NAMES`, proposing the closest names according to the Levenshtein distance (I made a little demo in this Google Colab notebook of what such an error message could look like).
I wonder if this isn't a feature we'd like to have for `AutoConfig` too. That said, it would probably be worth waiting for a direct request from users.
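A minimal sketch (my own illustration, not the code from the linked notebook) of the kind of error message described above: look up `model_name` and, if it is unknown, suggest the closest known keys by Levenshtein distance. Here `mapping_names` is a stand-in for `TOKENIZER_MAPPING_NAMES`.

```python
def levenshtein(a: str, b: str) -> int:
    # Classic dynamic-programming edit distance between two strings.
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            curr.append(min(prev[j] + 1,                # deletion
                            curr[j - 1] + 1,            # insertion
                            prev[j - 1] + (ca != cb)))  # substitution / match
        prev = curr
    return prev[-1]


def lookup_tokenizer_names(model_name, mapping_names, n_suggestions=3):
    # Return the mapped class names, or raise with "did you mean" suggestions.
    if model_name in mapping_names:
        return mapping_names[model_name]
    closest = sorted(mapping_names, key=lambda name: levenshtein(model_name, name))[:n_suggestions]
    raise ValueError(
        f"'{model_name}' is not a recognized model type. Did you mean one of: {', '.join(closest)}?"
    )
```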
Comment on these lines of the diff:

```python
tokenizer_class_name, tokenizer_class_name_fast = TOKENIZER_MAPPING_NAMES[model_name]
...
if use_fast and tokenizer_class_name_fast is not None:
```
A small suggestion here: shouldn't we raise an error when the user explicitly requests a slow or fast version and it doesn't exist for the requested type?
I think a warning might be a good idea! In `AutoTokenizer.from_pretrained(..., use_fast=True)` we always fall back to the slow version if the fast one doesn't exist, so I think we should keep the same design here. cc @sgugger @LysandreJik WDYT?
Yes, since `use_fast=True` is the default, the user usually did not explicitly request the fast tokenizer, so there should not be a warning (they scare users).

Another design could be to have the parameter default to `None`, and when it's `None`, keep the behavior we have right now (try fast, then slow). When it's an explicit boolean, we could then either issue a warning or an error message. An error would probably be nicer and would allow us to get rid of some `assert tokenizer.is_fast` in the examples.
In all cases, we should obviously have the same design across the `AutoTokenizer` class methods.
Wdyt?
I think it'd be a good idea to default to `None` and raise a warning/error if the boolean is explicitly set to `True` and there is no fast tokenizer. I'd opt for a warning though (maybe saying that this behavior will result in an error in the future), as an error would be breaking for `AutoTokenizer.from_pretrained(...)`.
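A minimal sketch (my own illustration, not code from this PR) of the design being discussed: `use_fast` defaults to `None` and keeps today's try-fast-then-slow behavior, while an explicit `use_fast=True` with no fast tokenizer available triggers a warning that may later become an error. `mapping_names` again stands in for `TOKENIZER_MAPPING_NAMES`.

```python
import warnings

def resolve_tokenizer_class_name(model_name, mapping_names, use_fast=None):
    # mapping_names: model type -> (slow_class_name, fast_class_name_or_None)
    slow_name, fast_name = mapping_names[model_name]
    if fast_name is not None and use_fast in (None, True):
        return fast_name
    if use_fast is True:
        # Explicit request that cannot be honored: warn for now, maybe error later.
        warnings.warn(
            f"No fast tokenizer is available for '{model_name}'; falling back to '{slow_name}'. "
            "Passing use_fast=True here may raise an error in a future version."
        )
    return slow_name
```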
After talking to Lysandre offline, a better solution might actually be to leverage the [...].

The main reason for this PR is to make fine-tuning of pretrained speech recognition models easier. I want to be able to write a fine-tuning script just by using [...]. To give a bit more context: in speech recognition, a pretrained speech model (Wav2Vec2, HuBERT) has the following files defined after pretraining: [...]

Note that no tokenizer is needed for pretraining, so one has to be defined on the fly for fine-tuning. For fine-tuning, the workflow is then as follows:
```python
import json

from transformers import AutoConfig

# 1. load the config
config = AutoConfig.from_pretrained("facebook/wav2vec2-base")

# 2. create the vocabulary
characters = training_data["unique_chars"]
vocab_dict = {v: k for k, v in enumerate(characters)}

# 3. save the vocabulary locally to be able to create the tokenizer
with open("vocab.json", "w") as f:
    json.dump(vocab_dict, f)

# 4. IMPORTANT: now I want to instantiate a new tokenizer using the just created `vocab.json`
# and the model type. The tokenizer will be uploaded to the model repo after fine-tuning.
# <- so possible solutions to this are
# 1. AutoTokenizer.from_name(config.model_type)
# 2. AutoTokenizer.from_config(config)
# 3. AutoTokenizer.from_pretrained("./", config=config)
# 4. AutoTokenizer.from_pretrained("./", model_type=config.model_type)
```

Case #1 was the proposal of this PR. => Having talked to @LysandreJik a bit more about it, I think it does indeed not necessarily make sense to create a whole new API of the [...]. Also, it seems like in the case of training a tokenizer from scratch, one would simply always make use of [...].

@SaulLu - I very much agree that we should not assume that the user knows how to spell the model name. Rather, it should be extracted from the configuration file.

I would be in favor of Case #4 as I think it's the cleanest approach, but given the API for Case #3 already exists, I guess I would also be fine with using that one. Still, I think it's quite confusing to pass the model config to the AutoTokenizer, taking into consideration that we have a tokenizer config with [...].

What do you guys think? (Sorry for the long discussions here!)
Thanks for the write-up @patrickvonplaten. I think Case #4 is the cleanest approach too, even if I still dislike [...]. This is a nitpick though, happy to go with either.
I don't mind adding the API for Case #4, but I share Lysandre's view on the name of the argument.
Superseded by #13668
This PR adds a new `.from_model_name(...)` class method to `AutoTokenizer`.

In some example scripts we would like to train the tokenizer from scratch or create it from scratch - e.g. see #13620. Depending on which model is used, the corresponding tokenizer class should be loaded. I think the cleanest way to pick the corresponding tokenizer class is to go over the model names, such as `"bert"`, `"gpt2"`, etc.
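A hypothetical usage example of the proposed class method (my own sketch of the API this PR proposes, which was ultimately superseded by #13668), assuming a vocab file has already been created:

```python
from transformers import AutoTokenizer

# Pick the tokenizer class from the model type name rather than from a checkpoint on the Hub.
tokenizer = AutoTokenizer.from_model_name("bert", vocab_file="path/to/vocab.txt")
```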