-
Notifications
You must be signed in to change notification settings - Fork 31.5k
[Mistral Tokenizers] Fix tokenizer detection
#42389
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
| "mistral", | ||
| "mistral3", | ||
| "voxstral", | ||
| "voxtral", |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
fixed a typo here
| elif version.parse(transformers_version) > version.parse("4.57.2"): | ||
| return tokenizer |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I assume we won't need this fix for the newest versions?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Made it explicit with 5.0.0<=
| # Expose the `fix_mistral_regex` flag on the tokenizer when provided, even if no correction is applied. | ||
| if "fix_mistral_regex" in init_kwargs: | ||
| setattr(tokenizer, "fix_mistral_regex", init_kwargs["fix_mistral_regex"]) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Same as before but now it needs to bypass our two safety checks that indicate that we do not need to fix mistral
|
The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update. |
| return False | ||
|
|
||
| if _is_local or is_base_mistral(pretrained_model_name_or_path): | ||
| is_official_mistral_tokenizer = is_base_mistral(pretrained_model_name_or_path) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
you cannot call the hub when is_local
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
yup, fixed it explicitly checking with _is_local now, good point
ArthurZucker
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
thanks the test was much needed 😢
Can you just isolate this as a patch_mixtral_regex func
… into fix-mistral-detection
|
[For maintainers] Suggested jobs to run (before merge) run-slow: auto, llama |
|
Any chance of issuing a new release, please. Breaking the usage of local caches is quite a big regression, at least for me. |
|
Yes we were just waiting on the CI sir |
* fix * sanity check * style * comments * make it v5 explicit * make explicit fixes possible in local tokenizers * remove hub usage on local * fix * extend test for no config case * move mistral patch outside to separate fn * fix local path only * add a tes * make sure test does not pass before this PR * styling * make sure it exists * fix * fix * rename * up * last nit i hope lord --------- Co-authored-by: Arthur <arthur.zucker@gmail.com>
As per title, now checks for the model type properly and adds sanity check for v5+
Fixes #42374
Fixes #42378
Fixes #42369
Closes #42379
Closes #42388