multiple tokenizers with different filenames can save now #41837
Conversation
Running the example from the issue, I get:

```
Traceback (most recent call last):
  File "/Users/amitmoryossef/dev/sign/visual-text-decoder/example.py", line 32, in <module>
    processor.save_pretrained(save_directory=temp_dir, push_to_hub=False)
  File "/opt/homebrew/Caskroom/miniconda/base/envs/visual_text_decoder/lib/python3.12/site-packages/transformers/processing_utils.py", line 804, in save_pretrained
    attribute.save_pretrained(attribute_save_dir, save_jinja_files=save_jinja_files)
                                                                   ^^^^^^^^^^^^^^^^
NameError: name 'save_jinja_files' is not defined
```
Hi @AmitMY, can you try it once again?
Strangely, now the error is
It would be good if you added a test for your code. Create something like:

```python
import tempfile

from transformers import AutoTokenizer, PreTrainedTokenizer, ProcessorMixin
from transformers.testing_utils import TestCasePlus


class ProcessorSavePretrainedMultipleAttributes(TestCasePlus):
    def test_processor_loads_separate_attributes(self):
        class OtherProcessor(ProcessorMixin):
            name = "other-processor"
            attributes = [
                "tokenizer1",
                "tokenizer2",
            ]
            tokenizer1_class = "AutoTokenizer"
            tokenizer2_class = "AutoTokenizer"

            def __init__(self, tokenizer1: PreTrainedTokenizer, tokenizer2: PreTrainedTokenizer):
                super().__init__(tokenizer1=tokenizer1, tokenizer2=tokenizer2)

        tokenizer1 = AutoTokenizer.from_pretrained("google/gemma-3-270m")
        tokenizer2 = AutoTokenizer.from_pretrained("HuggingFaceTB/SmolLM2-1.7B")

        processor = OtherProcessor(tokenizer1=tokenizer1, tokenizer2=tokenizer2)
        assert processor.tokenizer1.__class__ != processor.tokenizer2.__class__

        with tempfile.TemporaryDirectory() as temp_dir:
            # Save processor
            processor.save_pretrained(save_directory=temp_dir, push_to_hub=False)

            # Load processor
            new_processor = OtherProcessor.from_pretrained(temp_dir)
            assert new_processor.tokenizer1.__class__ != new_processor.tokenizer2.__class__
```
a9b0d95 to c2ab93f
AmitMY left a comment
Cool! Now I guess we need to make all the other tests pass...
… comments and serialization behavior
AmitMY left a comment
Look at the diff. Ideally, don't change things that are not relevant.
Moved docstring for clarity and added it to the cast_array_to_list function.
Hello @AmitMY, thanks for your continuous feedback and detailed reports!...
Tests still fail.
molbap left a comment
Took a look, thanks for the contrib! I agree with @yonigozlan on the issue description: we are currently revamping a lot for v5 and need to be careful about all API changes, so we're exercising caution.
```diff
     return f"{self.__class__.__name__}:\n{attributes_repr}\n\n{self.to_json_string()}"

-    def save_pretrained(self, save_directory, push_to_hub: bool = False, **kwargs):
+    def save_pretrained(self, save_directory, save_jinja_files=False, push_to_hub: bool = False, **kwargs):
```
Let's avoid adding an arg here for API consistency; I don't think it's needed either.
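A minimal sketch of one way to follow this suggestion, keeping the original signature and forwarding options through `**kwargs`. The helper below is hypothetical (not the actual `ProcessorMixin` code), and it assumes each sub-component's `save_pretrained` accepts `save_jinja_files` as a keyword argument:

```python
import os


def save_processor_attributes(processor, save_directory, **kwargs):
    # Illustrative only: extra options (e.g. save_jinja_files=True) flow through
    # **kwargs instead of becoming a new named parameter on the processor itself.
    os.makedirs(save_directory, exist_ok=True)
    for attribute_name in processor.attributes:
        attribute = getattr(processor, attribute_name)
        attribute.save_pretrained(save_directory, **kwargs)
```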
This does not need to be in an entirely new file; it can live in test_processing_common and actually be part of the processor testing suite.
```python
from transformers.testing_utils import TestCasePlus


class ProcessorSavePretrainedMultipleAttributes(TestCasePlus):
```
The existing test case test_processor_from_and_save_pretrained should be modified to handle the multi-tokenizer case.
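A rough sketch of what such an extension could check (hypothetical helper; the real test_processor_from_and_save_pretrained in the common test suite has more setup than shown here):

```python
import tempfile


def check_multi_tokenizer_round_trip(processor_class, processor):
    # Hypothetical addition: after the usual save/load round trip, verify that
    # every tokenizer-like attribute keeps its own class instead of being
    # overwritten by another tokenizer's files.
    with tempfile.TemporaryDirectory() as tmp_dir:
        processor.save_pretrained(tmp_dir)
        reloaded = processor_class.from_pretrained(tmp_dir)
        for attribute_name in processor.attributes:
            if "tokenizer" in attribute_name:
                original = getattr(processor, attribute_name)
                restored = getattr(reloaded, attribute_name)
                assert restored.__class__ is original.__class__
```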
```python
attribute_save_dir = os.path.join(save_directory, attribute_name)
os.makedirs(attribute_save_dir, exist_ok=True)
attribute.save_pretrained(attribute_save_dir, save_jinja_files=save_jinja_files)
```
Indeed, Evolla already uses the subdir approach. However, I'd reserve that for multi-tokenizer models only, checking with len(tokenizer_attributes) > 1 and keeping the normal save path in other cases.
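A minimal sketch of the gating being suggested. Names such as `tokenizer_attributes` and `save_attribute` are assumptions for illustration, not the actual variables in processing_utils.py:

```python
import os


def save_attribute(attribute, attribute_name, save_directory, tokenizer_attributes):
    # Only multi-tokenizer processors get per-attribute subfolders; everything
    # else keeps the existing flat layout for backward compatibility.
    if len(tokenizer_attributes) > 1:
        target_dir = os.path.join(save_directory, attribute_name)
        os.makedirs(target_dir, exist_ok=True)
    else:
        target_dir = save_directory
    attribute.save_pretrained(target_dir)
```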
yonigozlan left a comment
Hey @aijadugar, thanks for working on this. Better support for multiple tokenizers is indeed needed; however, this PR radically changes how we save processor and sub-processor config files, as it would create subfolders for all sub-processors.
To avoid saving too many config files or subfolders, and to centralize the information, we decided that all sub-processor configs (except tokenizers) should be saved as nested dicts in the preprocessor_config.json file. This is not incompatible with having multiple sub-processors of the same type, although that is not yet supported.
As for tokenizers, the solution in this PR seems OK to me, except that the base "tokenizer" would also be saved in a separate folder, which would break backward compatibility with existing models on the Hub. Given that the very large majority of models use only one tokenizer, it makes more sense to me to only create subfolders for additional tokenizers and keep the base tokenizer files at the root of the checkpoint.
The refactoring PR adds support for multiple tokenizers by default (logic is here) and should be merged shortly. I'll try to also add support for multiple sub-processors other than tokenizers in that PR or in a subsequent one.
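For concreteness, here is an illustrative (hypothetical, not taken from the refactoring PR) example of what a checkpoint could look like under the described scheme: sub-processor configs nested in preprocessor_config.json, the base tokenizer at the root, and an extra tokenizer in its own subfolder. All file names and values are made up:

```python
# Hypothetical checkpoint layout under the described scheme:
#
#   checkpoint/
#   ├── preprocessor_config.json   # processor config with nested sub-processor configs
#   ├── tokenizer_config.json      # base tokenizer stays at the root
#   ├── tokenizer.json
#   └── tokenizer_2/               # additional tokenizers get their own subfolder
#       ├── tokenizer_config.json
#       └── tokenizer.json

# Hypothetical nested sub-processor config inside preprocessor_config.json:
preprocessor_config = {
    "processor_class": "MyProcessor",
    "image_processor": {  # nested dict instead of a separate file or subfolder
        "image_processor_type": "SomeImageProcessor",
        "size": {"height": 224, "width": 224},
    },
}
```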
Thanks for the clarification @yonigozlan, I'll wait for the v5 refactoring PR!
What does this PR do?
This PR fixes an issue where saving a custom Processor that includes multiple sub-tokenizers of the same type caused them to overwrite each other during serialization.
The root cause was that all sub-components were being saved using the same default filenames, leading to collisions.
This update introduces unique naming and loading logic in the ProcessorMixin save/load methods, allowing processors with multiple tokenizers to be safely saved and reloaded without data loss.
Fixes #41816
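A loading-side sketch of the idea, assuming each attribute's files may live in a subfolder named after the attribute. `load_attribute` is a hypothetical helper written for this illustration, not a method added by this PR:

```python
import os

from transformers import AutoTokenizer


def load_attribute(pretrained_path, attribute_name):
    # Prefer a per-attribute subfolder if one exists (multi-tokenizer case);
    # otherwise fall back to the checkpoint root (single-tokenizer case).
    candidate = os.path.join(pretrained_path, attribute_name)
    source = candidate if os.path.isdir(candidate) else pretrained_path
    return AutoTokenizer.from_pretrained(source)
```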
Before submitting
I have read the contributor guideline.
The change was discussed in issue #41816.
I’ve tested the processor save/load logic locally with multiple tokenizers.
No documentation changes were required.
Added/verified tests for multiple sub-tokenizers loading correctly.
Who can review?
Tagging maintainers familiar with processor and tokenizer internals:
@Cyrilvallez
@ArthurZucker