multiple tokenizers with different filenames can save now #41837
Conversation
Running the example from the issue, I get:

```
Traceback (most recent call last):
  File "/Users/amitmoryossef/dev/sign/visual-text-decoder/example.py", line 32, in <module>
    processor.save_pretrained(save_directory=temp_dir, push_to_hub=False)
  File "/opt/homebrew/Caskroom/miniconda/base/envs/visual_text_decoder/lib/python3.12/site-packages/transformers/processing_utils.py", line 804, in save_pretrained
    attribute.save_pretrained(attribute_save_dir, save_jinja_files=save_jinja_files)
                                                                   ^^^^^^^^^^^^^^^^
NameError: name 'save_jinja_files' is not defined
```
Hi @AmitMY, can you try it once again?
Strangely, now the error is
It would be good if you added a test for your code. Create something like:

```python
import tempfile

from transformers import AutoTokenizer, PreTrainedTokenizer, ProcessorMixin
from transformers.testing_utils import TestCasePlus


class ProcessorSavePretrainedMultipleAttributes(TestCasePlus):
    def test_processor_loads_separate_attributes(self):
        class OtherProcessor(ProcessorMixin):
            name = "other-processor"
            attributes = [
                "tokenizer1",
                "tokenizer2",
            ]
            tokenizer1_class = "AutoTokenizer"
            tokenizer2_class = "AutoTokenizer"

            def __init__(self, tokenizer1: PreTrainedTokenizer, tokenizer2: PreTrainedTokenizer):
                super().__init__(tokenizer1=tokenizer1, tokenizer2=tokenizer2)

        tokenizer1 = AutoTokenizer.from_pretrained("google/gemma-3-270m")
        tokenizer2 = AutoTokenizer.from_pretrained("HuggingFaceTB/SmolLM2-1.7B")

        processor = OtherProcessor(tokenizer1=tokenizer1, tokenizer2=tokenizer2)
        assert processor.tokenizer1.__class__ != processor.tokenizer2.__class__

        with tempfile.TemporaryDirectory() as temp_dir:
            # Save processor
            processor.save_pretrained(save_directory=temp_dir, push_to_hub=False)

            # Load processor
            new_processor = OtherProcessor.from_pretrained(temp_dir)
            assert new_processor.tokenizer1.__class__ != new_processor.tokenizer2.__class__
```
a9b0d95 to c2ab93f
AmitMY left a comment
Cool! Now I guess we need to make all the other tests pass...
… comments and serialization behavior
AmitMY left a comment
Look at the diff. Ideally, don't change things that are not relevant.
Moved docstring for clarity and added it to the cast_array_to_list function.
Hello @AmitMY, thanks for your continuous feedback and detailed reports!...
Tests still fail.
molbap left a comment
Took a look, thanks for the contrib! I agree with @yonigozlan on the issue description: we are currently revamping a lot for v5 and need to be careful about all API changes, so we're exercising caution.
```diff
     return f"{self.__class__.__name__}:\n{attributes_repr}\n\n{self.to_json_string()}"

-    def save_pretrained(self, save_directory, push_to_hub: bool = False, **kwargs):
+    def save_pretrained(self, save_directory, save_jinja_files=False, push_to_hub: bool = False, **kwargs):
```
Let's avoid adding an arg here for API consistency; I don't think it's needed either.
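A minimal sketch of one way to follow this suggestion, keeping the original signature and forwarding options through `**kwargs`. The helper below is hypothetical (not the actual `ProcessorMixin` code), and it assumes each sub-component's `save_pretrained` accepts `save_jinja_files` as a keyword argument:

```python
import os


def save_processor_attributes(processor, save_directory, **kwargs):
    # Illustrative only: extra options (e.g. save_jinja_files=True) flow through
    # **kwargs instead of becoming a new named parameter on the processor itself.
    os.makedirs(save_directory, exist_ok=True)
    for attribute_name in processor.attributes:
        attribute = getattr(processor, attribute_name)
        attribute.save_pretrained(save_directory, **kwargs)
```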
This does not need to be in an entirely new file; it can live in test_processing_common and actually be part of the processor testing suite.
```python
from transformers.testing_utils import TestCasePlus


class ProcessorSavePretrainedMultipleAttributes(TestCasePlus):
```
The existing test case test_processor_from_and_save_pretrained should be modified to handle the multi-tokenizer case.
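A rough sketch of what such an extension could check (hypothetical helper; the real test_processor_from_and_save_pretrained in the common test suite has more setup than shown here):

```python
import tempfile


def check_multi_tokenizer_round_trip(processor_class, processor):
    # Hypothetical addition: after the usual save/load round trip, verify that
    # every tokenizer-like attribute keeps its own class instead of being
    # overwritten by another tokenizer's files.
    with tempfile.TemporaryDirectory() as tmp_dir:
        processor.save_pretrained(tmp_dir)
        reloaded = processor_class.from_pretrained(tmp_dir)
        for attribute_name in processor.attributes:
            if "tokenizer" in attribute_name:
                original = getattr(processor, attribute_name)
                restored = getattr(reloaded, attribute_name)
                assert restored.__class__ is original.__class__
```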
```python
attribute_save_dir = os.path.join(save_directory, attribute_name)
os.makedirs(attribute_save_dir, exist_ok=True)
attribute.save_pretrained(attribute_save_dir, save_jinja_files=save_jinja_files)
```
Indeed, Evolla already uses the subdir approach. However, I'd reserve that for multi-tokenizer models only, checking with len(tokenizer_attributes) > 1 and keeping the normal save path in other cases.
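A minimal sketch of the gating being suggested. Names such as `tokenizer_attributes` and `save_attribute` are assumptions for illustration, not the actual variables in processing_utils.py:

```python
import os


def save_attribute(attribute, attribute_name, save_directory, tokenizer_attributes):
    # Only multi-tokenizer processors get per-attribute subfolders; everything
    # else keeps the existing flat layout for backward compatibility.
    if len(tokenizer_attributes) > 1:
        target_dir = os.path.join(save_directory, attribute_name)
        os.makedirs(target_dir, exist_ok=True)
    else:
        target_dir = save_directory
    attribute.save_pretrained(target_dir)
```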
yonigozlan left a comment
Hey @aijadugar, thanks for working on this. Better support for multiple tokenizers is indeed needed; however, this PR radically changes how we save processor and sub-processor config files, as it would create subfolders for all sub-processors.
To avoid saving too many config files or subfolders, and to centralize the information, we decided that all sub-processor configs (except tokenizers) should be saved as nested dicts in the preprocessor_config.json file. This is not incompatible with having multiple sub-processors of the same type, although that is not yet supported.
As for tokenizers, the solution in this PR seems OK to me, except that the base "tokenizer" would also be saved in a separate folder, which would break backward compatibility with existing models on the Hub. Given that the very large majority of models use only one tokenizer, it makes more sense to me to only create subfolders for additional tokenizers and keep the base tokenizer files at the root of the checkpoint.
The refactoring PR adds support for multiple tokenizers by default (logic is here) and should be merged shortly. I'll try to also add support for multiple sub-processors other than tokenizers in that PR or in a subsequent one.
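For concreteness, here is an illustrative (hypothetical, not taken from the refactoring PR) example of what a checkpoint could look like under the described scheme: sub-processor configs nested in preprocessor_config.json, the base tokenizer at the root, and an extra tokenizer in its own subfolder. All file names and values are made up:

```python
# Hypothetical checkpoint layout under the described scheme:
#
#   checkpoint/
#   ├── preprocessor_config.json   # processor config with nested sub-processor configs
#   ├── tokenizer_config.json      # base tokenizer stays at the root
#   ├── tokenizer.json
#   └── tokenizer_2/               # additional tokenizers get their own subfolder
#       ├── tokenizer_config.json
#       └── tokenizer.json

# Hypothetical nested sub-processor config inside preprocessor_config.json:
preprocessor_config = {
    "processor_class": "MyProcessor",
    "image_processor": {  # nested dict instead of a separate file or subfolder
        "image_processor_type": "SomeImageProcessor",
        "size": {"height": 224, "width": 224},
    },
}
```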
Thanks for the clarification @yonigozlan, I'll wait for the v5 refactoring PR!
What does this PR do?
This PR fixes an issue where saving a custom Processor that includes multiple sub-tokenizers of the same type caused them to overwrite each other during serialization.
The root cause was that all sub-components were being saved using the same default filenames, leading to collisions.
This update introduces unique naming and loading logic in the ProcessorMixin save/load methods, allowing processors with multiple tokenizers to be safely saved and reloaded without data loss.
Fixes #41816
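A loading-side sketch of the idea, assuming each attribute's files may live in a subfolder named after the attribute. `load_attribute` is a hypothetical helper written for this illustration, not a method added by this PR:

```python
import os

from transformers import AutoTokenizer


def load_attribute(pretrained_path, attribute_name):
    # Prefer a per-attribute subfolder if one exists (multi-tokenizer case);
    # otherwise fall back to the checkpoint root (single-tokenizer case).
    candidate = os.path.join(pretrained_path, attribute_name)
    source = candidate if os.path.isdir(candidate) else pretrained_path
    return AutoTokenizer.from_pretrained(source)
```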
Before submitting
I have read the contributor guideline.
The change was discussed in issue #41816.
I’ve tested the processor save/load logic locally with multiple tokenizers.
No documentation changes were required.
Added/verified tests for multiple sub-tokenizers loading correctly.
Who can review?
Tagging maintainers familiar with processor and tokenizer internals:
@Cyrilvallez
@ArthurZucker