
Conversation

@aijadugar

What does this PR do?

This PR fixes an issue where saving a custom Processor that includes multiple sub-tokenizers of the same type caused them to overwrite each other during serialization.

The root cause was that all sub-components were being saved using the same default filenames, leading to collisions.

This update introduces unique naming and loading logic in the ProcessorMixin save/load methods, allowing processors with multiple tokenizers to be safely saved and reloaded without data loss.

Fixes #41816
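
For context, the core idea looks roughly like the sketch below (simplified and illustrative only; the real change lives inside ProcessorMixin.save_pretrained / from_pretrained, and the function names here are not the actual API):

import os

def save_attributes(processor, save_directory):
    # Save side of the sketch: each sub-component is written to its own
    # subdirectory named after the attribute (e.g. "tokenizer1/", "tokenizer2/"),
    # so two tokenizers of the same type no longer overwrite each other's files.
    for attribute_name in processor.attributes:
        attribute = getattr(processor, attribute_name)
        attribute_save_dir = os.path.join(save_directory, attribute_name)
        os.makedirs(attribute_save_dir, exist_ok=True)
        attribute.save_pretrained(attribute_save_dir)

def resolve_attribute_path(pretrained_model_name_or_path, attribute_name):
    # Load side of the sketch: prefer the per-attribute subdirectory if it
    # exists, otherwise fall back to the root (the previous single-tokenizer layout).
    attribute_path = os.path.join(pretrained_model_name_or_path, attribute_name)
    if os.path.isdir(attribute_path):
        return attribute_path
    return pretrained_model_name_or_path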

Before submitting

I have read the contributor guideline.

The change was discussed in issue #41816.

I’ve tested the processor save/load logic locally with multiple tokenizers.

No documentation changes were required.

Added/verified tests for multiple sub-tokenizers loading correctly.

Who can review?

Tagging maintainers familiar with processor and tokenizer internals:

@Cyrilvallez

@ArthurZucker

AmitMY commented Oct 24, 2025

Running the example from the issue, I get:

Traceback (most recent call last):
  File "/Users/amitmoryossef/dev/sign/visual-text-decoder/example.py", line 32, in <module>
    processor.save_pretrained(save_directory=temp_dir, push_to_hub=False)
  File "/opt/homebrew/Caskroom/miniconda/base/envs/visual_text_decoder/lib/python3.12/site-packages/transformers/processing_utils.py", line 804, in save_pretrained
    attribute.save_pretrained(attribute_save_dir, save_jinja_files=save_jinja_files)
                                                                   ^^^^^^^^^^^^^^^^
NameError: name 'save_jinja_files' is not defined

@aijadugar (Author)

Hi @AmitMY, can you try it once again?


AmitMY commented Oct 24, 2025

Strangely, now the error is:

AttributeError: GemmaTokenizerFast has no attribute to_dict

It would be good if you added a test for your code.

Create tests/utils/test_processor_utils.py:

import tempfile

from transformers import AutoTokenizer, PreTrainedTokenizer, ProcessorMixin
from transformers.testing_utils import TestCasePlus


class ProcessorSavePretrainedMultipleAttributes(TestCasePlus):
    def test_processor_loads_separate_attributes(self):
        class OtherProcessor(ProcessorMixin):
            name = "other-processor"

            attributes = [
                "tokenizer1",
                "tokenizer2",
            ]
            tokenizer1_class = "AutoTokenizer"
            tokenizer2_class = "AutoTokenizer"

            def __init__(self,
                         tokenizer1: PreTrainedTokenizer,
                         tokenizer2: PreTrainedTokenizer
                         ):
                super().__init__(tokenizer1=tokenizer1,
                                 tokenizer2=tokenizer2)

        tokenizer1 = AutoTokenizer.from_pretrained("google/gemma-3-270m")
        tokenizer2 = AutoTokenizer.from_pretrained("HuggingFaceTB/SmolLM2-1.7B")

        processor = OtherProcessor(tokenizer1=tokenizer1,
                                   tokenizer2=tokenizer2)
        assert processor.tokenizer1.__class__ != processor.tokenizer2.__class__

        with tempfile.TemporaryDirectory() as temp_dir:
            # Save processor
            processor.save_pretrained(save_directory=temp_dir, push_to_hub=False)
            # Load processor
            new_processor = OtherProcessor.from_pretrained(temp_dir)

        assert new_processor.tokenizer1.__class__ != new_processor.tokenizer2.__class__
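
If the file is added at tests/utils/test_processor_utils.py as suggested, the test can be run on its own with something like:

python -m pytest tests/utils/test_processor_utils.py -k test_processor_loads_separate_attributes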

@AmitMY left a comment

Cool! Now I guess we need to make all the other tests pass...

@AmitMY left a comment

Look at the diff. Ideally, don't change things that are not relevant.


AmitMY commented Oct 26, 2025

W291 [*] Trailing whitespace
   --> src/transformers/processing_utils.py:658:38
    |
656 |         """
657 |         # shallow copy to avoid deepcopy errors
658 |         output = self.__dict__.copy()  
    |                                      ^^
659 |
660 |         # Get the kwargs in `__init__`.
    |
help: Remove trailing whitespace

F821 Undefined name `save_jinja_files`
   --> src/transformers/processing_utils.py:792:80
    |
790 |                 attribute_save_dir = os.path.join(save_directory, attribute_name)
791 |                 os.makedirs(attribute_save_dir, exist_ok=True)
792 |                 attribute.save_pretrained(attribute_save_dir, save_jinja_files=save_jinja_files)
    |                                                                                ^^^^^^^^^^^^^^^^
793 |             elif attribute._auto_class is not None:
794 |                 custom_object_save(attribute, save_directory, config=attribute)
    |

W291 [*] Trailing whitespace
    --> src/transformers/processing_utils.py:1423:68
     |
1421 |                 attribute_class = cls.get_possibly_dynamic_module(class_name)
1422 |
1423 |             # updated loading path for handling multiple tokenizers 
     |                                                                    ^
1424 |             attribute_path = os.path.join(pretrained_model_name_or_path, attribute_name)
1425 |             if os.path.isdir(attribute_path):
     |
help: Remove trailing whitespace

I001 [*] Import block is un-sorted or un-formatted
 --> tests/test_processor_utils.py:1:1
  |
1 | / import tempfile
2 | |
3 | | from transformers.testing_utils import TestCasePlus
4 | | from transformers import ProcessorMixin, AutoTokenizer, PreTrainedTokenizer
  | |___________________________________________________________________________^
  |
help: Organize imports

Found 4 errors.
[*] 3 fixable with the `--fix` option.
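
Three of the four findings are auto-fixable; assuming the repository's usual tooling (ruff and the Makefile targets), something like the following should clear them, while the F821 undefined name needs a manual fix:

ruff check src/transformers/processing_utils.py tests/test_processor_utils.py --fix
make style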

@aijadugar (Author)

Hello @AmitMY, thanks for your continuous feedback and detailed reports!...


AmitMY commented Oct 26, 2025

Tests still fail

@molbap (Contributor) left a comment

Took a look, thanks for the contribution! I agree with @yonigozlan on the issue description: we are currently revamping a lot for v5 and need to be careful about all API changes, so we're exercising caution.

return f"{self.__class__.__name__}:\n{attributes_repr}\n\n{self.to_json_string()}"

def save_pretrained(self, save_directory, push_to_hub: bool = False, **kwargs):
def save_pretrained(self, save_directory, save_jinja_files=False, push_to_hub: bool = False, **kwargs):

Let's avoid adding an arg here for API consistency, and I don't think it's needed either.


This does not need to be in an entirely new file; it can go in test_processing_common and actually be part of the processor testing suite.

from transformers.testing_utils import TestCasePlus


class ProcessorSavePretrainedMultipleAttributes(TestCasePlus):

The test case test_processor_from_and_save_pretrained should be modified to handle the multi-tokenizer case.

Comment on lines +790 to +792
attribute_save_dir = os.path.join(save_directory, attribute_name)
os.makedirs(attribute_save_dir, exist_ok=True)
attribute.save_pretrained(attribute_save_dir, save_jinja_files=save_jinja_files)

Indeed, Evolla already uses the subdir approach. However, I'd reserve that for multi-tokenizer models only, checking with len(tokenizer_attributes) > 1 and keeping the normal save path in other cases.
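
Roughly what that condition could look like (a sketch only; the names and the surrounding save loop are simplified assumptions, not the actual ProcessorMixin code):

import os

from transformers import PreTrainedTokenizerBase

def pick_save_dir(processor, attribute_name, save_directory):
    # Only processors with more than one tokenizer use per-attribute subfolders;
    # everything else keeps the existing flat layout.
    attribute = getattr(processor, attribute_name)
    tokenizer_attributes = [
        name
        for name in processor.attributes
        if isinstance(getattr(processor, name), PreTrainedTokenizerBase)
    ]
    if isinstance(attribute, PreTrainedTokenizerBase) and len(tokenizer_attributes) > 1:
        # More than one tokenizer: give this one its own subfolder to avoid
        # filename collisions.
        target_dir = os.path.join(save_directory, attribute_name)
        os.makedirs(target_dir, exist_ok=True)
        return target_dir
    # Single tokenizer (the common case): keep the normal save path so existing
    # checkpoints keep working unchanged.
    return save_directory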

@yonigozlan (Member) left a comment

Hey @aijadugar, thanks for working on this. Indeed, better support is needed for multiple tokenizers; however, this PR radically changes how we save processor and subprocessor config files, as it would create subfolders for all subprocessors.

In order to avoid saving too many config files or subfolders, and to centralize the information, we decided that all subprocessor configs (except tokenizers) should be saved as nested dicts in the preprocessor_config.json file. This is not incompatible with having multiple subprocessors of the same type, although this is not yet supported.
As for tokenizers, the solution in this PR seems OK to me, except that the base "tokenizer" would also be saved in a separate folder, which would break backward compatibility with existing models on the Hub. Given that the very large majority of models use only one tokenizer, it makes more sense to me to only create subfolders for the additional tokenizers and keep the base tokenizer files at the root of the checkpoint.

The refactoring PR adds support for multiple tokenizers by default (the logic is here) and should be merged shortly. I'll also try to add support for multiple subprocessors other than tokenizers in that PR or in a subsequent one.
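
Concretely, the layout being described might look something like this (file names are illustrative, not a spec):

checkpoint/
├── preprocessor_config.json    # processor config; non-tokenizer subprocessor
│                               # configs nested inside as dicts
├── tokenizer_config.json       # base tokenizer stays at the checkpoint root
├── tokenizer.json
├── special_tokens_map.json
└── tokenizer_2/                # only additional tokenizers get a subfolder
    ├── tokenizer_config.json
    ├── tokenizer.json
    └── special_tokens_map.json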

@aijadugar (Author)

Thanks for the clarification, @yonigozlan. I'll wait for the v5 refactoring PR!
