[v5] 🚨Refactor subprocessors handling in processors #41633
Conversation
| "AutoFeatureExtractor": "FeatureExtractionMixin", | ||
| "AutoImageProcessor": "ImageProcessingMixin", | ||
| "AutoVideoProcessor": "BaseVideoProcessor", | ||
| "audio_tokenizer": "DacModel", |
We should be able to use AutoModelForAudioTokenization, no?
transformers/src/transformers/models/auto/modeling_auto.py, lines 2248 to 2249 at e20df45:
class AutoModelForAudioTokenization(_BaseAutoModelClass):
    _model_mapping = MODEL_FOR_AUDIO_TOKENIZATION_MAPPING
We want to standardize this in the future for other models as well
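A minimal sketch of the suggested change (the dict name below is hypothetical; only the entries mirror the diff above): point the audio_tokenizer entry at the auto class rather than a single concrete model.

```python
# Hypothetical mapping illustrating the suggestion: using the auto class means
# future audio tokenizers other than DacModel are also covered.
SUBPROCESSOR_CLASS_NAMES = {
    "AutoFeatureExtractor": "FeatureExtractionMixin",
    "AutoImageProcessor": "ImageProcessingMixin",
    "AutoVideoProcessor": "BaseVideoProcessor",
    "audio_tokenizer": "AutoModelForAudioTokenization",  # instead of "DacModel"
}
```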
zucchini-nlp left a comment
Nice, the `attributes` class attribute was indeed a bit redundant. LGTM, though you might need to rebase on main and check if tests pass. I just merged a non-legacy saving PR.
),
),
("smollm3", (None, "PreTrainedTokenizerFast" if is_tokenizers_available() else None)),
("smolvlm", ("PreTrainedTokenizer", "PreTrainedTokenizerFast" if is_tokenizers_available() else None)),
not related to this PR, but using PreTrainedTokenizer as auto-class looks funny 😄
Yes, not sure if that should be the case. @itazap is that expected / is it a potential issue?
Was wondering the same thing 👀
Indeed, why is it added here? I see smollm3 only has the Fast one
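For context on what is being questioned: each entry in this mapping pairs a model type with a (slow, fast) tokenizer class tuple, so listing the generic base class as the slow tokenizer is the odd part. A hedged sketch of the pattern (variable names are only for illustration):

```python
# In TOKENIZER_MAPPING_NAMES-style tables, each model type maps to a
# (slow_tokenizer, fast_tokenizer) pair of class names; None means that flavor
# does not exist. The surprising part above is smolvlm listing the generic base
# class "PreTrainedTokenizer" as its slow tokenizer.
entry_smollm3 = ("smollm3", (None, "PreTrainedTokenizerFast"))
entry_smolvlm = ("smolvlm", ("PreTrainedTokenizer", "PreTrainedTokenizerFast"))
```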
| ("video_llava", "VideoLlavaVideoProcessor"), | ||
| ("videomae", "VideoMAEVideoProcessor"), | ||
| ("vjepa2", "VJEPA2VideoProcessor"), | ||
| ("video_llama_3", "VideoLlama3VideoProcessor"), # PLACEHOLDER - needs proper video processor class |
Yes, my bad, remnants from the script...
| """ | ||
|
|
||
| attributes = ["image_processor", "tokenizer"] | ||
| valid_kwargs = ["chat_template", "num_image_tokens"] |
oh, let's delete valid_kwargs wherever it was left, it's not used anywhere iirc. Can totally do in a separate PR
Nice, might as well add it here
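A hedged before/after sketch of what dropping both class attributes could look like (the class names are made up; only the two attribute lines come from the diff above):

```python
from transformers import ProcessorMixin


# Before: sub-processors and accepted kwargs are declared as class attributes.
class SomeProcessorBefore(ProcessorMixin):
    attributes = ["image_processor", "tokenizer"]
    valid_kwargs = ["chat_template", "num_image_tokens"]


# After: both lines go away, since `attributes` is removed by this PR and
# `valid_kwargs` is no longer read anywhere.
class SomeProcessorAfter(ProcessorMixin):
    ...
```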
| attributes = ["image_processor", "tokenizer", "qformer_tokenizer"] | ||
| image_processor_class = ("BlipImageProcessor", "BlipImageProcessorFast") | ||
| tokenizer_class = "AutoTokenizer" | ||
| qformer_tokenizer_class = "AutoTokenizer" |
do we need qformer_tokenizer_class? In InstructBlipVideo I can see it is deleted
Indeed we can delete it!
attributes = ["image_processor", "char_tokenizer"]
image_processor_class = ("ViTImageProcessor", "ViTImageProcessorFast")
char_tokenizer_class = "MgpstrTokenizer"
same question, is it because they have a prefix before tokenizer?
yep my scripts failed on these 😅. Thanks for pointing it out!
image_processor = EfficientNetImageProcessor.from_pretrained(self.tmpdirname)
image_processor.save_pretrained(self.tmpdirname)
tokenizer = BertTokenizer.from_pretrained(self.tmpdirname)
tokenizer.save_pretrained(self.tmpdirname)
I think this might cause some issues in tests after I merged non-legacy processor saving. We'll end up with preprocessor_config.json and processor_config.json in the same dir.
I remember some tests try to manipulate configs by loading/saving only the image processor, either standalone or as part of the processor; they might start failing after the rebase.
Ah yes, in general I think we can standardize/simplify the processor tests a lot more in ProcessorTesterMixin. Right now it's a bit of a nightmare every time we want to make a change (I was going crazy on the default-to-fast-image-processors PR because of this). I plan to open a PR to change the tests soon!
Thanks, updating the tests would help a lot, and we could probably delete many overwritten tests as well.
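To make the config-collision concern above concrete, here is a small hedged sketch (model id and classes are only examples, not the actual test fixtures) of how the two config files end up in the same directory:

```python
# Sketch of the scenario described above: saving a standalone image processor
# and then the full processor into the same directory leaves
# preprocessor_config.json and processor_config.json side by side, which is
# exactly what some tests then manipulate.
import tempfile

from transformers import CLIPImageProcessor, CLIPProcessor

with tempfile.TemporaryDirectory() as tmpdirname:
    image_processor = CLIPImageProcessor.from_pretrained("openai/clip-vit-base-patch32")
    image_processor.save_pretrained(tmpdirname)  # writes preprocessor_config.json

    processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")
    processor.save_pretrained(tmpdirname)  # with non-legacy saving, may also write processor_config.json

    # A test that now reloads only the image processor from tmpdirname sees both files.
    reloaded_image_processor = CLIPImageProcessor.from_pretrained(tmpdirname)
```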
Do you want me to trigger a full CI for this PR? Better to have it, @zucchini-nlp can confirm it's helpful 😅. Let me know once it's ready (and you want me to trigger).

I think you can now @ydshieh, thank you!
Didn't see this at first. Looking at the test and remote code, I don't think that will be very breaking. Haven't seen any model in the hub without an

CircleCI still has some failing jobs, so I will wait until it's ✅ here (ping me when it's ✅ 🙏).

Ah sorry about that @ydshieh, it should be good now!

OK, I will trigger, but only share the reports on Monday.

Sounds good, thanks!
molbap left a comment
That will be a nice cleanup 😁 left a few comments!
Cyrilvallez left a comment
Very nice, happy to remove all those unnecessary attributes!! Just a few comments, feel free to merge once you take them into account!
[For maintainers] Suggested jobs to run (before merge): run-slow: align, altclip, aria, auto, aya_vision, bark, blip, blip_2, bridgetower, bros, chameleon, chinese_clip, clap, clip, clipseg, clvp
* remove attributes and add all missing sub processors to their auto classes
* remove all mentions of .attributes
* cleanup
* fix processor tests
* fix modular
* remove last attributes
* fixup
* fixes after merge
* fix wrong tokenizer in auto florence2
* fix missing audio_processor + nits
* Override __init__ in NewProcessor and change hf-internal-testing-repo (temporarily)
* fix auto tokenizer test
* add init to markup_lm
* update CustomProcessor in custom_processing
* remove print
* nit
* fix test modeling owlv2
* fix test_processing_layoutxlm
* Fix owlv2, wav2vec2, markuplm, voxtral issues
* add support for loading and saving multiple tokenizer natively
* remove exclude_attributes from save_pretrained
* modifs after review
What does this PR do?

Refactor the handling of subprocessors in processors.

* Remove the `attributes` attribute in processors, along with all `"subprocessor"_class` attributes.

This PR is a requirement for #41388, as otherwise we'd have to manually check that all `image_processor_class` attributes are set to "AutoImageProcessor".

Cc @ArthurZucker @Cyrilvallez @zucchini-nlp @molbap (and also @ydshieh, as this might break some parts of the CI 👀, although I checked that all processor tests are still passing, except kosmos2.5, but that's because of a PIL.UnidentifiedImageError ;)).

Update: I'm seeing some tests breaking in test_processor_auto.py, related to registering custom processors and subprocessors in transformers. How used is this, and can we break it slightly for v5? 👀

Update 2: It looks like it's not really a problem. The only edge case that will break is if a custom processor was defined by inheriting from ProcessorMixin without overriding __init__. ProcessorMixin used to have "feature_extractor" and "tokenizer" attributes by default; now it doesn't (which makes more sense imo). Fixed the tests by modifying the custom processor on the hub to add an init.

🚨 Breaking change: custom processors that inherit from ProcessorMixin without overriding __init__ no longer get the default "feature_extractor" and "tokenizer" attributes, so they need to define their own __init__. For example:
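Below is a hedged, hypothetical sketch (class and attribute names are illustrative, and the exact super().__init__ signature may differ in the final implementation) of the kind of explicit __init__ a custom hub processor now needs:

```python
from transformers import ProcessorMixin


class CustomProcessor(ProcessorMixin):
    # Previously this class could omit __init__ entirely and rely on ProcessorMixin's
    # default attributes = ["feature_extractor", "tokenizer"]. With that default gone,
    # the sub-processors have to be declared explicitly here.
    def __init__(self, feature_extractor=None, tokenizer=None, **kwargs):
        super().__init__(feature_extractor=feature_extractor, tokenizer=tokenizer, **kwargs)
```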