
Conversation

@zucchini-nlp
Member

@zucchini-nlp zucchini-nlp commented Sep 10, 2025

What does this PR do?

Draft PR that will allow us to have strict type validation on all processing kwargs without having to add a dataclass object for each. The idea is to keep TypedDict for hinting and dynamically adapt each TypedDict to be compatible with the huggingface_hub @strict validators.

This will let us get rid of some of the validation we already have in vision processing and enforce better validation on all kwargs.

For reviewers: I recommend starting from /utils/type_validators.py and processing_utils.py. The model files just fix incorrect and incomplete type hints we had.
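To illustrate the idea with a minimal sketch (kwarg names are only an example; the real TypedDicts live in processing_utils.py): the TypedDict keeps serving as the **kwargs hint, and validator functions attached via Annotated give the strict machinery something to check beyond the plain types.

from typing import Annotated, TypedDict

def positive_float(value: float):
    if value < 0:
        raise ValueError(f"Value must be positive, got {value}")

class ImagesKwargs(TypedDict, total=False):
    do_resize: bool
    rescale_factor: Annotated[float, positive_float]

# ImagesKwargs stays usable as Unpack[ImagesKwargs] in processor signatures,
# while the validators run on the values users actually pass.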

@HuggingFaceDocBuilderDev

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.

@zucchini-nlp zucchini-nlp changed the title from "[WIP] Validate processing kwargs with @strict from huggingface_hub" to "Validate processing kwargs with @strict from huggingface_hub" on Sep 12, 2025
@zucchini-nlp
Member Author

@gante I think you’re the best person to review this.

One issue I ran into is with fields that rely on safe imports (e.g. torch or PIL objects). In CI they're currently failing with errors like input is not of type ForwardRef('torch.Tensor'), because the Hub validators add auto-validation from type hints and we don't evaluate the type hints with the required imports available.

I suggest we skip the auto-validation when the type hint resolves to a ForwardRef, and only run the validator functions explicitly defined in the Annotated metadata. For all other fields with basic Python built-in types, we'll continue validating against both the type hints and any provided validation functions. WDYT?
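Roughly the skip rule I have in mind, as an illustrative helper rather than the exact code in the PR:

import typing

def auto_validatable_hints(typed_dict_cls):
    # Try to resolve the TypedDict annotations; when a gated package such as
    # torch or PIL is not importable here, fall back to the raw annotations.
    try:
        hints = typing.get_type_hints(typed_dict_cls, include_extras=True)
    except NameError:
        hints = dict(typed_dict_cls.__annotations__)
    # Auto-validate only the fields whose hint is fully resolved; ForwardRefs
    # (or plain string annotations) are left to the explicit validator
    # functions declared in the Annotated metadata.
    return {
        name: hint
        for name, hint in hints.items()
        if not isinstance(hint, (str, typing.ForwardRef))
    }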

@gante
Member

gante commented Sep 12, 2025

Two global comments from a brief review of the PR:

  1. Forward references: have you tried decorating the class with @requires(backend=...), and not gating the import for type hints? Example: see MobileNetV2ImageProcessor -- it uses direct type hints of things that are normally gated (this wouldn't solve the case of forward references to prevent circular imports, but may solve your cases here);
  2. This PR adds a new class that applies the core of strict dataclasses to a TypedDict, as you mention in the PR header. Since dataclasses and TypedDicts are very similar, would it be possible to replace the TypedDict with a @strict dataclass (see the sketch after this list)? If that is indeed a possibility, it would solve the problem without the need to create (and maintain) a new abstraction. (v5 is around the corner, now is a good time to consider breaking changes :D )
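Something along these lines, assuming @strict can be imported from huggingface_hub.dataclasses as in the hub docs (field names are illustrative):

from dataclasses import dataclass
from huggingface_hub.dataclasses import strict

@strict
@dataclass
class ImagesKwargs:
    do_resize: bool = True
    rescale_factor: float = 1 / 255

# Fields are validated against their type hints when the object is created,
# so ImagesKwargs(do_resize="yes") would raise instead of silently passing a
# bad value downstream.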

Other than that, data validation 🫶

@zucchini-nlp
Member Author

Since dataclasses and TypedDicts are very similar, would it be possible to replace the TypedDict with a @strict dataclass?

I wanted to do that as my first option, though it would mean we'd have to delete the TypedDicts and lose the beautiful type hinting in **kwargs. We don't list kwargs explicitly in processors currently, so we've been relying on Unpack to give users a sense of what to pass or not.
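For reference, this is the pattern we'd lose, in a trimmed-down sketch (real processors declare many more kwargs):

from typing import TypedDict, Unpack  # Unpack is in typing from 3.11, otherwise typing_extensions

class MyProcessorKwargs(TypedDict, total=False):
    padding: bool
    max_length: int

class MyProcessor:
    def __call__(self, text=None, images=None, **kwargs: Unpack[MyProcessorKwargs]):
        # IDEs and type checkers surface padding/max_length here even though
        # the signature itself only says **kwargs
        ...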

Forward references: have you tried decorating the class with @requires(backend=...), and not gating the import for type hints?

Thanks, I will take a look. TBH these can be evaluated by Python, but that requires the package to be in the namespace and already imported. Sometimes the packages are imported in another file, which is imported by another, and so on... typical large codebase. So this caused issues because Python couldn't resolve the package name. I will definitely try your way as well.

@gante
Member

gante commented Sep 12, 2025

I wanted to do that as my first option, though it would mean we'd have to delete the TypedDicts and lose the beautiful type hinting in **kwargs. We don't list kwargs explicitly in processors currently, so we've been relying on Unpack to give users a sense of what to pass or not.

That's a good point, and important for DevX. In that case, I'd suggest documenting this design decision in TypedDictAdapter so our future selves know why this mapping exists instead of a plain dataclass :) In the future, if we find limitations with this design, we can always go the dataclass path.

Ping me again when you're happy with a solution for the forward references, so I can nitpick a version closer to its merging state :D

@zucchini-nlp
Member Author

@gante Updated the docs. For the second point on safe type checks:

Using @requires doesn't help much because it simply halts the import. In this case the packages are all installed; the issue is that the type annotations cannot be resolved. Typing resolution needs access to all packages that may appear in the annotations, but it doesn't recursively resolve the imported objects.

For example, if we import ImageInput (roughly a union of "PIL.Image" and "torch.Tensor") from a utility module and hint a kwarg as query_images: ImageInput, the resolution fails to find the PIL package and thus cannot evaluate the annotation. The solution is either to explicitly import PIL and import torch, or to skip the evaluation and leave the ForwardRef. I opted for the second choice.

I couldn't find a way to make Python resolve annotations using the imported modules' namespaces as well; have you encountered anything similar before?
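A condensed repro of what I mean, with everything in one file for brevity (in transformers the alias and the PIL/torch imports live in other modules):

import typing

# alias defined with string (forward) references, similar to how ImageInput is defined
ImageInput = typing.Union["PIL.Image.Image", "torch.Tensor"]

class Kwargs(typing.TypedDict, total=False):
    query_images: ImageInput

# raises NameError: name 'PIL' is not defined, unless PIL/torch are imported
# in this module's namespace -- the hint is not resolved through the module
# that defines the alias
typing.get_type_hints(Kwargs)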

@gante
Member

gante commented Sep 18, 2025

@zucchini-nlp I haven't 🤔

(Then maybe the solution for the future is to add forward reference resolution to the hub code)

@zucchini-nlp
Member Author

Yep, that is what I also wanted. Though it means we first need to wait for the next hub release and pin the version in our deps. Let me submit a PR on the hub and see.

Member

@gante gante left a comment

LGTM, added a few nits 🤗

I like that adding the validators actually resulted in catching many issues with type hints and test values 🫶

Missing: minimal tests for TypedDictAdapter (e.g. that it raises an exception when it should)

Member

@gante gante left a comment

oops, meant to approve!

@Wauplin
Contributor

Wauplin commented Oct 2, 2025

Hey @zucchini-nlp @gante, I have now opened huggingface/huggingface_hub#3408 to validate TypedDict schemas directly from huggingface_hub. Usage is pretty straightforward:

from typing import Annotated, TypedDict
from huggingface_hub.dataclasses import validate_typed_dict

def positive_int(value: int):
    if not value >= 0:
        raise ValueError(f"Value must be positive, got {value}")

class User(TypedDict):
    name: str
    age: Annotated[int, positive_int]

# Valid data
validate_typed_dict(User, {"name": "John", "age": 30})
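
# Invalid data: the positive_int validator above raises ValueError
# (extra illustrative call, not part of the original snippet)
validate_typed_dict(User, {"name": "John", "age": -1})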

Could you have a look at this PR huggingface/huggingface_hub#3408 and let me know what you think? It would be even nicer if you could install it locally and adapt this branch to check that it correctly handles validation.

Note: I suppose at some point you could even have a mixin class that automatically validates the Unpack[TypedDictStuff] type annotations in transformers classes (though that's a story for another day 😄)
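For the record, a very rough sketch of what that mixin could look like (hypothetical names; nothing here exists in transformers or huggingface_hub beyond validate_typed_dict):

from huggingface_hub.dataclasses import validate_typed_dict

class ValidatedKwargsMixin:
    # hypothetical: each processor points this at the TypedDict its **kwargs follow
    valid_kwargs_annotations = None

    def _validate_kwargs(self, kwargs: dict) -> dict:
        if self.valid_kwargs_annotations is not None:
            validate_typed_dict(self.valid_kwargs_annotations, kwargs)
        return kwargs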

@zucchini-nlp
Member Author

@Wauplin hi, will there be a 1.0.0rc3 release soon that includes the TypedDict validation? Or can I merge this before the release? Otherwise it keeps finding new errors after rebasing 🥲

@Wauplin
Contributor

Wauplin commented Oct 8, 2025

Hey @zucchini-nlp, sorry for the delay. 1.0.0rc3 is out, including the huggingface/huggingface_hub#3408 changes: https://pypi.org/project/huggingface-hub/1.0.0rc3/

@molbap molbap self-requested a review October 8, 2025 10:16
Contributor

@molbap molbap left a comment

Massive work @zucchini-nlp!! Left a couple of comments.

@Wauplin Wauplin self-requested a review October 8, 2025 12:11
Contributor

@Wauplin Wauplin left a comment

Looking good from a huggingface_hub integration perspective, but I left a comment regarding optional values. Let me know what you think :)

@zucchini-nlp zucchini-nlp enabled auto-merge (squash) October 8, 2025 13:45
@github-actions
Contributor

github-actions bot commented Oct 8, 2025

[For maintainers] Suggested jobs to run (before merge)

run-slow: aria, beit, bridgetower, cohere2_vision, conditional_detr, convnext, deepseek_vl, deepseek_vl_hybrid, deformable_detr, detr, dia, donut, dpt, efficientloftr, efficientnet, emu3

@zucchini-nlp zucchini-nlp merged commit 89a4115 into huggingface:main Oct 8, 2025
25 checks passed
@Wauplin
Contributor

Wauplin commented Oct 8, 2025

Yay! Glad to see @strict finally making it officially into transformers! 🎉

omsherikar pushed a commit to omsherikar/transformers that referenced this pull request Oct 8, 2025
…face#40793)

* initial design draft

* delete

* fix a few tests

* fix

* fix the rest of tests

* common-kwargs

* why the runner complains about typing with "|"?

* revert

* forgot to delete

* update

* fix last issues

* add more detalis in docs

* pin the latest hub release

* fix tests for new models

* also fast image processor

* fix copies

* image processing ast validated

* fix more tests

* typo.and fix copies

* bump

* style

* fix some tests

* fix copies

* pin rc4 and mark all TypedDict as non-total

* delete typed dict adaptor

* address comments

* delete optionals
AhnJoonSung pushed a commit to AhnJoonSung/transformers that referenced this pull request Oct 12, 2025