
Conversation

@zucchini-nlp
Member

@zucchini-nlp zucchini-nlp commented Sep 10, 2025

What does this PR do?

Draft PR that will allow us to have strict type validation on all processing kwargs without having to add a dataclass object for each. The idea is to keep TypedDict for hinting and dynamically adapt each TypedDict to be compatible with the huggingface_hub @strict validators.

This will let us get rid of some of the validation we already have in vision processing and enforce better validation on all kwargs.

For reviewers: I recommend starting from /utils/type_validators.py and processing_utils.py. The model files just fix incorrect and incomplete type hints we had.
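To illustrate the idea with a minimal sketch (kwarg names are only an example; the real TypedDicts live in processing_utils.py): the TypedDict keeps serving as the **kwargs hint, and validator functions attached via Annotated give the strict machinery something to check beyond the plain types.

from typing import Annotated, TypedDict

def positive_float(value: float):
    if value < 0:
        raise ValueError(f"Value must be positive, got {value}")

class ImagesKwargs(TypedDict, total=False):
    do_resize: bool
    rescale_factor: Annotated[float, positive_float]

# ImagesKwargs stays usable as Unpack[ImagesKwargs] in processor signatures,
# while the validators run on the values users actually pass.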

@HuggingFaceDocBuilderDev

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.

@zucchini-nlp zucchini-nlp changed the title from "[WIP] Validate processing kwargs with @strict from huggingface_hub" to "Validate processing kwargs with @strict from huggingface_hub" on Sep 12, 2025
@zucchini-nlp
Member Author

@gante I think you’re the best person to review this.

One issue I ran into is with fields that rely on safe imports (e.g. torch or PIL objects). In CI they're currently failing with errors like input is not of type ForwardRef('torch.Tensor'), because the Hub validators add auto-validation from type hints and we don't evaluate the type hints with the required imports available.

I suggest we skip the auto-validation when the type hint resolves to a ForwardRef, and only run the validator functions explicitly defined in the Annotated metadata. For all other fields with basic Python built-in types, we'll continue validating against both the type hints and any provided validation functions. WDYT?
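Roughly the skip rule I have in mind, as an illustrative helper rather than the exact code in the PR:

import typing

def auto_validatable_hints(typed_dict_cls):
    # Try to resolve the TypedDict annotations; when a gated package such as
    # torch or PIL is not importable here, fall back to the raw annotations.
    try:
        hints = typing.get_type_hints(typed_dict_cls, include_extras=True)
    except NameError:
        hints = dict(typed_dict_cls.__annotations__)
    # Auto-validate only the fields whose hint is fully resolved; ForwardRefs
    # (or plain string annotations) are left to the explicit validator
    # functions declared in the Annotated metadata.
    return {
        name: hint
        for name, hint in hints.items()
        if not isinstance(hint, (str, typing.ForwardRef))
    }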

@gante
Member

gante commented Sep 12, 2025

Two global comments from a brief review of the PR:

  1. Forward references: have you tried decorating the class with @requires(backend=...), and not gating the import for type hints? Example: see MobileNetV2ImageProcessor -- it uses direct type hints of things that are normally gated (this wouldn't solve the case of forward references to prevent circular imports, but may solve your cases here);
  2. This PR adds a new class that applies the core of strict dataclasses to a TypedDict, as you mention in the PR header. Since dataclasses and TypedDicts are very similar, would it be possible to replace the TypedDict with a @strict dataclass (see the sketch after this list)? If that is indeed a possibility, it would solve the problem without the need to create (and maintain) a new abstraction. (v5 is around the corner, now is a good time to consider breaking changes :D )
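Something along these lines, assuming @strict can be imported from huggingface_hub.dataclasses as in the hub docs (field names are illustrative):

from dataclasses import dataclass
from huggingface_hub.dataclasses import strict

@strict
@dataclass
class ImagesKwargs:
    do_resize: bool = True
    rescale_factor: float = 1 / 255

# Fields are validated against their type hints when the object is created,
# so ImagesKwargs(do_resize="yes") would raise instead of silently passing a
# bad value downstream.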

Other than that, data validation 🫶

@zucchini-nlp
Member Author

Since dataclasses and TypedDicts are very similar, would it be possible to replace the TypedDict with a @strict dataclass?

I wanted to do that as my first option, though it would mean we'd have to delete the TypedDicts and lose the beautiful type hinting in **kwargs. We don't list kwargs explicitly in processors currently, so we've been relying on Unpack to give users a sense of what to pass or not.
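For reference, this is the pattern we'd lose, in a trimmed-down sketch (real processors declare many more kwargs):

from typing import TypedDict, Unpack  # Unpack is in typing from 3.11, otherwise typing_extensions

class MyProcessorKwargs(TypedDict, total=False):
    padding: bool
    max_length: int

class MyProcessor:
    def __call__(self, text=None, images=None, **kwargs: Unpack[MyProcessorKwargs]):
        # IDEs and type checkers surface padding/max_length here even though
        # the signature itself only says **kwargs
        ...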

Forward references: have you tried decorating the class with @requires(backend=...), and not gating the import for type hints?

Thanks, I will take a look. TBH these can be evaluated by Python, but that requires the package to be in the namespace and already imported. Sometimes the packages are imported in another file, which is imported by another, and so on... typical large codebase. So this caused issues because Python couldn't resolve the package name. I will definitely try your way as well.

@gante
Member

gante commented Sep 12, 2025

I wanted to do that as my first option, though it would mean we'd have to delete the TypedDicts and lose the beautiful type hinting in **kwargs. We don't list kwargs explicitly in processors currently, so we've been relying on Unpack to give users a sense of what to pass or not.

That's a good point, and important for DevX. In that case, I'd suggest documenting this design decision in TypedDictAdapter so our future selves know why this mapping exists instead of a plain dataclass :) In the future, if we find limitations with this design, we can always go the dataclass path.

Ping me again when you're happy with a solution for the forward references, so I can nitpick a version closer to its merging state :D

@zucchini-nlp
Member Author

@gante Updated the docs. For the second point on safe type checks:

Using @requires doesn't help much because it simply halts the import. In this case the packages are all installed; the issue is that the type annotations cannot be resolved. Typing resolution needs access to all packages that may appear in the annotations, but it doesn't recursively resolve the imported objects.

For example, if we import ImageInput (roughly a union of "PIL.Image" and "torch.Tensor") from a utility module and hint a kwarg as query_images: ImageInput, the resolution fails to find the PIL package and thus cannot evaluate the annotation. The solution is either to explicitly import PIL and import torch, or to skip the evaluation and leave the ForwardRef. I opted for the second choice.

I couldn't find a way to make Python resolve annotations using the imported modules' namespaces as well; have you encountered anything similar before?
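A condensed repro of what I mean, with everything in one file for brevity (in transformers the alias and the PIL/torch imports live in other modules):

import typing

# alias defined with string (forward) references, similar to how ImageInput is defined
ImageInput = typing.Union["PIL.Image.Image", "torch.Tensor"]

class Kwargs(typing.TypedDict, total=False):
    query_images: ImageInput

# raises NameError: name 'PIL' is not defined, unless PIL/torch are imported
# in this module's namespace -- the hint is not resolved through the module
# that defines the alias
typing.get_type_hints(Kwargs)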

@gante
Member

gante commented Sep 18, 2025

@zucchini-nlp I haven't 🤔

(Then maybe the solution for the future is to add forward reference resolution to the hub code)

@zucchini-nlp
Member Author

Yep, that is what I also wanted. Though it means we first need to wait for the next hub release and pin the version in our deps. Let me submit a PR on the hub and see.

Member

@gante gante left a comment

LGTM, added a few nits 🤗

I like that adding the validators actually resulted in catching many issues with type hints and test values 🫶

Missing: minimal tests for TypedDictAdapter (e.g. that it raises an exception when it should)

Member

@gante gante left a comment

oops, meant to approve!

@Wauplin
Contributor

Wauplin commented Oct 2, 2025

Hey @zucchini-nlp @gante, I have now opened huggingface/huggingface_hub#3408 to validate TypedDict schemas directly from huggingface_hub. Usage is pretty straightforward:

from typing import Annotated, TypedDict
from huggingface_hub.dataclasses import validate_typed_dict

def positive_int(value: int):
    if not value >= 0:
        raise ValueError(f"Value must be positive, got {value}")

class User(TypedDict):
    name: str
    age: Annotated[int, positive_int]

# Valid data
validate_typed_dict(User, {"name": "John", "age": 30})
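
# Invalid data: the positive_int validator above raises ValueError
# (extra illustrative call, not part of the original snippet)
validate_typed_dict(User, {"name": "John", "age": -1})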

Could you have a look at this PR huggingface/huggingface_hub#3408 and let me know what you think? It would be even nicer if you could install it locally and adapt this branch to check that it correctly handles validation.

Note: I suppose at some point you could even have a mixin class that automatically validates the Unpack[TypedDictStuff] type annotations in transformers classes (though that's a story for another day 😄)
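For the record, a very rough sketch of what that mixin could look like (hypothetical names; nothing here exists in transformers or huggingface_hub beyond validate_typed_dict):

from huggingface_hub.dataclasses import validate_typed_dict

class ValidatedKwargsMixin:
    # hypothetical: each processor points this at the TypedDict its **kwargs follow
    valid_kwargs_annotations = None

    def _validate_kwargs(self, kwargs: dict) -> dict:
        if self.valid_kwargs_annotations is not None:
            validate_typed_dict(self.valid_kwargs_annotations, kwargs)
        return kwargs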

@zucchini-nlp
Member Author

@Wauplin hi, will there be a 1.0.0rc3 release soon that includes the TypedDict validation? Or can I merge this before the release? Otherwise it keeps finding new errors after rebasing 🥲

@Wauplin
Contributor

Wauplin commented Oct 8, 2025

Hey @zucchini-nlp, sorry for the delay. 1.0.0rc3 is out, including the huggingface/huggingface_hub#3408 changes: https://pypi.org/project/huggingface-hub/1.0.0rc3/

@molbap molbap self-requested a review October 8, 2025 10:16
Contributor

@molbap molbap left a comment

Massive work @zucchini-nlp!! Left a couple of comments.

@Wauplin Wauplin self-requested a review October 8, 2025 12:11
Contributor

@Wauplin Wauplin left a comment

Looking good from a huggingface_hub integration perspective, but I left a comment regarding optional values. Let me know what you think :)

@zucchini-nlp zucchini-nlp enabled auto-merge (squash) October 8, 2025 13:45
@github-actions
Contributor

github-actions bot commented Oct 8, 2025

[For maintainers] Suggested jobs to run (before merge)

run-slow: aria, beit, bridgetower, cohere2_vision, conditional_detr, convnext, deepseek_vl, deepseek_vl_hybrid, deformable_detr, detr, dia, donut, dpt, efficientloftr, efficientnet, emu3

@zucchini-nlp zucchini-nlp merged commit 89a4115 into huggingface:main Oct 8, 2025
25 checks passed
@Wauplin
Contributor

Wauplin commented Oct 8, 2025

Yay! Glad to see @strict finally making it officially into transformers! 🎉

omsherikar pushed a commit to omsherikar/transformers that referenced this pull request Oct 8, 2025
…face#40793)

* initial design draft

* delete

* fix a few tests

* fix

* fix the rest of tests

* common-kwargs

* why the runner complains about typing with "|"?

* revert

* forgot to delete

* update

* fix last issues

* add more detalis in docs

* pin the latest hub release

* fix tests for new models

* also fast image processor

* fix copies

* image processing ast validated

* fix more tests

* typo.and fix copies

* bump

* style

* fix some tests

* fix copies

* pin rc4 and mark all TypedDict as non-total

* delete typed dict adaptor

* address comments

* delete optionals
AhnJoonSung pushed a commit to AhnJoonSung/transformers that referenced this pull request Oct 12, 2025