Conversation

@tomaarsen
Member

What does this PR do?

  • Add return_dict to get_text_features & get_image_features methods to allow returning 'BaseModelOutputWithPooling'

Fixes #42401

Well, the architectures supporting get_image_features are all extremely different, with wildly different outputs from their get_image_features methods:

  • 2d outputs,
  • 3d outputs,
  • lists of 2d outputs (due to non-matching shapes),
  • an existing 'return_attentions' flag resulting in a 2-tuple,
  • an existing 'return_dict' flag resulting in 3-tuples (???),
  • high quality image embeddings,
  • low quality image embeddings,
  • deepstack image embeddings,
  • etc. etc. etc.

And I only went through like 70-80% of all architectures with get_image_features before I gave up.

Standardisation of all of these sounds like a lost cause. cc @zucchini-nlp I'm curious about your thoughts here. When I did some preliminary research, I only ran into a handful of cases, and I figured we'd be able to reformat them all into one format, but I'm not sure anymore. I added # NOTE: @Tom ... where I figured we might have big problems with standardisation.

For get_text_features it's a lot simpler: there's only one architecture (blip-2) that differs from all the others.

I haven't started on get_audio_features and get_video_features, but there's not too much of a point if we can't get get_image_features normalized.

Before submitting

  • This PR fixes a typo or improves the docs (you can dismiss the other checks if that's the case).
  • Did you read the contributor guideline, Pull Request section?
  • Was this discussed/approved via a GitHub issue or the forum? Please add a link to it if that's the case.
  • Did you make sure to update the documentation with your changes? Here are the documentation guidelines, and here are tips on formatting docstrings.
  • Did you write any new necessary tests?

Who can review?

@zucchini-nlp @ArthurZucker @Cyrilvallez


…ModelOutputWithPooling'

Added to all architectures except blip-2, which has a much different structure here. It uses 'Blip2TextModelWithProjection' to get these embeddings/features, but this class isn't as simple to use
…eModelOutputWithPooling'

@HuggingFaceDocBuilderDev

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.

Member

@zucchini-nlp zucchini-nlp left a comment


We discussed this internally and decided to add last_hidden_states to all models as the last state from the vision block. The pooled embeddings will stay in different shapes, as is.

For the last hidden state the shapes are already more standardized, with a few major options. The only special cases might be qwen-like models, where each image encoding has a different sequence length and thus the outputs are concatenated as length*dim.
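
For illustration, a minimal sketch of what that concatenation looks like, with hypothetical shapes and tensor names (not taken from any particular model):

import torch

# Two images whose patch sequences have different lengths
image_a_feats = torch.randn(196, 1152)  # image A: 196 patches
image_b_feats = torch.randn(256, 1152)  # image B: 256 patches

# qwen-like models concatenate along the sequence axis instead of padding,
# so the "last hidden state" becomes a single (196 + 256, 1152) tensor
concatenated = torch.cat([image_a_feats, image_b_feats], dim=0)
print(concatenated.shape)  # torch.Size([452, 1152])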

@tomaarsen
Member Author

The initial work on all 4 modalities is done, with a handful of exceptions. There are about 2 or 3 breaking architectures, specifically architectures that already supported return_dict and return_attentions. Typings, docstrings, and tests still have to be added, but I'm curious whether this has a chance of being merged before I continue with those.


Member

@zucchini-nlp zucchini-nlp left a comment


Thanks a lot for the changes. I see there are a few tricky models that do not fit neatly into BaseModelOutput.

To wrap it up: to make this work, firstly we need to ensure that all vision encoders are capable of returning a dict in the way that PreTrainedModels do, i.e. by checking config.return_dict and returning attentions, hidden states, pooled output, etc. Then we can ask get_image_features to return the same dict that was output by the encoder (optionally, the pooled output is updated in VLMs). That will preserve all fields of the vision encoder output.

I think the current state of the PR is already doing this, apart from a few non-standard models. I left comments under those models, so lmk if that makes sense.
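
A minimal sketch of that flow, using hypothetical module names (vision_tower, multi_modal_projector) rather than any model's actual implementation:

def get_image_features(self, pixel_values, return_dict=False, **kwargs):
    # The vision encoder returns a BaseModelOutputWithPooling (including
    # attentions/hidden states if requested), the same way PreTrainedModels do
    vision_outputs = self.vision_tower(pixel_values, return_dict=True, **kwargs)
    # The VLM projects the encoder output and stores it as the pooled output,
    # keeping every other field of the encoder output intact
    image_embeds = self.multi_modal_projector(vision_outputs.last_hidden_state)
    if return_dict:
        vision_outputs.pooler_output = image_embeds
        return vision_outputs
    return image_embeds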

Comment on lines 821 to 827

if return_dict:
    return BaseModelOutputWithPooling(
        last_hidden_state=hidden_states,
        pooler_output=merged_hidden_states,
    )

Member


Totally aligned with this, very much needed! I think in qwen-like models, the downsampling and merging are both part of the multimodal adapter. Usually in vision models the last_hidden_state is the last state after all encoder blocks and before the layer norm (e.g. CLIP, SigLIP).

IMO qwen-vision needs the same format
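
In other words, the suggestion amounts to something like the following (simplified, with assumed variable names):

if return_dict:
    return BaseModelOutputWithPooling(
        # state straight after the encoder blocks, before the patch merging /
        # downsampling that conceptually belongs to the multimodal adapter
        last_hidden_state=encoder_hidden_states,
        # merged/downsampled features remain available as the pooled output
        pooler_output=merged_hidden_states,
    )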

Comment on lines 595 to 597
return_dict (`bool`, *optional*, defaults to `False`):
    Whether to return a `ModelOutput` instead of a pooled embedding.
Member


let's add complete docs if they were missing for other args

…model_inputs

The changes in check_model_inputs aren't the clearest/prettiest, but they work well for now.
@tomaarsen
Member Author

I've pushed a proposal in 9a251ce that takes this in a bit of a different direction by adopting the modern TransformersKwargs and check_model_inputs. I updated the latter so that the pooler_output is returned by default, unless the user explicitly passes return_dict=True (which returns a ModelOutput subclass) or return_dict=None (which falls back to the model config's return_dict to decide between a ModelOutput and the pooled embeddings).

I can extend this to more architectures, but want to get your view on this first.
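
As a minimal sketch of the intended resolution logic (a hypothetical helper, not the actual check_model_inputs code):

def resolve_features_output(outputs, return_dict, config):
    # return_dict=False (the new default) -> pooled embeddings only
    # return_dict=True                    -> the full ModelOutput subclass
    # return_dict=None                    -> fall back to config.return_dict
    if return_dict is None:
        return_dict = config.return_dict
    return outputs if return_dict else outputs.pooler_output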

Usage:

from transformers import AutoModel, AutoProcessor
from transformers.image_utils import load_image
import torch

model = AutoModel.from_pretrained("openai/clip-vit-large-patch14", attn_implementation="eager")
processor = AutoProcessor.from_pretrained("openai/clip-vit-large-patch14")

url = "http://images.cocodataset.org/val2017/000000039769.jpg"
image = load_image(url)
image_inputs = processor(images=image, return_tensors="pt")
text_inputs = processor(text=["a photo of a cat"], return_tensors="pt")
joint_inputs = processor(text=["a photo of a cat"], images=image, return_tensors="pt")

def print_output(output):
    if isinstance(output, torch.Tensor):
        print("Output is a tensor with shape:", output.shape)
    else:
        print("Output is a ModelOutput with attributes:")
        for key, value in output.items():
            if isinstance(value, torch.Tensor):
                print(f"  {key}: tensor with shape {value.shape}")
            else:
                print(f"  {key}: {type(value)}")
    print()

with torch.inference_mode():
    image_features = model.get_image_features(**image_inputs)
    print("model.get_image_features(**image_inputs) outputs:")
    print_output(image_features)

    image_features = model.get_image_features(**image_inputs, return_dict=True)
    print("model.get_image_features(**image_inputs, return_dict=True) outputs:")
    print_output(image_features)

    image_features = model.get_image_features(**image_inputs, return_dict=True, output_hidden_states=True, output_attentions=True)
    print("model.get_image_features(**image_inputs, return_dict=True, output_hidden_states=True, output_attentions=True) outputs:")
    print_output(image_features)

    text_features = model.get_text_features(**text_inputs)
    print("model.get_text_features(**text_inputs) outputs:")
    print_output(text_features)

    text_features = model.get_text_features(**text_inputs, return_dict=True)
    print("model.get_text_features(**text_inputs, return_dict=True) outputs:")
    print_output(text_features)

    text_features = model.get_text_features(**text_inputs, return_dict=True, output_hidden_states=True, output_attentions=True)
    print("model.get_text_features(**text_inputs, return_dict=True, output_hidden_states=True, output_attentions=True) outputs:")
    print_output(text_features)

Outputs:

model.get_image_features(**image_inputs) outputs:
Output is a tensor with shape: torch.Size([1, 768])

model.get_image_features(**image_inputs, return_dict=True) outputs:
Output is a ModelOutput with attributes:
  last_hidden_state: tensor with shape torch.Size([1, 257, 1024])
  pooler_output: tensor with shape torch.Size([1, 768])

model.get_image_features(**image_inputs, return_dict=True, output_hidden_states=True, output_attentions=True) outputs:
Output is a ModelOutput with attributes:
  last_hidden_state: tensor with shape torch.Size([1, 257, 1024])
  pooler_output: tensor with shape torch.Size([1, 768])
  hidden_states: <class 'tuple'>
  attentions: <class 'tuple'>

model.get_text_features(**text_inputs) outputs:
Output is a tensor with shape: torch.Size([1, 768])

model.get_text_features(**text_inputs, return_dict=True) outputs:
Output is a ModelOutput with attributes:
  last_hidden_state: tensor with shape torch.Size([1, 7, 768])
  pooler_output: tensor with shape torch.Size([1, 768])

model.get_text_features(**text_inputs, return_dict=True, output_hidden_states=True, output_attentions=True) outputs:
Output is a ModelOutput with attributes:
  last_hidden_state: tensor with shape torch.Size([1, 7, 768])
  pooler_output: tensor with shape torch.Size([1, 768])
  hidden_states: <class 'tuple'>
  attentions: <class 'tuple'>

….._features methods

This commit updates all get_text_features methods, even blip_2, which had not been attempted previously.
A handful of outliers aren't updated yet, e.g. where two or more ModelOutput classes are viable, or the VQ-based ones.

For context, the other modeling file classes haven't been updated with the new get_..._features format, nor have the tests
@tomaarsen
Member Author

tomaarsen commented Dec 16, 2025

For context, these are the TODOs at this point:

  • Unfinished architectures
    • fuyu get_image_features: I don't think Fuyu has a real Vision Encoder beyond just a single Linear
    • blip_2 get_image_features: The new format misses the query_outputs/qformer_outputs; it should use a new ModelOutput subclass somehow (see the sketch after this list).
    • instructblip get_image_features: The new format misses the query_outputs/qformer_outputs, should use a new ModelOutput subclass somehow.
    • instructblipvideo get_video_features: See above
    • kosmos2 get_image_features: The new format misses the projection_attentions, should use a new ModelOutput subclass somehow.
    • ovis2 get_image_features: The new format misses the visual_indicator_features, should use a new ModelOutput subclass somehow.
    • deepseek_vl_hybrid get_image_features: This method produces both low_res_vision_encodings and high_res_vision_encodings, should use a new ModelOutput subclass somehow to combine them.
    • chameleon get_image_features: Update the VQVAE class to output the hidden states before quantization.
    • emu3 get_image_features: Update the VQVAE class to output the hidden states before quantization.
  • Update all architecture classes to accept the new output format
  • Add and/or update tests for the new output format
  • Update docstrings
  • Update type hints
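
For the items above that need an extra field (blip_2, instructblip, kosmos2, ovis2), one option is a small ModelOutput subclass; a sketch with an assumed class name and field, not something that exists in the PR:

from dataclasses import dataclass
from typing import Optional

import torch
from transformers.modeling_outputs import BaseModelOutputWithPooling

@dataclass
class ImageFeaturesOutput(BaseModelOutputWithPooling):
    # Extra field that BaseModelOutputWithPooling has no slot for,
    # e.g. the Q-Former outputs of blip_2-like models
    query_outputs: Optional[torch.FloatTensor] = None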


The Fuyu architecture doesn't have an image encoder:
> Architecturally, Fuyu is a vanilla decoder-only transformer - there is no image encoder.
@tomaarsen
Member Author

I introduced a ModelOutput in f082a8e for Chameleon, although there are many different approaches we can take there. For example, the quantized_last_hidden_state and emb_loss in ChameleonVQVAE.encode are never used, but I've chosen to still return them, although it's unusual to return a loss in a ModelOutput like this. I'm curious about your thoughts on this one @zucchini-nlp. If it seems alright, then I can (presumably) copy the approach to Emu3, which uses a similar VQVAE (although that one luckily only outputs the image_tokens currently, to which I can add the last_hidden_state).
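
For reference, the rough shape of such an output (an illustrative sketch, not necessarily the exact class from f082a8e):

from dataclasses import dataclass
from typing import Optional

import torch
from transformers.utils import ModelOutput

@dataclass
class ChameleonVQVAEEncodeOutput(ModelOutput):
    # Hidden states before quantization, usable as image features
    last_hidden_state: Optional[torch.FloatTensor] = None
    # Quantized counterpart and codebook/embedding loss; currently unused by
    # callers, but returned for completeness
    quantized_last_hidden_state: Optional[torch.FloatTensor] = None
    emb_loss: Optional[torch.FloatTensor] = None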


@github-actions
Contributor

[For maintainers] Suggested jobs to run (before merge)

run-slow: aimv2, align, altclip, aria, audioflamingo3, aya_vision, blip, blip_2, chameleon, chinese_clip, clap, clip, clipseg, clvp, cohere2_vision, colqwen2

@github-actions
Contributor

View the CircleCI Test Summary for this PR:

https://huggingface.co/spaces/transformers-community/circle-ci-viz?pr=42564&sha=7af0b6

@tomaarsen tomaarsen marked this pull request as ready for review December 18, 2025 17:29

Development

Successfully merging this pull request may close these issues.

The get_(text|image|audio|video)_features methods have inconsistent output formats, needs aligning for Sentence Transformers
