Skip to content

Conversation

@MilkClouds
Copy link
Contributor

What does this PR do?

This PR fixes a bug in the default initialization of image_rows and image_cols in the Idefics3 and SmolVLM processors.

Problem

The original code incorrectly used len(text) to determine the size of the default image_rows and image_cols lists:

image_rows = inputs.pop("rows", [[0] * len(text)])
image_cols = inputs.pop("cols", [[0] * len(text)])

This creates a single list with length equal to the number of text samples in the batch, which is incorrect. The correct behavior should create a list for each sample, where each inner list has length equal to the number of image tokens in that specific sample.

Solution

Changed the default initialization to count image tokens per sample:

image_rows = inputs.pop("rows", [[0] * sample.count(self.image_token) for sample in text])
image_cols = inputs.pop("cols", [[0] * sample.count(self.image_token) for sample in text])

This ensures that:

  1. Each sample in the batch gets its own list
  2. Each list has the correct length matching the number of image tokens in that sample

Impact

This fix affects:

  • Idefics3Processor in src/transformers/models/idefics3/processing_idefics3.py
  • SmolVLMProcessor in src/transformers/models/smolvlm/processing_smolvlm.py

The bug would have caused issues when processing batches with multiple samples or samples with multiple images, as the default values would have incorrect dimensions.

Before submitting

  • This PR fixes a typo or improves the docs (you can dismiss the other checks if that's the case).
  • Did you read the contributor guideline,
    Pull Request section?
  • Was this discussed/approved via a Github issue or the forum? Please add a link
    to it if that's the case.
  • Did you make sure to update the documentation with your changes? Here are the
    documentation guidelines, and
    here are tips on formatting docstrings.
  • Did you write any new necessary tests?

Who can review?

@zucchini-nlp (multimodal models)

@MilkClouds
Copy link
Contributor Author

MilkClouds commented Oct 26, 2025

p.s. this issue seems to be ignored before because return_row_col_info was always given to True to ImageProcessor! I've met error raised when I used non-SmolVLMImageProcessor for image processor of SmolVLMProcesor.

@MilkClouds MilkClouds marked this pull request as ready for review October 26, 2025 12:47
@github-actions github-actions bot requested review from molbap and yonigozlan October 26, 2025 12:47
@MilkClouds
Copy link
Contributor Author

For reviewer's information: this PR is ready for review!

Copy link
Member

@yonigozlan yonigozlan left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good catch! Thanks for fixing. Suggested a minor improvement, but overall LGTM!

Comment on lines 289 to 290
image_rows = inputs.pop("rows", [[0] * sample.count(self.image_token) for sample in text])
image_cols = inputs.pop("cols", [[0] * sample.count(self.image_token) for sample in text])
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You can use n_images_in_text defined above here

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I changed the lines as you said!

Comment on lines 175 to 180
image_rows = (
image_rows if image_rows is not None else [[0] * sample.count(self.image_token) for sample in text]
)
image_cols = (
image_cols if image_cols is not None else [[0] * sample.count(self.image_token) for sample in text]
)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Same here, you can define image_rows and image_cols before the call to expand_text_with_image_tokens using n_images_in_text

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Same here, I changed the lines as you said

@github-actions
Copy link
Contributor

github-actions bot commented Nov 4, 2025

[For maintainers] Suggested jobs to run (before merge)

run-slow: idefics3, smolvlm

@yonigozlan
Copy link
Member

Perfect thanks! I'll merge once the CI passes

@yonigozlan yonigozlan enabled auto-merge (squash) November 4, 2025 15:32
@yonigozlan yonigozlan merged commit 26fca86 into huggingface:main Nov 4, 2025
15 checks passed
@HuggingFaceDocBuilderDev

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.

yonigozlan pushed a commit to yonigozlan/transformers that referenced this pull request Nov 7, 2025
…SmolVLM processors (huggingface#41871)

* Fix default image_rows and image_cols initialization in Idefics3 and SmolVLM processors

* Fix default initialization of image_rows and image_cols in Idefics3 and SmolVLM processors
Abdennacer-Badaoui pushed a commit to Abdennacer-Badaoui/transformers that referenced this pull request Nov 10, 2025
…SmolVLM processors (huggingface#41871)

* Fix default image_rows and image_cols initialization in Idefics3 and SmolVLM processors

* Fix default initialization of image_rows and image_cols in Idefics3 and SmolVLM processors
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants