-
Notifications
You must be signed in to change notification settings - Fork 31.1k
Fix default image_rows and image_cols initialization in Idefics3 and SmolVLM processors #41871
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Fix default image_rows and image_cols initialization in Idefics3 and SmolVLM processors #41871
Conversation
…SmolVLM processors
|
p.s. this issue seems to be ignored before because |
|
For reviewer's information: this PR is ready for review! |
yonigozlan
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Good catch! Thanks for fixing. Suggested a minor improvement, but overall LGTM!
| image_rows = inputs.pop("rows", [[0] * sample.count(self.image_token) for sample in text]) | ||
| image_cols = inputs.pop("cols", [[0] * sample.count(self.image_token) for sample in text]) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You can use n_images_in_text defined above here
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I changed the lines as you said!
| image_rows = ( | ||
| image_rows if image_rows is not None else [[0] * sample.count(self.image_token) for sample in text] | ||
| ) | ||
| image_cols = ( | ||
| image_cols if image_cols is not None else [[0] * sample.count(self.image_token) for sample in text] | ||
| ) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Same here, you can define image_rows and image_cols before the call to expand_text_with_image_tokens using n_images_in_text
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Same here, I changed the lines as you said
…nd SmolVLM processors
|
[For maintainers] Suggested jobs to run (before merge) run-slow: idefics3, smolvlm |
|
Perfect thanks! I'll merge once the CI passes |
|
The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update. |
…SmolVLM processors (huggingface#41871) * Fix default image_rows and image_cols initialization in Idefics3 and SmolVLM processors * Fix default initialization of image_rows and image_cols in Idefics3 and SmolVLM processors
…SmolVLM processors (huggingface#41871) * Fix default image_rows and image_cols initialization in Idefics3 and SmolVLM processors * Fix default initialization of image_rows and image_cols in Idefics3 and SmolVLM processors
What does this PR do?
This PR fixes a bug in the default initialization of
image_rowsandimage_colsin the Idefics3 and SmolVLM processors.Problem
The original code incorrectly used
len(text)to determine the size of the defaultimage_rowsandimage_colslists:This creates a single list with length equal to the number of text samples in the batch, which is incorrect. The correct behavior should create a list for each sample, where each inner list has length equal to the number of image tokens in that specific sample.
Solution
Changed the default initialization to count image tokens per sample:
This ensures that:
Impact
This fix affects:
Idefics3Processorinsrc/transformers/models/idefics3/processing_idefics3.pySmolVLMProcessorinsrc/transformers/models/smolvlm/processing_smolvlm.pyThe bug would have caused issues when processing batches with multiple samples or samples with multiple images, as the default values would have incorrect dimensions.
Before submitting
Pull Request section?
to it if that's the case.
documentation guidelines, and
here are tips on formatting docstrings.
Who can review?
@zucchini-nlp (multimodal models)