
VLM: fixes after refactor #32907

Merged: 11 commits into huggingface:main on Sep 10, 2024

Conversation

@zucchini-nlp (Member) commented on Aug 21, 2024:

What does this PR do?

After merging #30962 I noticed that some models stopped working in multi-image/image-video settings. Also, cache_position wasn't being updated accordingly after merging image and text inputs the legacy way. This PR addresses those issues and adds slow tests for them.

Also, I think we should use num_image_tokens instead of trying to infer it from image size and patch size. The reason is that some image backbones add a CLS token (CLIP) and others don't (SigLIP). If we try to catch every edge case like this, we'll end up bloating the code, so it's better to let users set the number of tokens needed for one image after encoding with the vision tower. That way we can support different vision towers out of the box, as sketched below.
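
For illustration, a hedged sketch of that edge case. The helper below is illustrative, not code from this PR; the backbone numbers (CLIP ViT-L/14 at 336px, SigLIP so400m at 384px) are real published configs.

```python
def tokens_from_config(image_size: int, patch_size: int, adds_cls: bool) -> int:
    # Inferring the count from the vision config needs a per-backbone flag,
    # because CLIP-style towers prepend a CLS token while SigLIP does not.
    num_patches = (image_size // patch_size) ** 2
    return num_patches + 1 if adds_cls else num_patches

print(tokens_from_config(336, 14, adds_cls=True))   # CLIP ViT-L/14-336 -> 577
print(tokens_from_config(384, 14, adds_cls=False))  # SigLIP so400m-384 -> 729

# With the PR's approach the user states the count once on the processor
# instead (value shown is illustrative):
# processor.num_image_tokens = 576
```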

Ran slow tests to check that input expansion matches the legacy path, plus the new tests and a few older ones. Also changed the video llava tests, as those weren't formatting the prompt correctly.


@amyeroberts (Collaborator) left a comment:

This is a very big PR that touches a lot of modeling code in different ways. I've just started reviewing, but I think we should split up the changes here: switching to num_image_tokens is independent of fixing tests. This will make it easier to verify that the changes are sensible.

A few comments:

  • Why weren't the breaks caught previously? Was it lack of tests or the test fetcher? If lack of tests, tests should be added which will catch this
  • I'm not sure about this switch to num_image_tokens -- can we guarantee this is always fixed? We need to properly handle this change in a non-breaking way
  • Slow tests should be run for all these models

Two resolved review threads on src/transformers/models/llava/processing_llava.py (outdated)
Comment on lines 186 to 187
patches_height = int(math.sqrt(self.num_image_tokens))
patches_width = int(math.sqrt(self.num_image_tokens))
@amyeroberts (Collaborator):

This now assumes the image is square

@zucchini-nlp (Member Author):

The image after the vision tower is square, if I'm not mistaken? Or are there vision backbones with a non-square format?

I am trying to find a way to not rely on the patch_size/image_size args, because it turned out that some models add a CLS token and some do not. But I may still be missing some edge cases.
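
As a minimal sketch of the concern (the helper name is hypothetical, not the processor's actual method): taking int(math.sqrt(...)) silently truncates whenever the token count is not a perfect square, e.g. when a CLS token is included in the count.

```python
import math

def patch_grid(num_image_tokens: int) -> tuple[int, int]:
    side = int(math.sqrt(num_image_tokens))
    if side * side != num_image_tokens:
        # 577 (a 24x24 grid plus CLS) would otherwise truncate to side=24,
        # silently miscounting the image tokens.
        raise ValueError(f"{num_image_tokens} tokens do not form a square grid")
    return side, side

print(patch_grid(576))  # (24, 24): a square ViT grid works fine
# patch_grid(577) raises instead of quietly dropping a token
```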

Resolved review thread on tests/models/video_llava/test_modeling_video_llava.py (outdated)
@zucchini-nlp (Member Author):

OK, I will separate these into two PRs, and then I have another one which adds generation tests for VLMs; that will be a third PR.

  • Why weren't the breaks caught previously? Was it lack of tests or the test fetcher? If lack of tests, tests should be added which will catch this
    Yes, no tests existed for these cases; I added some more. I'll add one more for the llava-interleave checkpoint, as it uses SigLIP as the backbone.

  • I'm not sure about this switch to num_image_tokens -- can we guarantee this is always fixed? We need to properly handle this change in a non-breaking way
    Since we didn't change any checkpoints and the latest major release still doesn't include the new processing logic, this shouldn't cause BC errors. WDYT?

  • Slow tests should be run for all these models
    Yes, I will enable them in each new PR.

@zucchini-nlp (Member Author):

To be reviewed and merged after #33168

@amyeroberts (Collaborator):

@zucchini-nlp OK - just ping when you want another review!

Comment on lines +143 to +150
prompt_strings = text
if image_inputs:
    if self.patch_size is None or self.vision_feature_select_strategy is None:
        logger.warning_once(
            "Expanding inputs for image tokens in LLaVa-NeXT should be done in processing. "
            "Please add `patch_size` and `vision_feature_select_strategy` to the model's processing config or set directly "
            "with `processor.patch_size = {{patch_size}}` and `processor.vision_feature_select_strategy = {{vision_feature_select_strategy}}`. "
            "Using processors without these attributes in the config is deprecated and will throw an error in v4.47."
        )
@zucchini-nlp (Member Author):

This was needed for cases with multi-image inputs, where we cannot be sure that the number of image sizes equals the number of texts. For example, one text and two images.
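
A toy illustration of that case (the prompt format and placeholder string are assumed, not taken from this PR):

```python
# One text sample may contain several image placeholders, so image_sizes
# has to be consumed per image token rather than zipped with the texts.
text = ["USER: <image> vs <image> -- what differs? ASSISTANT:"]
image_sizes = [(336, 504), (672, 336)]  # two images for a single prompt

assert len(text) == 1 and len(image_sizes) == 2
```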

- current_width = patches_height * scale_height
- current_height = patches_width * scale_width
+ current_height = patches_height * scale_height
+ current_width = patches_width * scale_width
@zucchini-nlp (Member Author):

The order was corrected in another PR for the modeling code, but I forgot to apply the same correction to the processing logic.

Comment on lines +391 to +398
img_token_not_enough = (input_ids == self.config.image_token_index).sum(
    1
).max() < self.config.image_seq_length
video_token_not_enough = (input_ids == self.config.video_token_index).sum(
    1
).max() < self.config.video_seq_length
inputs_not_expanded = (img_token_not_enough and pixel_values is not None) or (
    video_token_not_enough and pixel_values_videos is not None
)
@zucchini-nlp (Member Author):

A better check for multiple visuals per prompt.
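
Roughly, the check counts image tokens per sample and compares the maximum count against the expected expanded length; a hedged toy version (the token id and sequence length are invented):

```python
import torch

IMAGE_TOKEN = 32000      # stand-in for config.image_token_index
IMAGE_SEQ_LENGTH = 576   # stand-in for config.image_seq_length

# Legacy-style batch: each prompt still holds bare image placeholders that
# processing has not expanded into IMAGE_SEQ_LENGTH image tokens.
input_ids = torch.tensor([
    [1, IMAGE_TOKEN, 5, 6, 7],
    [1, IMAGE_TOKEN, 8, 9, IMAGE_TOKEN],  # two images in one prompt
])

# Even the sample with the most image tokens falls short of the expanded
# length, so the inputs must still be merged the legacy way.
img_token_not_enough = (input_ids == IMAGE_TOKEN).sum(1).max() < IMAGE_SEQ_LENGTH
print(img_token_not_enough)  # tensor(True)
```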

Comment on lines +210 to +212
# Copied from transformers.models.llava_next.processing_llava_next.LlavaNextProcessor._get_number_of_features
def _get_number_of_features(self, orig_height: int, orig_width: int, height: int, width: int) -> int:
    image_grid_pinpoints = self.image_processor.image_grid_pinpoints
@zucchini-nlp (Member Author):

We forgot about image inputs for this model; I added a test.

@zucchini-nlp (Member Author):

@amyeroberts this is ready for review now; the blocking PR is merged. As for the tests, the failing ones are SDPA-related, which is fixed in #32238 (needs a separate review, but not a blocker for the current PR).

The tests are fixed so that we match the runner's output. I verified that the discrepancy appears after updating torch to the latest version; the tests were passing on 2.3.0.

@amyeroberts (Collaborator) left a comment:

Thanks for working on this fix!

There's just one outstanding question I have about the iterator logic for image_sizes

prompt_strings = []
for sample in text:
    while self.image_token in sample:
        image_size = next(image_sizes)
@amyeroberts (Collaborator):

Is this right? The previous logic implies len(image_sizes) == len(text); however, this implies we have as many image_sizes as the number of image tokens per sample times the number of samples.

@zucchini-nlp (Member Author):

Yes, I verified twice to be sure. The number of images == the number of image sizes. The newly added multiimage_expansion test fails with the previous logic.
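
In other words, the iterator yields one size per image token across the whole batch; a hedged sketch of that invariant (the placeholder string and expansion step are illustrative):

```python
image_token = "<image>"
text = [
    "USER: <image><image> compare these. ASSISTANT:",
    "USER: <image> describe this. ASSISTANT:",
]
# Three sizes for two text samples: one entry per image, not per sample.
image_sizes = iter([(336, 504), (672, 336), (336, 336)])

for sample in text:
    while image_token in sample:
        height, width = next(image_sizes)  # consumed once per image token
        # ...compute this image's expanded token count from its size...
        sample = sample.replace(image_token, "<placeholder>", 1)
```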

@zucchini-nlp (Member Author):

@amyeroberts do you have any other concerns regarding this PR?

@amyeroberts (Collaborator) left a comment:

Thanks for double checking the logic!

@zucchini-nlp merged commit 7d2d6ce into huggingface:main on Sep 10, 2024
17 checks passed
itazap pushed a commit to NielsRogge/transformers that referenced this pull request on Sep 20, 2024, with the squashed commit message:

* leave only half of the changes
* fix tests
* [run-slow] llava, llava_next, llava_next_video, vipllava, video_llava
* fix tests, first try
* [run-slow] llava, llava_next, llava_next_video, vipllava, video_llava
* fix, second try
* [run-slow] llava, llava_next, llava_next_video, vipllava, video_llava
* fix
* [run-slow] llava, llava_next, llava_next_video, vipllava, video_llava
amyeroberts pushed a commit to amyeroberts/transformers that referenced this pull request on Oct 2, 2024, with the same squashed commit message as above.