VLM: fixes after refactor #32907
Conversation
The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.
This is a very big PR that touches a lot of modeling code in different ways. I just started reviewing, but I think we should split up the changes here: switching to `num_image_tokens` is independent of fixing tests. This will make it easier to make sure the changes here are sensible.
A few comments:
- Why weren't the breaks caught previously? Was it lack of tests or the test fetcher? If lack of tests, tests should be added which will catch this
- I'm not sure about this switch to `num_image_tokens` -- can we guarantee this is always fixed? We need to properly handle this change in a non-breaking way
- Slow tests should be run for all these models
patches_height = int(math.sqrt(self.num_image_tokens))
patches_width = int(math.sqrt(self.num_image_tokens))
This now assumes the image is square
The image after the vision tower is square, if I'm not mistaken? Or are there any vision backbones that have a non-square format?
I am thinking of a way to not rely on the patch_size/image_size args, because it turned out that some models add a CLS token and some do not. But I may still be missing some edge cases.
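For illustration, a minimal sketch of why inferring the grid from image_size/patch_size is fragile while an explicit `num_image_tokens` handles both cases (the backbone behavior and the 336/14 numbers are assumptions for illustration, not taken from this PR):

```python
import math

# Assumed example values, not the PR's actual config
image_size, patch_size = 336, 14
num_patches = (image_size // patch_size) ** 2  # 576 patch embeddings

# A CLIP-style tower prepends a CLS token, a SigLIP-style tower does not
num_tokens_with_cls = num_patches + 1  # 577 -> no longer a clean square grid
num_tokens_without_cls = num_patches   # 576

# With an explicit num_image_tokens (set by the user for their vision tower),
# the processor can recover the square patch grid directly:
num_image_tokens = 576
patches_height = patches_width = int(math.sqrt(num_image_tokens))  # 24 x 24
```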
OK, I will separate these into two PRs, and then I have another one which adds generation tests in VLMs. That will be a third PR.
Force-pushed from 42042cc to 70a70b1
To be reviewed and merged after #33168
@zucchini-nlp OK - just ping when you want another review!
prompt_strings = text
if image_inputs:
    if self.patch_size is None or self.vision_feature_select_strategy is None:
        logger.warning_once(
            "Expanding inputs for image tokens in LLaVa-NeXT should be done in processing. "
            "Please add `patch_size` and `vision_feature_select_strategy` to the model's processing config or set directly "
            "with `processor.patch_size = {{patch_size}}` and `processor.vision_feature_select_strategy = {{vision_feature_select_strategy}}`. "
            "Using processors without these attributes in the config is deprecated and will throw an error in v4.47."
        )
This was needed for cases with multi-image inputs, where we cannot be sure that the number of image sizes is the same as the number of texts. For example, one text and two images.
- current_width = patches_height * scale_height
- current_height = patches_width * scale_width
+ current_height = patches_height * scale_height
+ current_width = patches_width * scale_width
The order was corrected in another PR for the modeling code, but I forgot to apply the same correction to the processing logic.
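To make the axis mix-up concrete, a minimal sketch with illustrative (assumed) numbers for a non-square patch grid:

```python
# Assumed example values for a non-square grid; not taken from the PR
patches_height, patches_width = 24, 12
scale_height, scale_width = 2, 3

# Old (swapped) order mixed the axes:
wrong_width = patches_height * scale_height   # 48, but labeled as width
wrong_height = patches_width * scale_width    # 36, but labeled as height

# Corrected order keeps each axis consistent:
current_height = patches_height * scale_height  # 48
current_width = patches_width * scale_width     # 36
```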
img_token_not_enough = (input_ids == self.config.image_token_index).sum(
    1
).max() < self.config.image_seq_length
video_token_not_enough = (input_ids == self.config.video_token_index).sum(
    1
).max() < self.config.video_seq_length
inputs_not_expanded = (img_token_not_enough and pixel_values is not None) or (
    video_token_not_enough and pixel_values_videos is not None
)
A better check for multiple visuals per prompt.
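As a rough illustration of what that per-sample check does (illustrative token ids and lengths, not the model's real config values):

```python
import torch

# Assumed example values
image_token_index, image_seq_length = 32000, 576
input_ids = torch.tensor([
    [1, 32000, 5, 6, 32000, 7],  # two image placeholders, not yet expanded
    [1, 8, 9, 10, 11, 12],       # no image placeholders
])

# Count image tokens per sequence, take the max over the batch; if even the
# largest count is below image_seq_length, the inputs were not expanded yet.
img_token_not_enough = (input_ids == image_token_index).sum(1).max() < image_seq_length
print(img_token_not_enough)  # tensor(True)
```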
# Copied from transformers.models.llava_next.processing_llava_next.LlavaNextProcessor._get_number_of_features
def _get_number_of_features(self, orig_height: int, orig_width: int, height: int, width: int) -> int:
    image_grid_pinpoints = self.image_processor.image_grid_pinpoints
We forgot about image inputs for this model, so I added a test.
@amyeroberts this is ready for review now, the blocking PR is merged. For the tests, the failing ones are SDPA-related, which is fixed in #32238 (needs a separate review, but not a blocker for the current PR). The tests are fixed so that we match the runner's output. I verified that the discrepancy appears after updating torch to the latest version, and the test was passing before the update.
Thanks for working on this fix!
There's just one outstanding question I have about the iterator logic for image_sizes
prompt_strings = []
for sample in text:
    while self.image_token in sample:
        image_size = next(image_sizes)
Is this right? The previous logic implies `len(image_sizes) == len(text)`; however, this implies we have the same number of image_sizes as the number of image tokens per sample * the number of samples.
Yes, I verified twice to be sure. The number of images == the number of image sizes. The newly added test `multiimage_expansion` fails with the previous logic.
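For context, a minimal sketch of how the iterator consumes one entry per image rather than one per sample (simplified names and a fixed number of features; not the processor's actual implementation):

```python
text = [
    "<image> and <image> side by side",  # two images in one prompt
    "just one <image> here",             # one image
]
# One size per image, so 3 entries for 2 prompts
image_sizes = iter([(336, 672), (336, 336), (672, 672)])

image_token = "<image>"
num_features = 4  # placeholder; the real value would be computed from image_size
prompt_strings = []
for sample in text:
    while image_token in sample:
        image_size = next(image_sizes)  # consumed per image token, not per sample
        sample = sample.replace(image_token, "<placeholder>" * num_features, 1)
    prompt_strings.append(sample.replace("<placeholder>", image_token))
```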
@amyeroberts do you have any other concerns regarding this PR?
Thanks for double-checking the logic!
* leave only half of the changes
* fix tests
* [run-slow] llava, llava_next, llava_next_video, vipllava, video_llava
* fix tests, first try
* [run-slow] llava, llava_next, llava_next_video, vipllava, video_llava
* fix, second try
* [run-slow] llava, llava_next, llava_next_video, vipllava, video_llava
* fix
* [run-slow] llava, llava_next, llava_next_video, vipllava, video_llava
What does this PR do?
After merging #30962 I noticed that some models stopped working in the multi-image/image-video setting. Also, `cache_position` wasn't being updated accordingly after merging image and text inputs in the legacy way. This PR addresses the issue and adds slow tests for that.

Also, I think we should use `num_image_tokens` instead of trying to infer it from the image size and patch size. The reason is that some image backbones add a CLS token (CLIP) and others don't (SigLIP). If we try to catch every edge case like this, we'll end up bloating the code, so it's better to let users set the number of tokens needed for one image after encoding with the vision tower. That way we can support different vision towers out of the box.

Ran slow tests to check that expanding matches legacy, plus the new tests and a few older tests. Also changed the video llava tests, as those weren't formatting the prompt correctly.
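A sketch of the intended usage, assuming the processor exposes the `num_image_tokens` attribute shown in the diff above (the checkpoint name and value here are illustrative only, following the same pattern as `processor.patch_size` in the deprecation warning):

```python
from transformers import AutoProcessor

# Illustrative checkpoint; any of the affected LLaVA-family processors would
# be configured the same way.
processor = AutoProcessor.from_pretrained("llava-hf/llava-1.5-7b-hf")
processor.num_image_tokens = 576  # tokens one image occupies after the vision tower
```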