VLMs: patch_size -> num_image_tokens in processing #33424
Conversation
The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.
Thanks for opening this PR @zucchini-nlp!

I know this has been discussed elsewhere, but I can't remember exactly where; if you find it, could you link to it in this PR's description?

I still don't think `num_image_tokens` is a great solution. I would have assumed the number of image tokens could be variable based on the input image size, which `patch_size` accounts for but `num_image_tokens` does not.

Assuming this is safe because of some of the backbones used with current checkpoints is an unreliable assumption. Part of the appeal of a lot of these VLM architectures is that different vision towers can be swapped out easily, so it's very difficult to have strong assumptions about their properties. One option is to try and enforce these properties through validation.

I know part of the motivation is that the patch_size calculations are also not reliable, although I don't think I've understood why adding the CLS token affects this. It would help to see an example showing where the issue is.
        prompt_strings = text
repeated in the next lines
    image_processor_class = "AutoImageProcessor"
    tokenizer_class = "AutoTokenizer"

    def __init__(
        self,
        image_processor=None,
        tokenizer=None,
        patch_size=None,
For backwards compatibility, we can't just remove accepting patch_size immediately. We need to deprecate it and not allow both `patch_size` and `num_image_tokens` to be specified.

You can use the `@deprecate_kwarg` decorator to handle this.
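A minimal sketch of what that could look like, assuming the decorator's usual usage in the library; the processor name, the target version, and the mutual-exclusion check are placeholders, not the merged code:

```python
from transformers.processing_utils import ProcessorMixin
from transformers.utils.deprecation import deprecate_kwarg


class MyProcessor(ProcessorMixin):
    attributes = ["image_processor", "tokenizer"]
    image_processor_class = "AutoImageProcessor"
    tokenizer_class = "AutoTokenizer"

    # warn whenever the deprecated kwarg is passed; the version is a placeholder
    @deprecate_kwarg("patch_size", version="4.47")
    def __init__(self, image_processor=None, tokenizer=None, num_image_tokens=None, patch_size=None, **kwargs):
        if patch_size is not None and num_image_tokens is not None:
            raise ValueError("Specify only one of `patch_size` or `num_image_tokens`, not both.")
        self.num_image_tokens = num_image_tokens
        self.patch_size = patch_size  # still honored until removal
        super().__init__(image_processor, tokenizer, **kwargs)
```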
        vision_feature_select_strategy=None,
        chat_template=None,
        image_token="<image>",  # set the default and let users change if they have peculiar special tokens in rare cases
        **kwargs,
    ):
        self.patch_size = patch_size
Same here - we need to deprecate this property
        unpadded_features, newline_features = self._get_unpadded_features(
            orig_height, orig_width, patches_height, patches_width, scale_height, scale_width
        )
        # The base patch covers the entire image (+1 for the CLS)
-       base_features = patches_height * patches_width + 1
+       base_features = self.num_image_tokens
I don't think the numbers quite match here. Even if we assume the patches are square, this also assumes the image is square, which might not be the case.

Let's say the size of the image is 22x18 and the patch size is 4. In the first case:

patches_height = 22 // 4 = 5
patches_width = 18 // 4 = 4
base_features = patches_height * patches_width + 1 = 21

If we then work backwards:

num_image_tokens = 21
patches_height = patches_width = int(math.sqrt(self.num_image_tokens)) = 4

So this conversion isn't backwards compatible, and I think it's also less correct.
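A quick script reproducing the mismatch with the same numbers:

```python
import math

patch_size = 4
orig_height, orig_width = 22, 18

# forward: tokens from the patch grid, +1 for the CLS token
patches_height = orig_height // patch_size          # 5
patches_width = orig_width // patch_size            # 4
base_features = patches_height * patches_width + 1  # 21

# backwards: recover the grid from num_image_tokens, assuming a square grid
num_image_tokens = base_features
side = int(math.sqrt(num_image_tokens))             # int(sqrt(21)) == 4

print(base_features)  # 21
print(side, side)     # 4 4: the original 5x4 grid is lost in the round trip
```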
Yes, right. But also I don't think this is possible in available llava-next checkpoints.

Yes, you're right that we are trying to make this work for currently existing vision models, and there may be other models that do not have a hard requirement for a square image and thus square patches. The problem with CLS is that the number of image tokens is now computed as `patches_height * patches_width + 1`, with the CLS hardcoded. Since we have all VLMs on CLIP-like vision towers, I think we'll be safe in general either way we choose to implement.

But apart from that, my main idea was that if some fancy architecture comes out where we won't be able to infer the number of tokens from image/patch size, then we can safely delegate setting the correct number of image tokens for whatever vision model is used to the users. I see this option as safer (imo) and more aligned with models like BLIP or Paligemma, where we don't do special calculations to get how many image tokens are needed.

What do you think if we leave `patch_size` as a kwarg only for llava-next, given that it is more quirky and requires knowing how many patches each image is divided into? For all others we go with `num_image_tokens`.
This seems to me like we're engineering for a problem which doesn't yet exist.

My main reservation with this is that yes, we can safely delegate the number of tokens, but we can no longer safely calculate patch_size. Would it be possible to have both arguments, where if both are set we verify they're compatible? As I'm off, I'll delegate this decision to @qubvel. I'll be happy with whatever you both decide!
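For illustration, a hypothetical cross-check along those lines; the function name and the CLS offset are assumptions, not the design that was merged:

```python
def check_consistency(patch_size: int, num_image_tokens: int, image_size: int, num_special_tokens: int = 1) -> None:
    # for a square image: one token per patch, plus special tokens such as CLS
    expected = (image_size // patch_size) ** 2 + num_special_tokens
    if num_image_tokens != expected:
        raise ValueError(
            f"`num_image_tokens={num_image_tokens}` is incompatible with "
            f"`patch_size={patch_size}` at image_size={image_size} (expected {expected})."
        )
```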
Hi @qubvel! Can you look at the last discussion we had with Amy when you have time? :)
I tried to summarize the pros and cons of the existing options; feel free to add if I missed anything.
@qubvel right, we can categorize VLM image backbones into three types. This means we can't use one version of processing for all VLMs to get the number of image tokens, and I am still more inclined to `num_image_tokens`.

At the end we'll do… Unfortunately I can't come up with a better solution that will work out of the box for all possible VLMs in all cases, but imo we don't have to, as long as the current approach works with the current models and is consistent library-wise.
Thanks for the very detailed clarification! Indeed it looks like processors' logic can vary significantly across models; the questions here are…

My suggestion was to use:

```python
# for square images
num_image_tokens = (image_size / patch_size) ** 2 + num_special_tokens
```

For such a case we might dynamically compute the number of image tokens based on the resulting image size and patch_size to support interpolation. TBH, I'm not sure how it scales up to all architectures, it's just an idea to take into consideration 😄 I'm happy to go with your option too! And maybe just more examples/docs will be enough to make users comfortable!
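A runnable version of that idea as a hypothetical helper (not library code); the example numbers match a CLIP-ViT-L/14 tower at 336px and a SigLIP-style tower at 384px:

```python
def get_num_image_tokens(image_size: int, patch_size: int, num_special_tokens: int = 1) -> int:
    # square image assumed: one token per patch, plus special tokens (e.g. CLS)
    return (image_size // patch_size) ** 2 + num_special_tokens


print(get_num_image_tokens(336, 14))                        # 577 (24*24 patches + CLS)
print(get_num_image_tokens(384, 14, num_special_tokens=0))  # 729 (27*27 patches, no CLS)
```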
Unfortunately, relying on only one way to have uniform/same kwargs is impossible at the moment, since models have different requirements for the vision backbone. Though we can use the… Let me then add the `num_additional_image_tokens` argument.
Slow tests for expansion are all passing locally.
Thanks, looks good to me!

It'd be great to add some docs regarding where users should look for the `num_additional_image_tokens` param. Should we add it as a config property/param?
Yeah, docs will be much needed given we already had some users asking how to suppress the warning. I don't know if there is a place where we can add info about VLMs in more detail, so I am thinking of adding a WARNING section to the model doc page for each touched model. Glad there are only ~5 models.
Done, added the docs. LMK if it looks good to you and I'll request review from a core maintainer. I hope after this PR we can update hub files and remove the extra logic by v4.50 (the deprecation target will be modified in #34245).
LGTM, thanks!
I am a bit out of the loop, so I need some clarification!
"with `processor.patch_size = {{patch_size}}` and processor.vision_feature_select_strategy = {{vision_feature_select_strategy}}`. " | ||
"Please add `patch_size`, `num_additional_image_tokens` and `vision_feature_select_strategy` to the model's processing config or set directly " | ||
"with `processor.patch_size = {{patch_size}}`, `processor.num_additional_image_tokens = {{num_additional_image_tokens}}` " | ||
"and processor.vision_feature_select_strategy = {{vision_feature_select_strategy}}`. " | ||
"Using processors without these attributes in the config is deprecated and will throw an error in v4.44." |
This was released a long time ago, we can probably ignore it now, no?
ok
"with `processor.patch_size = {{patch_size}}` and processor.vision_feature_select_strategy = {{vision_feature_select_strategy}}`. " | ||
"Please add `patch_size`, `num_additional_image_tokens` and `vision_feature_select_strategy` to the model's processing config or set directly " | ||
"with `processor.patch_size = {{patch_size}}`, `processor.num_additional_image_tokens = {{num_additional_image_tokens}}` " | ||
"and processor.vision_feature_select_strategy = {{vision_feature_select_strategy}}`. " | ||
"Using processors without these attributes in the config is deprecated and will throw an error in v4.47." |
we are in 4.47
"with `processor.patch_size = {{patch_size}}` and processor.vision_feature_select_strategy = {{vision_feature_select_strategy}}`. " | ||
"Please add `patch_size`, `num_additional_image_tokens` and `vision_feature_select_strategy` to the model's processing config or set directly " | ||
"with `processor.patch_size = {{patch_size}}`, `processor.num_additional_image_tokens = {{num_additional_image_tokens}}` " | ||
"and processor.vision_feature_select_strategy = {{vision_feature_select_strategy}}`. " | ||
"Using processors without these attributes in the config is deprecated and will throw an error in v4.47." |
same here (unless this was not released!)
@@ -70,12 +79,14 @@ def __init__(
     image_processor=None,
     tokenizer=None,
     patch_size=None,
+    num_additional_image_tokens=0,
should probably be placed at the end for BC as well 😉
Also, it was set to 1 (I mean `num_image_tokens -= 1`); now it will use 0.
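If I read the concern right, a rough sketch of the arithmetic; these are illustrative numbers, not the exact processor code path:

```python
patches = (336 // 14) * (336 // 14)  # 576 patch tokens

# before: the CLS token was always added, then removed for the "default" strategy
num_image_tokens = patches + 1       # 577
num_image_tokens -= 1                # 576

# after: num_additional_image_tokens defaults to 0 instead of the implicit 1,
# so the same code path would yield a different count for existing checkpoints
num_image_tokens = patches + 0       # 576
num_image_tokens -= 1                # 575: the BC break being flagged
```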
this one is important
Oke, will fix it and merge after making sure tests pass.
@@ -158,8 +171,9 @@ def __call__(
     else:
         logger.warning_once(
this will be in 4.47
@@ -51,6 +51,9 @@ class LlavaProcessor(ProcessorMixin):
         The tokenizer is a required input.
     patch_size (`int`, *optional*):
         Patch size from the vision tower.
+    num_additional_image_tokens (`int`, *optional*, defaults to 0):
+        Number of additional tokens added to the image embeddings, such as CLS (+1). If the backbone has no CLS or other
Where are they added (to which image embedding, and why)? Also, I am not even sure I understand the usage of this myself. As it will be in all processors, can you explain a bit more why someone should set this?
Yes, the very first reason for this PR is that some ViT backbones add a CLS token to the image patches, while others do not. Therefore our current processors can't work with SigLIP, because we hardcoded the CLS token addition in code.

We had two options to fix it:

- What I proposed initially was to simply use a `num_image_tokens` arg and let users specify any amount, so we don't have to infer how many patches there will be from `patch_size` and `image_size`. Some processors already do that, like Paligemma. It had drawbacks though: we cannot work with non-square images in that case, because for llava-next padding/unpadding is needed, and that depends on `patch_size` and `image_size`. I realize we have no models that accept non-square images, but that was the strongest objection from the core maintainer against `num_image_tokens`.
- So after discussions with Pavel, we decided to make as few changes as possible and still support SigLIP by adding `num_additional_image_tokens`. It will be 1 if the vision tower adds a CLS token, and 0 otherwise. It can be more than 1 in case there is a vision tower with extra tokens added on top of the image patches (see the sketch below).
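A small sketch of how the new argument feeds into the placeholder token count; the sizes are illustrative values for a square input, not hard-coded library defaults:

```python
height = width = 336
patch_size = 14


def expanded_length(num_additional_image_tokens: int) -> int:
    # one placeholder token per patch, plus any extra tokens (CLS etc.)
    return (height // patch_size) * (width // patch_size) + num_additional_image_tokens


print(expanded_length(1))  # 577: CLIP-like tower that prepends a CLS token
print(expanded_length(0))  # 576: SigLIP-like tower with no CLS token
```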
Ok, thanks for explaining!
just 1 place to fix
Cool, merging soon, and I'll update hub configs this week for llava-hf models + open PRs for BLIP models. We've had the deprecation warning for 3-4 releases, so we should be fine with updating configs.
VLMs: patch_size -> num_image_tokens in processing (#33424)

* use num additional tokens
* fix copies + docs
* another fix copies :)
* add docs
* move order for BC
What does this PR do?

Following #33374, we'll use `num_image_tokens` instead of `patch_size` in VLM processors. The reason is that some image backbones add a CLS token while others do not, and we cannot reliably infer the number of tokens needed for an image from the number of patches.

Yes, as noted in some previous discussions, we'll have to assume for llava-next style models that the image is a square and obtain the patch width/height from `num_image_tokens`. If I'm not mistaken, the backbones we have in the library that are used by VLMs are all square sized, and since the llava-next class is not used with backbones other than CLIP afaik, we should be good with that assumption.

Modified tests are passing locally for me.
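For reference, the square-grid assumption described above amounts to something like this; a sketch under those assumptions, not the exact merged code:

```python
import math

num_image_tokens = 577           # e.g. CLIP-ViT-L/14 @ 336px, CLS included
num_additional_image_tokens = 1  # the CLS token

# assume a square patch grid and recover its side length from the token count
side = int(math.sqrt(num_image_tokens - num_additional_image_tokens))
patches_height = patches_width = side
print(patches_height, patches_width)  # 24 24
```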