VLMs: patch_size -> num_image_tokens in processing #33424

Merged · 9 commits into huggingface:main · Nov 18, 2024

Conversation

@zucchini-nlp (Member) commented on Sep 11, 2024:

What does this PR do?

Following #33374, we'll use num_image_tokens instead of patch_size in VLM processors. The reason is that some image backbones add a CLS token while others do not, so we cannot reliably infer the number of tokens needed for an image from the number of patches.

Yes, as noted in some previous discussions, for llava-next style models we'll have to assume the image is square and obtain the patch width/height from num_image_tokens. If I'm not mistaken, the backbones in the library that are used by VLMs all work on square images, and since the llava-next class is not used with backbones other than CLIP afaik, we should be fine with that assumption.
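
For llava-next, the square assumption boils down to something like this minimal sketch (illustrative only, not the exact diff; 576 corresponds to the 24x24 patch grid of a CLIP ViT-L/14 tower at 336px):

```python
import math

# Recover the patch grid from the number of image tokens instead of deriving
# it from patch_size, assuming the backbone produces a square grid of patches.
def patch_grid_from_num_image_tokens(num_image_tokens: int) -> tuple[int, int]:
    side = int(math.sqrt(num_image_tokens))
    return side, side

print(patch_grid_from_num_image_tokens(576))  # (24, 24)
```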

Modified tests are passing locally for me

@zucchini-nlp changed the title from "use num_image_tokens" to "VLMs: patch_size -> num_image_tokens in processing" on Sep 11, 2024
@HuggingFaceDocBuilderDev

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.

@amyeroberts (Collaborator) left a comment:

Thanks for opening this PR @zucchini-nlp!

I know this has been discussed elsewhere - but I can't remember exactly where - if you find it, could you link to it in this PR's description?

I still don't think num_image_tokens is a great solution -- I would have assumed the number of image tokens could be variable based on the input image size, which patch_size accounts for but num_image_tokens does not.

Assuming this is safe because of some of the backbones used with current checkpoints is an unreliable assumption. Part of the appeal of many of these VLM architectures is that different vision towers can be swapped out easily, so it's very difficult to make strong assumptions about their properties. One option is to try to enforce these properties through validation.

I know part of the motivation is that patch_size calculations are also not reliable, although I don't think I've understood why adding the CLS token affects this. It would help to have an example showing where the issue is.

Comment on lines 138 to 139
prompt_strings = text

Collaborator:

repeated in the next lines

image_processor_class = "AutoImageProcessor"
tokenizer_class = "AutoTokenizer"

def __init__(
self,
image_processor=None,
tokenizer=None,
patch_size=None,

Collaborator:

For backwards compatibility, we can't just stop accepting patch_size immediately. We need to deprecate it and not allow both patch_size and num_image_tokens to be specified.

You can use the @deprecate_kwarg decorator to handle this
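
A rough sketch of what that could look like (illustrative only; the exact decorator signature and target version should be checked against transformers.utils.deprecation, and the validation below is just one way to forbid passing both kwargs):

```python
from transformers.utils.deprecation import deprecate_kwarg


class ExampleProcessor:
    # Sketch: warn when the deprecated `patch_size` kwarg is passed and refuse
    # to accept both the old and the new kwarg at the same time.
    @deprecate_kwarg("patch_size", version="4.47")
    def __init__(self, image_processor=None, tokenizer=None, patch_size=None, num_image_tokens=None, **kwargs):
        if patch_size is not None and num_image_tokens is not None:
            raise ValueError("Pass either `patch_size` or `num_image_tokens`, not both.")
        self.num_image_tokens = num_image_tokens
```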

vision_feature_select_strategy=None,
chat_template=None,
image_token="<image>", # set the default and let users change if they have peculiar special tokens in rare cases
**kwargs,
):
self.patch_size = patch_size

Collaborator:

Same here - we need to deprecate this property

unpadded_features, newline_features = self._get_unpadded_features(
orig_height, orig_width, patches_height, patches_width, scale_height, scale_width
)
# The base patch covers the entire image (+1 for the CLS)
base_features = patches_height * patches_width + 1
base_features = self.num_image_tokens

Collaborator:

I don't think the numbers quite match here. Even if we assume the patches are square, this also assumes the image is square, which might not be the case.

Let's say the image size is 22x18 and the patch size is 4.

In the first case:

patches_height = 22 // 4 = 5
patches_width = 18 // 4 = 4

base_features = patches_height * patches_width + 1
              = 21 

If we then work backwards

num_image_tokens = 21
patches_height = patches_width = int(math.sqrt(self.num_image_tokens)) = 4

So this conversion isn't backwards compatible and I think it's also less correct
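
The same round trip in plain Python, just reproducing the numbers above:

```python
import math

patch_size = 4
orig_height, orig_width = 22, 18

# Forward: patch grid -> base features (with the +1 for CLS)
patches_height = orig_height // patch_size           # 5
patches_width = orig_width // patch_size             # 4
base_features = patches_height * patches_width + 1   # 21

# Backward: num_image_tokens -> assumed square grid
side = int(math.sqrt(base_features))                 # 4, the original 5x4 grid is lost
print(base_features, side)                           # 21 4
```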

@zucchini-nlp (Member, author):

Yes, right, but I also don't think this is possible with the available llava-next checkpoints.

@zucchini-nlp (Member, author):

@amyeroberts

Yes, you're right that we're trying to make this work for currently existing vision models, and there may be other models that have no hard requirement for a square image and thus a square patch grid.

The problem with CLS is that the number of image tokens is currently computed as patches_height * patches_width in all cases, and we add +1 by default to account for the CLS. But not all existing vision models add a CLS token.

Since all our VLMs use a CLIP-like vision tower, I think we'll be safe in general whichever way we implement it. But apart from that, my main idea was that some fancy architecture might come out where we can't infer the number of tokens from the image/patch size. In that case we can safely delegate setting the correct number of image tokens for whatever vision model is used to the users. I see this option as safer (imo) and more aligned with models like BLIP or Paligemma, where we don't do special calculations to get how many image tokens are needed.
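
For reference, the fixed-length expansion those processors do is roughly the following (a simplified sketch, not the actual Paligemma/BLIP code; the values are illustrative):

```python
# Sketch: the processor is told how many placeholder tokens one image needs
# and simply repeats them, with no patch_size / image-size arithmetic.
image_token = "<image>"
image_seq_length = 576  # configured per checkpoint, not computed

def expand_image_tokens(text: str) -> str:
    return text.replace(image_token, image_token * image_seq_length)

prompt = expand_image_tokens("<image>\nWhat is shown in this image?")
print(prompt.count(image_token))  # 576
```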

What do you think about leaving patch_size as a kwarg only for llava-next, given that it is more quirky and requires knowing how many patches each image is divided into? For all others we go with num_image_tokens to be on the safer side.

@amyeroberts (Collaborator):

> But apart from that, my main idea was that some fancy architecture might come out where we can't infer the number of tokens from the image/patch size. In that case we can safely delegate setting the correct number of image tokens for whatever vision model is used to the users. I see this option as safer (imo) and more aligned with models like BLIP or Paligemma, where we don't do special calculations to get how many image tokens are needed.

This seems to me like we're engineering for a problem which doesn't yet exist.

> For all others we go with num_image_tokens to be on the safer side.

My main reservation with this is that yes, we can safely delegate the number of tokens, but we can no longer safely calculate patch_size.

Would it be possible to have both arguments? Where if both are set we verify they're compatible?

As I'm off I'll delegate this decision to @qubvel. I'll be happy with whatever you both decide!

@zucchini-nlp (Member, author):

Hi @qubvel!

Can you take a look at the last discussion Amy and I had when you have time? :)

@qubvel (Member) commented on Oct 8, 2024:

I tried to summarize the pros and cons of existing options, feel free to add if I missed anything

1. patch_size and default +1 for CLS token

+ Support for dynamic image size
- Not all architectures add a CLS token, so we cannot reliably infer num_image_tokens from patch_size
- [Potential] There might be an architecture where we can't infer num_image_tokens based on patch_size and the CLS token (examples?)

2. num_image_tokens

+ Does not depend on the CLS token and can support a custom num_image_tokens
- Still need patch_size for some processors, so we have the same problem: we cannot reliably infer one from the other
- From the user's point of view it might not be obvious how to get num_image_tokens, while patch_size can be taken directly from the config

So neither of them is ideal 🥲 Are there any other options? Maybe we can add something like num_special_tokens to the configs, wdyt?

@zucchini-nlp (Member, author):

@qubvel right, patch_size gave us the flexibility to work with non-square images, but we can't reliably get the number of image tokens in some cases. I did a small overview of what types of VLMs are out there, and

we can categorize VLM image backbones into three types:

  • perceiver-like models have a fixed image seq length (Idefics, BLIP), so we are good here
  • models that use a simple ViT architecture and optionally anyres splitting/pixel shuffle. We can be sure about the image seq length in all cases if position interpolation is off. AFAIK none of the existing VLM papers use interpolation; they rather try anyres/multiple encoders etc. for high resolutions. We can also use patch_size with the same level of certainty, and might need extra kwargs for specific models (ConvNeXt, InternViT) to account for downsampling
  • flexible image size models where the image is resized smartly, like in Qwen2-VL. These models resize the image so that it fits one of the possible shapes while preserving as much of the original image as possible. We can't use a fixed image seq length for them; we have to infer the number of tokens from each image's size and other params like the downsample ratio or merge size

This means we can't use one version of processing to get the number of image tokens for all VLMs. And I am still more inclined towards num_image_tokens because

  • we already use image_seq_length in other VLMs like Paligemma, Idefics, Chameleon and BLIP instead of trying to get the number of tokens with heuristics
  • regarding the above concerns with non-square images, it seems like we're trying to account for something the model checkpoints don't support. In other words, non-square images are possible for llava only when ViT positions are interpolated, which none of the models use
  • we will still have one model that needs special treatment: Qwen2-VL has a dynamic seq length for each image depending on its size, so we leave it as is
  • yeah, from the user's perspective it seems easier to pass in patch_size and hope everything works like magic. We are actually going to add self._get_image_features for all VLMs to make them more modular. Maybe users who want to tinker can rely on that method later if they decide to change ViT patch sizes?

In the end we'd use num_image_tokens or image_seq_length (to be aligned with existing code) for models where pixel shapes are fixed based on crop size/resize. And for those with dynamic resize, we'll also need to calculate the image seq length dynamically. WDYT?

Unfortunately I can't come up with a better solution that works out of the box for all possible VLMs in all cases, but imo we don't have to, as long as the current approach works with the current models and is consistent library-wise.

@qubvel (Member) commented on Oct 10, 2024:

Thanks for the very detailed clarification! Indeed it looks like processor logic can vary significantly across models; the questions here are

  1. how to unify processor logic (it would be easier for users and pipeline integration)
  2. how to make a reliable method to compute image_tokens/patch_size

My suggestion was to use num_special_tokens or num_additional_tokens instead of the hardcoded +1. In that case, we have to provide two arguments, patch_size and num_special_tokens (it could be 0/1/2/... and could be a property of the model config, computed dynamically based on whether the CLS token is enabled, for example):

# for square images
num_image_tokens = (image_size / patch_size) ** 2 + num_special_tokens

In such a case we could dynamically compute the number of image tokens based on the resulting image size and patch_size to support interpolation. TBH, I'm not sure how it scales to all architectures; it's just an idea to take into consideration 😄
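
For example, plugging in the CLIP ViT-L/14 tower used by the current llava checkpoints (336px input, 14px patches, one CLS token) gives the familiar count:

```python
image_size = 336
patch_size = 14
num_special_tokens = 1  # CLS

num_image_tokens = (image_size // patch_size) ** 2 + num_special_tokens
print(num_image_tokens)  # 577 = 24 * 24 patches + 1 CLS
```

(With vision_feature_select_strategy="default" the CLS hidden state is later dropped, so 576 placeholder tokens end up in the prompt.)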

I'm happy to go with your option too! And maybe just more examples/docs will be enough to make users comfortable!

@zucchini-nlp (Member, author):

> how to unify processor logic (it would be easier for users and pipeline integration)

Unfortunately, having uniform/same kwargs with only one approach is impossible at the moment, since models have different requirements for the vision backbone. Though we can use patch_size + num_additional_tokens for VLMs with a ViT backbone and image_seq_length for the others where the length is fixed.

Let me then add num_additional_tokens; it looks good to me and is maybe less breaking (even though we haven't updated the model checkpoints yet).

@zucchini-nlp (Member, author):

Slow tests for expansion are all passing locally.

@qubvel (Member) left a comment:

Thanks, looks good to me!
It'd be great to add some docs on where users should look for the num_additional_image_tokens param. Should we add it as a config property/param?

@zucchini-nlp (Member, author):

Yeah, docs will be much needed given that we've already had some users asking how to suppress the warning. I don't know if there is a place where we can add more detailed info about VLMs, so I am thinking of adding a WARNING section to the model doc page for each touched model. Glad there are only ~5 models.

@zucchini-nlp (Member, author):

Done, added the docs. LMK if it looks good to you and I'll request review from a core maintainer. I hope after this PR we can update hub files and remove the extra logic by v4.50 (the deprecation target will be modified in #34245).

@qubvel (Member) left a comment:

LGTM, thanks!

@ArthurZucker (Collaborator) left a comment:

I am a bit out of the loop, so I need some clarification!

"with `processor.patch_size = {{patch_size}}` and processor.vision_feature_select_strategy = {{vision_feature_select_strategy}}`. "
"Please add `patch_size`, `num_additional_image_tokens` and `vision_feature_select_strategy` to the model's processing config or set directly "
"with `processor.patch_size = {{patch_size}}`, `processor.num_additional_image_tokens = {{num_additional_image_tokens}}` "
"and processor.vision_feature_select_strategy = {{vision_feature_select_strategy}}`. "
"Using processors without these attributes in the config is deprecated and will throw an error in v4.44."

Collaborator:

this was released a long time ago, we can probably ignore it now, no?

@zucchini-nlp (Member, author):

There is a PR from @gante (#34245) that moves the deprecation target a few more releases up, so we have time to update configs on the hub and for users to get used to the new format.

Imo we need at least one more release before removing it totally.

Collaborator:

ok

"with `processor.patch_size = {{patch_size}}` and processor.vision_feature_select_strategy = {{vision_feature_select_strategy}}`. "
"Please add `patch_size`, `num_additional_image_tokens` and `vision_feature_select_strategy` to the model's processing config or set directly "
"with `processor.patch_size = {{patch_size}}`, `processor.num_additional_image_tokens = {{num_additional_image_tokens}}` "
"and processor.vision_feature_select_strategy = {{vision_feature_select_strategy}}`. "
"Using processors without these attributes in the config is deprecated and will throw an error in v4.47."
Collaborator:

we are in 4.47

"with `processor.patch_size = {{patch_size}}` and processor.vision_feature_select_strategy = {{vision_feature_select_strategy}}`. "
"Please add `patch_size`, `num_additional_image_tokens` and `vision_feature_select_strategy` to the model's processing config or set directly "
"with `processor.patch_size = {{patch_size}}`, `processor.num_additional_image_tokens = {{num_additional_image_tokens}}` "
"and processor.vision_feature_select_strategy = {{vision_feature_select_strategy}}`. "
"Using processors without these attributes in the config is deprecated and will throw an error in v4.47."
Collaborator:

same here (unless this was not released!)

@@ -70,12 +79,14 @@ def __init__(
image_processor=None,
tokenizer=None,
patch_size=None,
num_additional_image_tokens=0,

Collaborator:

should probably be placed at the end for BC as well 😉

Collaborator:

also, it was effectively set to 1 before (I mean the num_image_tokens -= 1 path); now it will use 0

Collaborator:

this one is important

@zucchini-nlp (Member, author):

Okay, will fix it and merge after making sure tests pass.

@@ -158,8 +171,9 @@ def __call__(
else:
logger.warning_once(

Collaborator:

this will be in 4.47

@@ -51,6 +51,9 @@ class LlavaProcessor(ProcessorMixin):
The tokenizer is a required input.
patch_size (`int`, *optional*):
Patch size from the vision tower.
num_additional_image_tokens (`int`, *optional*, defaults to 0):
Number of additional tokens added to the image embeddings, such as CLS (+1). If the backbone has no CLS or other extra tokens appended, no need to set this arg.

Collaborator:

Added where (to which image embeddings, and why)? I am not even sure I understand the usage of this myself. As it will be in all processors, can you explain a bit more why someone would need to set this?

@zucchini-nlp (Member, author):

Yes, the very first reason for this PR is that some ViT backbones add a CLS token to the image patches, while others do not. Therefore our current processors can't work with SigLIP, because the CLS token addition is hardcoded.

We had two options to fix it:

  • what I proposed initially was to simply use a num_image_tokens arg and let users specify any amount, so we don't have to infer how many patches there will be from patch_size and image_size. Some processors already do that, like Paligemma
  • it had the drawback that we cannot work with non-square images in that case, because llava-next needs padding/unpadding, which depends on patch_size and image_size. I realize we have no models that accept non-square images, but that was the strongest objection from the core maintainer against num_image_tokens

So after discussions with Pavel, we decided to make as few changes as possible and still support SigLIP by adding num_additional_image_tokens. It will be 1 if the vision tower adds a CLS token, and otherwise 0. It can be more than 1 in case any vision tower adds extra tokens on top of the image patches.
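
Roughly, the placeholder count in the LLaVA-style processors then comes out as in this sketch (simplified; not the exact merged code):

```python
# Sketch: how many <image> placeholders to insert for one image.
def count_image_placeholders(height, width, patch_size, num_additional_image_tokens,
                             vision_feature_select_strategy):
    num_image_tokens = (height // patch_size) * (width // patch_size) + num_additional_image_tokens
    if vision_feature_select_strategy == "default":
        # "default" drops the CLS hidden state from the vision features
        num_image_tokens -= 1
    return num_image_tokens

# CLIP ViT-L/14-336 tower, adds CLS: 24*24 + 1 - 1 = 576 placeholders
print(count_image_placeholders(336, 336, 14, 1, "default"))
# SigLIP-style tower without CLS: 27*27 + 0 = 729 placeholders
print(count_image_placeholders(384, 384, 14, 0, "full"))
```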

Collaborator:

ok, thanks for explaining!

@ArthurZucker (Collaborator) left a comment:

just 1 place to fix

@zucchini-nlp (Member, author):

Cool, merging soon. I'll update the hub configs for the llava-hf models this week and open PRs for the BLIP models. We've had the deprecation warning for 3-4 releases, so we should be fine updating the configs.

@zucchini-nlp merged commit 1646ffb into huggingface:main on Nov 18, 2024
18 checks passed
BernardZach pushed a commit to BernardZach/transformers that referenced this pull request on Dec 5, 2024 (…3424):

* use num additional tokens

* fix copies + docs

* another fix copies :)

* add docs

* move order for BC