SmolVLM2 #36126
Conversation
cc @ArthurZucker fyi this is needed for release, otherwise it will be too much work on the user's side 🥲
molbap left a comment
Hey, I added a couple comments! LMK if something is unclear and ping me when you're done with these, I'll re-iterate quickly 🤗
And a first comment: make sure to run the formatter/linter/checker locally. You'll need to install the dev tools within the transformers repo:
pip install -e .[quality]
Then run this command, which covers all the checks to make CI happy:
make fixup
molbap left a comment
Added a couple comments for chat_template, let's go!
@molbap @zucchini-nlp
I believe all the comments on the PR have been addressed. Let me know if I missed anything / if you have any new comments.
zucchini-nlp left a comment
Thanks!
Co-authored-by: Pedro Cuenca <pedro@huggingface.co>
ArthurZucker left a comment
Great work everyone!
Not a super fan of what's happening with the processor / video processor each only doing half of the job; we really need to split the work and keep it simpler: the video processor just returns post-processed videos / sampled frames and some metadata, while the processor should merge text with these!
Main question: what about multi-turn?
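For illustration, a rough sketch of that division of labor (class and method names here are hypothetical, not the merged API):

```python
# Hypothetical sketch of the proposed split, not the shipped implementation.
class SmolVLMVideoProcessor:
    def __call__(self, videos):
        # Vision-only: sample frames, resize/normalize, and report metadata
        # (fps, timestamps, ...) -- no text handling at all.
        frames, metadata = self._sample_and_preprocess(videos)  # assumed helper
        return {"pixel_values": frames, "video_metadata": metadata}


class SmolVLMProcessor:
    def __call__(self, text, videos=None, **kwargs):
        # Text + merge: tokenize, then fold in whatever the video processor returned.
        inputs = self.tokenizer(text, **kwargs)
        if videos is not None:
            inputs.update(self.video_processor(videos))
        return inputs
```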
```python
    in forward. Instead, we override inputs_merger here with custom logic.
    """

def inputs_merger(
```
As I mentioned to @zucchini-nlp, the way you train, whether you use deepspeed or not, is irrelevant to transformers. If this is training-specific / data pre-processing, it should happen outside the modeling code. Please add a data collator for SmolVLM if you want people to use this for training!
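For concreteness, a minimal sketch of such a collator (field names, padding values, and the label masking are assumptions, not the PR's implementation):

```python
import torch
from torch.nn.utils.rnn import pad_sequence


class SmolVLMDataCollator:
    """Hypothetical collator: pads text and stacks vision tensors outside the modeling code."""

    def __init__(self, processor):
        self.pad_token_id = processor.tokenizer.pad_token_id

    def __call__(self, features):
        input_ids = [torch.as_tensor(f["input_ids"]) for f in features]
        batch = {"input_ids": pad_sequence(input_ids, batch_first=True, padding_value=self.pad_token_id)}
        batch["attention_mask"] = (batch["input_ids"] != self.pad_token_id).long()
        # Mask padding out of the loss.
        labels = batch["input_ids"].clone()
        labels[batch["attention_mask"] == 0] = -100
        batch["labels"] = labels
        if "pixel_values" in features[0]:
            batch["pixel_values"] = torch.stack([torch.as_tensor(f["pixel_values"]) for f in features])
        return batch
```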
let's merge @molbap's suggestion here!
maybe we can just use the idefics3 merger then, copying it with modular? Looks pretty similar to me
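With the modular system that would boil down to something like this in the modular file (a sketch; the class name follows the PR's naming, and the inheritance is an assumption):

```python
# modular_smolvlm.py (sketch): reuse Idefics3's merger by plain inheritance and let
# the modular converter generate the copied code in modeling_smolvlm.py.
from transformers.models.idefics3.modeling_idefics3 import Idefics3Model


class SmolVLMModel(Idefics3Model):
    pass  # inputs_merger is inherited from Idefics3Model unchanged
```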
```python
if not any(real_images_inds):
    # no images, leave one empty image.
    real_images_inds[0] = True
```
What does `inds` mean?
```python
# Handle the vision attention mask
if pixel_attention_mask is None:
    pixel_attention_mask = torch.ones(
        size=(pixel_values.size(0), pixel_values.size(2), pixel_values.size(3)),
```
the shape function can help you 😄
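i.e., roughly (a sketch of the suggestion; the dtype/device choices here are assumptions):

```python
# Unpack the shape once instead of calling .size(dim) three times.
batch_size, _, height, width = pixel_values.shape
pixel_attention_mask = torch.ones(
    (batch_size, height, width), dtype=torch.bool, device=pixel_values.device
)
```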
```python
def is_url(val) -> bool:
    return isinstance(val, str) and val.startswith("http")


def is_str(val) -> bool:
    return isinstance(val, str)


def is_image_or_image_url(elem):
    return is_url(elem) or is_valid_image(elem)
```
not a fan of these, especially given that we usually define them in processing utils or image utils, which is where replacing the input chat template happens!
```python
}


SmolVLMProcessorKwargs.__annotations__["images_kwargs"] = SmolVLMImagesKwargs  # python 3.8 compatibility
```
we no longer support 3.8, so this is not needed
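i.e., the annotation can just live on the class directly (a sketch, assuming `SmolVLMImagesKwargs` is defined above):

```python
from transformers.processing_utils import ProcessingKwargs


class SmolVLMProcessorKwargs(ProcessingKwargs, total=False):
    # On Python >= 3.9 the class-level annotation is enough; no __annotations__ patch needed.
    images_kwargs: SmolVLMImagesKwargs
```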
```python
    the docstring of this method for more information.
    """
    decode_output = self.tokenizer.decode(*args, **kwargs)
    return self._regex_to_remove_extra_special_tokens.sub("<image>", decode_output)
```
sorry, not sure I understand this?
```python
    refer to the docstring of this method for more information.
    """
    batched_decode_output = self.tokenizer.batch_decode(*args, **kwargs)
    return [self._regex_to_remove_extra_special_tokens.sub("<image>", s) for s in batched_decode_output]
```
I don't get why we have to do this either?
```python
# Matches one or more occurrences of <row_x_col_y> tags (where x and y are digits), optionally surrounded by newline characters
self._regex_to_remove_extra_special_tokens = re.compile(r"(<row_\d+_col_\d+>\n?)+")
```
if you add them as special tokens you won't have to go through all this trouble. These tokens are processed by the model anyway, no?
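e.g., something along these lines (the grid bounds and variable names here are assumptions):

```python
# Register the <row_x_col_y> tags as additional special tokens once...
row_col_tokens = [f"<row_{r}_col_{c}>" for r in range(1, 11) for c in range(1, 11)]
tokenizer.add_special_tokens({"additional_special_tokens": row_col_tokens})

# ...then decoding can drop them without any regex post-processing.
text = tokenizer.decode(generated_ids, skip_special_tokens=True)
```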
I left this to stay compatible with smolvlm1 -- I think we can drop it if we don't want to support that class here (I will just need to test).
```python
if past_seen_tokens == 0 and inputs_embeds is not None and image_hidden_states is not None:
    # When we generate, we don't want to replace the potential image_token_id that we generated by images
    # that simply don't exist
    inputs_embeds = self.inputs_merger(
        input_ids=input_ids,
        inputs_embeds=inputs_embeds,
        image_hidden_states=image_hidden_states,
    )
```
Question again here, but this disables multi-turn image processing. Basically, what if you want to use a different video / pass another image?
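For concreteness, the scenario being asked about is roughly this (a hypothetical usage sketch, not code from the PR):

```python
# Turn 1: past_seen_tokens == 0 at prefill, so inputs_merger runs and the first
# image's hidden states replace its <image> placeholders.
inputs_1 = processor(text=prompt_1, images=[image_1], return_tensors="pt")
out_1 = model(**inputs_1, use_cache=True)

# Turn 2: new tokens and a *new* image are appended, reusing the turn-1 cache.
# Now past_seen_tokens > 0, so the gate above skips inputs_merger entirely and
# the new image's hidden states never make it into inputs_embeds.
inputs_2 = processor(text=prompt_2, images=[image_2], return_tensors="pt")
out_2 = model(**inputs_2, past_key_values=out_1.past_key_values, use_cache=True)
```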
Pushed a fix for test_training being unhappy + an attempt at vectorizing the merger that seems to work, would need another eye on it :)

@molbap thanks! I see you also added back the auto-map. The model was removed from

Yeah, it's just that if the model is not on the correct mappings,

@molbap I think we can add the

alright let me do that! then should be good

Thanks all!
What does this PR do?
SmolVLM2 support.
Before submitting
- Did you read the contributor guideline, Pull Request section?
- Was this discussed/approved via a GitHub issue or the forum? Please add a link to it if that's the case.
- Did you make sure to update the documentation with your changes? Here are the documentation guidelines, and here are tips on formatting docstrings.
Who can review?