
Conversation

@orrzohar (Contributor)

What does this PR do?

SmolVLM2 support

Before submitting

  • This PR fixes a typo or improves the docs (you can dismiss the other checks if that's the case).
  • [yes] Did you read the contributor guideline, Pull Request section?
  • [no] Was this discussed/approved via a GitHub issue or the forum? Please add a link to it if that's the case.
  • Did you make sure to update the documentation with your changes? Here are the documentation guidelines, and here are tips on formatting docstrings.
  • Did you write any new necessary tests?

Who can review?

@merveenoyan (Contributor)

cc @ArthurZucker FYI, this is needed for the release, otherwise it will be too much work on the users' side 🥲

@ArthurZucker requested a review from molbap on February 11, 2025 at 10:20
@molbap (Contributor) left a comment

Hey, I added a couple of comments! LMK if something is unclear, and ping me when you're done with these; I'll iterate again quickly 🤗
And a first comment: make sure to run the formatter/linter/checker locally. You'll need to install the dev tools within the transformers repo:

```bash
pip install -e .[quality]
```

Then run this command, which covers all the checks needed to make CI happy:

```bash
make fixup
```

@molbap (Contributor) left a comment

Added a couple comments for chat_template, let's go!

@orrzohar (Contributor, Author) commented Feb 11, 2025

@molbap @zucchini-nlp
Overall comments:

  1. I have committed a revision that uses modular transformers, as suggested by @molbap. This required a minor edit to modeling_idefics3 to be compatible with the modular workflow.
  2. I have refactored the inputs to the processor to be closer to the standard.
  3. I have removed our overwrite of apply_chat_template in favor of a new function that converts the video tokens into the expected sequence of text and image tokens, so we can use the original apply_chat_template.
  4. I have updated load_video to accept frame_indices, allowing users to pass which frame indices they are interested in loading. I also created get_video_details, which fetches video metadata (fps, duration, frame count); see the sketch below. We now use load_video rather than having custom video handling logic in smolvlm.

I believe all the comments on the PR have been addressed. Let me know if I missed anything or if you have any new comments.
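A rough usage sketch of point 4 (the import path, argument names, and return shape here are assumptions, not verified against the final diff):

```python
# Sketch only: import path and signature are assumptions, not the verified API.
from transformers.image_utils import load_video

# get_video_details-style metadata (fps, duration, frame count) would let a
# caller compute the indices it wants before loading.
frame_indices = [0, 8, 16, 24]  # hypothetical: every 8th frame
frames = load_video("clip.mp4", frame_indices=frame_indices)
```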

@zucchini-nlp (Member) left a comment
Thanks!

@ArthurZucker (Collaborator) left a comment

Great work everyone!
Not a super fan of the processor / video processor each doing half of the job; we really need to split the work and keep it simpler:
the video processor should just return post-processed videos / sampled frames plus some metadata, while the processor should merge text with these!
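For example, the split described above could look like this (a sketch with hypothetical names, not the API in this PR):

```python
# Hypothetical division of labor, for illustration only.
video_outputs = video_processor(video, num_frames=8)
# -> {"pixel_values": ..., "metadata": {"fps": ..., "duration": ...}}

# The processor alone interleaves text tokens with frame placeholders.
inputs = processor(text=prompt, videos=video_outputs["pixel_values"])
```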

Main question: what about multi-turn?

```python
    in forward. Instead, we override inputs_merger here with custom logic.
    """

    def inputs_merger(
```
Collaborator:

As I mentioned to @zucchini-nlp, the way you train, and whether you use DeepSpeed or not, is irrelevant to transformers. If this is training-specific / data pre-processing, it should happen outside the modeling code. Please add a data collator for SmolVLM if you want people to use this for training!
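A minimal sketch of such a collator (hypothetical class and field names, not part of this PR):

```python
class SmolVLMDataCollator:
    """Pads a batch with the processor and builds training labels."""

    def __init__(self, processor):
        self.processor = processor

    def __call__(self, examples):
        texts = [ex["text"] for ex in examples]
        images = [ex["images"] for ex in examples]
        batch = self.processor(
            text=texts, images=images, padding=True, return_tensors="pt"
        )
        labels = batch["input_ids"].clone()
        # Mask padding (and, in practice, image placeholder tokens) out of the loss.
        labels[labels == self.processor.tokenizer.pad_token_id] = -100
        batch["labels"] = labels
        return batch
```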

Collaborator:

Let's merge @molbap's suggestion here!

Member:

Maybe we can just use the Idefics3 merger then, with modular copying it? Looks pretty similar to me.
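In modular-transformers terms, that reuse could look something like this (a sketch; the actual modular file may differ):

```python
# modular_smolvlm.py (sketch): inheriting from Idefics3 pulls in its
# inputs_merger implementation instead of re-implementing it.
from transformers.models.idefics3.modeling_idefics3 import Idefics3Model


class SmolVLMModel(Idefics3Model):
    pass  # inputs_merger inherited unchanged from Idefics3Model
```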

Comment on lines +237 to +239
```python
if not any(real_images_inds):
    # no images, leave one empty image.
    real_images_inds[0] = True
```
Collaborator:

What does `inds` mean?

```python
# Handle the vision attention mask
if pixel_attention_mask is None:
    pixel_attention_mask = torch.ones(
        size=(pixel_values.size(0), pixel_values.size(2), pixel_values.size(3)),
```
Collaborator:

the shape function can help you 😄
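Presumably something like the following (a sketch of the suggestion; the dtype and device arguments are assumptions):

```python
# Unpacking pixel_values.shape reads better than chained .size() calls.
batch_size, _, height, width = pixel_values.shape
pixel_attention_mask = torch.ones(
    (batch_size, height, width),
    dtype=torch.bool,  # assumption: the mask dtype used elsewhere in the model
    device=pixel_values.device,
)
```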

Comment on lines 55 to 64
```python
def is_url(val) -> bool:
    return isinstance(val, str) and val.startswith("http")


def is_str(val) -> bool:
    return isinstance(val, str)


def is_image_or_image_url(elem):
    return is_url(elem) or is_valid_image(elem)
```
Collaborator:

Not a fan of these, especially given that we define them in processing utils or image utils, which is where replacing the input chat template usually happens!

```python
}


SmolVLMProcessorKwargs.__annotations__["images_kwargs"] = SmolVLMImagesKwargs  # python 3.8 compatibility
```
Collaborator:

We no longer support Python 3.8, so this isn't needed.
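Without the 3.8 workaround, the annotation can live directly on the class (a sketch, assuming the usual ProcessingKwargs pattern):

```python
# Sketch: declare the annotation directly instead of patching __annotations__.
class SmolVLMProcessorKwargs(ProcessingKwargs, total=False):
    images_kwargs: SmolVLMImagesKwargs
```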

```python
        the docstring of this method for more information.
        """
        decode_output = self.tokenizer.decode(*args, **kwargs)
        return self._regex_to_remove_extra_special_tokens.sub("<image>", decode_output)
```
Collaborator:

Sorry, not sure I understand this?

```python
        refer to the docstring of this method for more information.
        """
        batched_decode_output = self.tokenizer.batch_decode(*args, **kwargs)
        return [self._regex_to_remove_extra_special_tokens.sub("<image>", s) for s in batched_decode_output]
```
Collaborator:

I don't get why we have to do this either?

Comment on lines 176 to 177
```python
# Matches one or more occurrences of <row_x_col_y> tags (where x and y are digits), optionally surrounded by newline characters
self._regex_to_remove_extra_special_tokens = re.compile(r"(<row_\d+_col_\d+>\n?)+")
```
Collaborator:

If you add them as special tokens, you won't have to go through all this trouble. These tokens are processed by the model anyway, no?
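A sketch of that suggestion (note the behavior differs slightly: skip_special_tokens drops the tags entirely, while the current regex replaces each run with a single <image>):

```python
# Register the grid tags as additional special tokens once...
tokenizer.add_special_tokens(
    {"additional_special_tokens": ["<row_1_col_1>", "<row_1_col_2>"]}  # etc.
)
# ...then decoding can drop them without any regex post-processing.
text = tokenizer.decode(generated_ids, skip_special_tokens=True)
```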

Contributor (Author):

I left this in to stay compatible with SmolVLM1 -- I think we can drop it if we don't want to support that class here (I will just need to test).

Comment on lines 324 to 331
```python
if past_seen_tokens == 0 and inputs_embeds is not None and image_hidden_states is not None:
    # When we generate, we don't want to replace the potential image_token_id that we generated by images
    # that simply don't exist
    inputs_embeds = self.inputs_merger(
        input_ids=input_ids,
        inputs_embeds=inputs_embeds,
        image_hidden_states=image_hidden_states,
    )
```
Collaborator:

Question again here: this disables multi-turn image processing. Basically, what if you want to use a different video or pass another image?

@molbap (Contributor) commented Feb 19, 2025

Pushed a fix for test_training being unhappy, plus an attempt at vectorizing the merger that seems to work; it would need another eye on it :)
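For context, a vectorized merger typically looks something like this (a generic sketch, not necessarily the exact code pushed here):

```python
# Replace every image placeholder embedding with the corresponding image
# embedding in one masked_scatter call, avoiding a Python loop over positions.
special_image_mask = (input_ids == image_token_id).unsqueeze(-1).expand_as(inputs_embeds)
image_embeds = image_hidden_states.to(inputs_embeds.dtype).reshape(-1, inputs_embeds.shape[-1])
inputs_embeds = inputs_embeds.masked_scatter(special_image_mask, image_embeds)
```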

@zucchini-nlp (Member)

@molbap thanks! I see you also added back the auto-map. The model was removed from MODEL_FOR_VISION_2_SEQ_MAPPING_NAMES intentionally. We should aim to use only the ImageTextToText mapping for VLMs, instead of duplicating it across two auto-classes.

@molbap (Contributor) commented Feb 19, 2025

Yeah, it's just that if the model is not in the correct mappings, test_training fails unless we manually add some labels (otherwise they are not added by the auto class mapper). Plus, Idefics3 had this double mapping AFAIK?
+1 to removing it though, as long as tests pass, especially the training tests!

@zucchini-nlp (Member) commented Feb 19, 2025

@molbap I think we can add the ImageTextToText mapping in _prepare_for_class to make the tests happy, along with the CausalLM mapping etc. Idefics3 indeed has both, and so do all other VLMs, but it doesn't really make sense to use Vision2Seq for new models, especially after we released the pipeline and have been promoting the image-text-to-text tag on the Hub.

@molbap (Contributor) commented Feb 19, 2025

Alright, let me do that! Then it should be good.

@LysandreJik (Member)

Thanks all!

@LysandreJik merged commit 4397dfc into huggingface:main on Feb 20, 2025
19 of 21 checks passed