UIUC VLM PR compressed fixed #511
Conversation
Thanks for your pull request! It looks like this may be your first contribution to a Google open source project. Before we can look at your pull request, you'll need to sign a Contributor License Agreement (CLA). View this failed invocation of the CLA check for more information. For the most up to date status, view the checks section at the bottom of the pull request.
Force-pushed 052a484 to 03a1804
Force-pushed 03a1804 to 0b1d000
@immuntasir - let me know when this is ready for review!
@Tianjiao-Yu confirmed that this is ready for review.
examples/dpo_demo_gemma3.ipynb
Outdated
@Tianjiao-Yu I think this should be removed from this PR.
abheesht17
left a comment
Quick review, I'll do another pass tomorrow
examples/vl_dpo_demo_gemma3.ipynb
Outdated
"source": [
"# Fine-tuning a Visual Language Model (VLM) using DPO\n",
"\n",
"This notebook demonstrates how to fine-tune a Visual Language Model (VLM), specifically the Gemma 3-1B-it model, using the Direct Preference Optimization (DPO) algorithm.\n",
Gemma 3-1B-it model

This is a text-only model, though. 4B onwards are VLMs.
tunix/sft/dpo/dpo_trainer.py
Outdated
  This can be used when inputs are raw strings. Tokenization, padding and
- preprocessing is taken care of by `DPOTrainer`.
+ preprocessing is taken care of by `DpoTrainer`.
examples/dpo_demo_gemma3.ipynb
Outdated
tunix/generate/tokenizer_adapter.py
Outdated
elif self._tokenizer_type == TokenizerType.HFP:
  inputs = self._tokenizer(text=text, **kwargs)
  if 'images' in kwargs:
    return inputs['input_ids'], inputs['pixel_values']
Better to return a dictionary here rather than a tuple (in case we add more modalities later)?
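A minimal sketch of what the suggested dict return could look like. The function name and shape are illustrative, not the actual `tokenizer_adapter` API; the point is that new modalities become additive keys rather than a breaking change to a tuple's arity:

```python
def tokenize(inputs: dict) -> dict:
    """Return tokenized outputs keyed by modality (hypothetical sketch).

    A dict lets callers look up only the modalities they need, and adding
    e.g. audio later does not break existing unpacking code the way
    extending a (input_ids, pixel_values) tuple would.
    """
    out = {"input_ids": inputs["input_ids"]}
    if "pixel_values" in inputs:
        out["pixel_values"] = inputs["pixel_values"]
    # Future modalities slot in here, e.g. out["audio_values"] = ...
    return out
```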
tunix/generate/tokenizer_adapter.py
Outdated
HF: str = 'hf'  # huggingface tokenizer
HFP: str = 'hfp'  # huggingface processor
Is the only difference between these two that the processor can take images, and other modalities too? If yes, do you think we should just use HF processor everywhere (and remove HF tokeniser)?
Because if processor(text) works, we can just use processor everywhere
I don't think every tokenizer has an associated processor definition, so it probably makes sense to have both.
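The compromise the thread lands on (prefer the processor, fall back to the tokenizer when no processor config exists) could be sketched as a hypothetical helper. The loader classes are passed in as parameters here purely so the sketch stays generic; in practice they would be `transformers.AutoProcessor` and `transformers.AutoTokenizer`:

```python
def load_preprocessor(model_id, processor_cls, tokenizer_cls):
    """Try a multimodal processor first; fall back to a plain tokenizer.

    Returns (object, kind) where kind is 'hfp' for a processor or 'hf'
    for a tokenizer, mirroring the TokenizerType values in the diff.
    Hypothetical sketch, not Tunix code.
    """
    try:
        # Processors handle text plus images (and other modalities).
        return processor_cls.from_pretrained(model_id), "hfp"
    except Exception:
        # Text-only models often ship no processor config at all.
        return tokenizer_cls.from_pretrained(model_id), "hf"
```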
tunix/generate/utils.py
Outdated
# Defaults compatible with CLIP / many SigLIP configs; override if needed.
_CLIP_MEAN = jnp.array([0.48145466, 0.4578275, 0.40821073], dtype=jnp.float32)
_CLIP_STD = jnp.array([0.26862954, 0.26130258, 0.27577711], dtype=jnp.float32)
Do you think we can move it to models/siglip?
tunix/generate/utils.py
Outdated
    mean: Iterable[float] = _CLIP_MEAN,
    std: Iterable[float] = _CLIP_STD,
) -> jnp.ndarray:
  """Resize + normalize images for SigLIP.
Just SigLIP? Does it not work for other vision models? In generate/utils.py, we should have generic functions (as much as possible)
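A generic version of the helper, per the comment, would take `mean`/`std` as required arguments so nothing in `generate/utils.py` is tied to one vision tower; model-specific constants would live next to the model (e.g. models/siglip). Sketch below, with NumPy standing in for `jnp` and `normalize_images` as an illustrative name:

```python
import numpy as np

def normalize_images(images, mean, std):
    """Channel-wise normalize a batch of [B, H, W, 3] images in [0, 1].

    mean/std are required per-channel values supplied by the caller, so the
    same helper serves SigLIP, CLIP, or any other vision encoder; no
    model-specific defaults are baked in. Illustrative sketch only.
    """
    mean = np.asarray(mean, dtype=np.float32)
    std = np.asarray(std, dtype=np.float32)
    images = np.asarray(images, dtype=np.float32)
    return (images - mean) / std  # broadcasts over the trailing channel dim
```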
tunix/models/gemma3/model.py
Outdated
if self.config.multimodal:
  assert pixel_values is not None
  image_mask = last_tokens == 262144  # 262144: <image_soft_token>
Better to define this somewhere instead of hardcoding
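The suggested fix could look like the sketch below: name the magic number once at module (or config) level instead of hardcoding it at the call site. `GEMMA3_IMAGE_SOFT_TOKEN_ID` is an illustrative name; the value 262144 is taken from the diff above:

```python
# Token id of <image_soft_token> in the Gemma 3 vocabulary (value from the
# diff above; the constant name here is illustrative, not Tunix's).
GEMMA3_IMAGE_SOFT_TOKEN_ID = 262144

def image_mask(last_tokens):
    """Mark which positions hold image soft tokens.

    Call sites then read `tok == GEMMA3_IMAGE_SOFT_TOKEN_ID` instead of a
    bare 262144, so the id is defined (and changeable) in one place.
    """
    return [tok == GEMMA3_IMAGE_SOFT_TOKEN_ID for tok in last_tokens]
```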
Oh, I didn't mean to request so many reviews. Not sure how that happened. Maybe from the CLA failing?

Looks good. Could you please give me edit access to this branch? I'll resolve merge conflicts and make a few changes (especially regarding the multiple images point). Thanks!
@abheesht17 For the multiple image support, you may want to look at this commit as a reference point (mostly files ). Another change that you might be interested in is saving LoRA params for multimodal Gemma. Alternatively, I can create a separate pull request for it after the current PR is merged.
I was going through this again, and found a few issues:
- Image tokens should have bidirectional attention, but I don't see that in the code.
- We should support multiple images.
- We have a Hugging Face preprocessor ("hfp"), but we don't seem to be using it. Also, the special tokens in the HF preprocessor/tokeniser are different from the upstream GDM implementation.
- Gemma 3 uses special start of image tokens, end of image tokens, etc., which are not there in the code.
I have a WIP PR for resolving some of these issues. Give me some time.
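The bidirectional-attention point can be illustrated with a small mask-building sketch. For simplicity this treats all image soft tokens as one group that attends to itself in both directions; the actual Gemma 3 implementation scopes the bidirectional block per image, so take this as a conceptual sketch only:

```python
import numpy as np

def gemma3_style_mask(is_image):
    """Causal attention mask, with bidirectional attention among image tokens.

    is_image: boolean sequence of length L marking image soft tokens.
    Returns a [L, L] boolean mask where mask[q, k] means query position q
    may attend to key position k. Text tokens attend causally; image tokens
    additionally attend to every other image token, forward as well as
    backward. Simplified: assumes a single image block.
    """
    is_image = np.asarray(is_image, dtype=bool)
    L = len(is_image)
    causal = np.tril(np.ones((L, L), dtype=bool))   # standard causal mask
    bidir = is_image[:, None] & is_image[None, :]    # image <-> image pairs
    return causal | bidir
```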
tunix/models/siglip/preprocess.py
Outdated
_CLIP_STD = jnp.array([0.26862954, 0.26130258, 0.27577711], dtype=jnp.float32)

def preprocess(
I don't see this function being used anywhere
},
"outputs": [],
"source": [
"gemma_tokenizer = tokenizer_lib.Tokenizer(tokenizer_path=GEMMA_TOKENIZER_PATH)"
Why can we not use the HF processor directly?
examples/vl_dpo_demo_gemma3.ipynb
Outdated
"model_config = dataclasses.replace(\n",
"    model_config, multimodal=True, num_embed=262208\n",
Why don't we just expose multimodal as an arg in gemma3_model_lib.ModelConfig.gemma3_4b(multimodal=True)?
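The suggested API could be sketched with a simplified stand-in for `gemma3_model_lib.ModelConfig` (fields reduced to the two relevant ones; the `num_embed` values are taken from the notebook diff and the image-token discussion above, so treat them as assumptions):

```python
import dataclasses

@dataclasses.dataclass(frozen=True)
class ModelConfig:
    """Simplified stand-in for gemma3_model_lib.ModelConfig (sketch only)."""
    num_embed: int
    multimodal: bool = False

    @classmethod
    def gemma3_4b(cls, multimodal: bool = False):
        # Multimodal checkpoints carry extra vision special tokens, hence
        # the larger embedding table; values come from the notebook diff.
        return cls(
            num_embed=262208 if multimodal else 262144,
            multimodal=multimodal,
        )
```

This replaces the `dataclasses.replace(model_config, multimodal=True, num_embed=262208)` dance in the notebook with a single `ModelConfig.gemma3_4b(multimodal=True)` call, keeping the coupled field values in one place.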
tunix/models/gemma3/model.py
Outdated
@@ -927,18 +1001,26 @@ def __call__(
    positions: jaxtyping.Array,  # [B, L]
    cache: Cache | None,  # (sequence length L')
    attention_mask: jaxtyping.Array,  # [B, L, L']
Gemma 3 is supposed to have bidirectional attention for image tokens, but I don't see that here, or in the VLM DPO notebook.
…ed SigLIP preprocess
PiperOrigin-RevId: 884468159
Resolves #510
This PR introduces multimodal support to Tunix's Gemma3 model and adds a new vision-language DPO demonstration notebook (vl_dpo_demo_gemma3.ipynb), extending the framework to handle image-text reasoning and multimodal alignment. Key changes include:
@Tianjiao-Yu led this effort and @jxiong21029 contributed to the Gemma3 integration. Please also mention @Tianjiao-Yu if you have any questions/comments/feedback.
Colab Notebook
vl_dpo_demo_gemma3.ipynb
Checklist