Conversation

@BigFaceBoy

SUMMARY:
LLM Compressor doesn't currently have any examples for InternVL3. It truly took me a lot of time to quantize it successfully, so I want to share this example.

TEST PLAN:
"please outline how the changes were tested"

@github-actions

👋 Hi! Thank you for contributing to llm-compressor. Please add the ready label when the PR is ready for review.

Note: This is required to complete the testing suite, please only add the label once the PR is code complete and local testing has been performed.

@gemini-code-assist
Contributor

Summary of Changes

Hello @BigFaceBoy, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request introduces a much-needed example for quantizing the InternVL3-8B multimodal model within the LLM Compressor framework. It provides a complete guide, from initial model setup and data preparation to defining the quantization recipe and evaluating the results. The addition aims to simplify the process for other users who wish to apply quantization to this specific model, sharing the author's successful approach to a previously challenging task.

Highlights

  • New Quantization Example: Added a comprehensive example for quantizing the InternVL3-8B multimodal model, addressing a previous lack of such examples in the LLM Compressor framework.
  • Detailed Documentation: Introduced a new internvl3_README.md file that provides step-by-step instructions for preparing the model, setting up the dataset, defining the quantization recipe, and evaluating the quantized model's accuracy and performance.
  • Python Script for Quantization: Included a new Python script, internvl3_example.py, which implements the entire quantization workflow, including custom image preprocessing functions and a data collator tailored for InternVL3's multimodal inputs.
  • Specific Quantization Recipe: The example utilizes an 8-bit float quantization scheme for both weights and input activations, as well as the KV cache, with specific ignore patterns for lm_head and mlp1 layers (a rough sketch of such a recipe appears after this list).
  • Model Preparation Details: The example highlights the necessity of manually modifying the forward function in the modeling_internvl_chat.py file and placing a chat_template.jinja file in the local model directory for successful quantization.
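
For orientation, here is a rough sketch of what an FP8 weight/activation/KV-cache recipe with those ignore patterns could look like, following the string-recipe and `oneshot` pattern used in other llm-compressor examples. The model path, calibration dataset, and `data_collator` below are placeholders standing in for the ones defined in `internvl3_example.py`, so treat this as an outline rather than the exact code added by this PR.

```python
import torch
from transformers import AutoModel, AutoTokenizer

from llmcompressor import oneshot

# Placeholder: a local copy of OpenGVLab/InternVL3-8B with the patched forward().
MODEL_ID = "./InternVL3-8B"

model = AutoModel.from_pretrained(
    MODEL_ID, torch_dtype=torch.bfloat16, trust_remote_code=True
)
tokenizer = AutoTokenizer.from_pretrained(MODEL_ID, trust_remote_code=True)

# FP8 weights + input activations + FP8 KV cache, skipping lm_head and the
# mlp1 multimodal projector, matching the ignore patterns described above.
recipe = """
quant_stage:
    quant_modifiers:
        QuantizationModifier:
            ignore: ["re:.*lm_head", "re:mlp1.*"]
            config_groups:
                group_0:
                    weights:
                        num_bits: 8
                        type: float
                        strategy: tensor
                        dynamic: false
                        symmetric: true
                    input_activations:
                        num_bits: 8
                        type: float
                        strategy: tensor
                        dynamic: false
                        symmetric: true
                    targets: ["Linear"]
            kv_cache_scheme:
                num_bits: 8
                type: float
                strategy: tensor
                dynamic: false
                symmetric: true
"""

oneshot(
    model=model,
    dataset=calibration_dataset,   # placeholder: built elsewhere in the script
    recipe=recipe,
    max_seq_length=2048,
    num_calibration_samples=512,
    data_collator=data_collator,   # placeholder: the InternVL3-specific collator
)

model.save_pretrained("InternVL3-8B-FP8", save_compressed=True)
tokenizer.save_pretrained("InternVL3-8B-FP8")
```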
Contributor

@gemini-code-assist bot left a comment

Code Review

This pull request introduces a new example for quantizing the InternVL3-8B model, which is a valuable addition. The submission includes a README file with instructions and a corresponding Python script. My review focuses on improving the clarity, correctness, and maintainability of this new example.

The most critical issue is that the provided manual patch for the model's forward method appears to disable vision-processing capabilities. This is not explained and creates contradictions within the example, which is intended for a multimodal model. This could lead to significant confusion for users.

Other feedback includes correcting a typo and informal language in the documentation, removing unused code from the example script, and a small refactoring for efficiency and clarity. Addressing these points will make the example much more robust and easier for the community to use.

Comment on lines +10 to +88
```python
def forward(
    self,
    pixel_values: torch.FloatTensor,
    input_ids: torch.LongTensor = None,
    attention_mask: Optional[torch.Tensor] = None,
    position_ids: Optional[torch.LongTensor] = None,
    image_flags: Optional[torch.LongTensor] = None,
    past_key_values: Optional[List[torch.FloatTensor]] = None,
    labels: Optional[torch.LongTensor] = None,
    use_cache: Optional[bool] = None,
    output_attentions: Optional[bool] = None,
    output_hidden_states: Optional[bool] = None,
    return_dict: Optional[bool] = None,
) -> Union[Tuple, CausalLMOutputWithPast]:
    return_dict = return_dict if return_dict is not None else self.config.use_return_dict

    # image_flags = image_flags.squeeze(-1)
    input_embeds = self.language_model.get_input_embeddings()(input_ids).clone()

    # vit_embeds = self.extract_feature(pixel_values)
    # vit_embeds = vit_embeds[image_flags == 1]
    # vit_batch_size = pixel_values.shape[0]

    # B, N, C = input_embeds.shape
    # input_embeds = input_embeds.reshape(B * N, C)

    # if torch.distributed.is_initialized() and torch.distributed.get_rank() == 0:
    #     print(f'dynamic ViT batch size: {vit_batch_size}, images per sample: {vit_batch_size / B}, dynamic token length: {N}')

    # input_ids = input_ids.reshape(B * N)
    # selected = (input_ids == self.img_context_token_id)
    # try:
    #     input_embeds[selected] = input_embeds[selected] * 0.0 + vit_embeds.reshape(-1, C)
    # except Exception as e:
    #     vit_embeds = vit_embeds.reshape(-1, C)
    #     print(f'warning: {e}, input_embeds[selected].shape={input_embeds[selected].shape}, '
    #           f'vit_embeds.shape={vit_embeds.shape}')
    #     n_token = min(selected.sum(), vit_embeds.size(0))
    #     input_embeds[selected][:n_token] = input_embeds[selected][:n_token] * 0.0 + vit_embeds[:n_token]

    # input_embeds = input_embeds.reshape(B, N, C)

    outputs = self.language_model(
        inputs_embeds=input_embeds,
        attention_mask=attention_mask,
        position_ids=position_ids,
        past_key_values=past_key_values,
        use_cache=use_cache,
        output_attentions=output_attentions,
        output_hidden_states=output_hidden_states,
        return_dict=return_dict,
    )
    logits = outputs.logits

    loss = None
    if labels is not None:
        # Shift so that tokens < n predict n
        shift_logits = logits[..., :-1, :].contiguous()
        shift_labels = labels[..., 1:].contiguous()
        # Flatten the tokens
        loss_fct = CrossEntropyLoss()
        shift_logits = shift_logits.view(-1, self.language_model.config.vocab_size)
        shift_labels = shift_labels.view(-1)
        # Enable model parallelism
        shift_labels = shift_labels.to(shift_logits.device)
        loss = loss_fct(shift_logits, shift_labels)

    if not return_dict:
        output = (logits,) + outputs[1:]
        return (loss,) + output if loss is not None else output

    return CausalLMOutputWithPast(
        loss=loss,
        logits=logits,
        past_key_values=outputs.past_key_values,
        hidden_states=outputs.hidden_states,
        attentions=outputs.attentions,
    )
```
Contributor

high

This modified forward function has the vision processing logic commented out (lines 29-50). This means that during calibration, the model will not use the image data, even though the data loader prepares pixel_values. This is very confusing for an example about quantizing a multimodal model and contradicts later steps that evaluate multimodal performance.

Please add a clear explanation for why this is necessary. For example:

  • Is this a temporary workaround to quantize only the language model part first?
  • If so, how should a user proceed to quantize the vision part?
  • If the vision part is not being quantized, the evaluation steps should be adjusted accordingly.

Without this clarification, the example is misleading and users may not be able to correctly quantize their model.

Comment on lines +145 to +146
- `ignore: ["re:.*lm_head", "re:mlp1.*"]`: quantizing llm and vit.
- `ignore: ["re:.*lm_head", "re:mlp1.*", "re:vision_model.*"]`: quantizing llm only.
Contributor

high

The explanation for the recipe options is contradictory given the modification to the forward function. You state that ignore: ["re:.*lm_head", "re:mlp1.*"] is for 'quantizing llm and vit'. However, the patched forward function seems to skip the vision transformer (ViT) execution entirely.

This inconsistency can confuse users. Please clarify if the ViT is actually being quantized. If it's not, the explanation should be corrected.

- 1. Download OpenGVLab/InternVL3-8B from hf
- 2. Download [chat_template.jinja](https://hf-mirror.com/OpenGVLab/InternVL3_5-8B/blob/main/chat_template.jinja), and place it in the local model dir of OpenGVLab/InternVL3-8B
- 3. Replace the `forward` function in OpenGVLab/InternVL3-8B/modeling_internvl_chat.py with the code below.
PS:It is referred to [#1929](https://github.com/vllm-project/llm-compressor/issues/1929), but I don`t know why. Let me know if anyone knows.
Contributor

medium

There's a typo here ("don`t" should be "don't"). Also, the phrasing 'Let me know if anyone knows' is a bit informal for documentation. It would be better to state that the reason is under investigation or simply omit this sentence.

Suggested change
PS:It is referred to [#1929](https://github.com/vllm-project/llm-compressor/issues/1929), but I don`t know why. Let me know if anyone knows.
PS:It is referred to [#1929](https://github.com/vllm-project/llm-compressor/issues/1929), but I don't know why.

Comment on lines +100 to +106
```python
def load_image(image_file, input_size=448, max_num=12):
    image = Image.open(image_file).convert('RGB')
    transform = build_transform(input_size=input_size)
    images = dynamic_preprocess(image, image_size=input_size, use_thumbnail=True, max_num=max_num)
    pixel_values = [transform(image) for image in images]
    pixel_values = torch.stack(pixel_values)
    return pixel_values
```
Contributor

medium

The function load_image is defined here but it is never used in the script. The data_collator uses load_image_from_PIL instead. To improve code clarity and remove unused code, it's best to remove this function.

Comment on lines +10 to +16
# IMPORTANT: Before running this script, you must manually modify the
# `modeling_internvl_chat.py` file in your local copy of the `OpenGVLab/InternVL3-8B`
# model directory
# Replace the original `forward` method of the `InternVLChatModel` class with the
# version provided in `examples/multimodal_vision/internvl3_README.md`.
# And put the `chat_template.jinja` in your local copy of the `OpenGVLab/InternVL3-8B`
# model directory.
Collaborator

Rather than doing this manually, we might be able to replace modules with custom wrappers that handle it automatically; once that PR is in, we can follow its pattern:
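
(The pull request referenced above is not shown here. As a rough illustration of the wrapper idea only, one could patch the loaded model instance at runtime instead of editing modeling_internvl_chat.py on disk; `patched_forward` below is a hypothetical stand-in for the text-only forward quoted earlier, and `model` is assumed to be the already-loaded InternVLChatModel.)

```python
import types
from typing import Optional

import torch

def patched_forward(
    self,
    pixel_values: torch.FloatTensor = None,
    input_ids: torch.LongTensor = None,
    attention_mask: Optional[torch.Tensor] = None,
    image_flags: Optional[torch.LongTensor] = None,  # accepted but ignored here
    **kwargs,
):
    # Text-only path mirroring the patched forward() quoted earlier: embed the
    # input ids and run the language model directly, skipping the vision tower.
    inputs_embeds = self.language_model.get_input_embeddings()(input_ids).clone()
    return self.language_model(
        inputs_embeds=inputs_embeds,
        attention_mask=attention_mask,
        **kwargs,
    )

# Attach the patch to the loaded instance rather than editing the checkpoint files.
model.forward = types.MethodType(patched_forward, model)
```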

@BigFaceBoy
Author

I found out that there is a problem with this approach, so I'll close this PR.

@BigFaceBoy closed this on Nov 4, 2025