Conversation

@BigFaceBoy

SUMMARY:
LLM Compressor doesn't currently have any examples for InternVL3. It truly took me a lot of time to quantize it successfully, so I want to share this example.

TEST PLAN:
"please outline how the changes were tested"

@github-actions

👋 Hi! Thank you for contributing to llm-compressor. Please add the ready label when the PR is ready for review.

Note: This is required to complete the testing suite, please only add the label once the PR is code complete and local testing has been performed.

@gemini-code-assist
Contributor

Summary of Changes

Hello @BigFaceBoy, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request introduces a much-needed example for quantizing the InternVL3-8B multimodal model within the LLM Compressor framework. It provides a complete guide, from initial model setup and data preparation to defining the quantization recipe and evaluating the results. The addition aims to simplify the process for other users who wish to apply quantization to this specific model, sharing the author's successful approach to a previously challenging task.

Highlights

  • New Quantization Example: Added a comprehensive example for quantizing the InternVL3-8B multimodal model, addressing a previous lack of such examples in the LLM Compressor framework.
  • Detailed Documentation: Introduced a new internvl3_README.md file that provides step-by-step instructions for preparing the model, setting up the dataset, defining the quantization recipe, and evaluating the quantized model's accuracy and performance.
  • Python Script for Quantization: Included a new Python script, internvl3_example.py, which implements the entire quantization workflow, including custom image preprocessing functions and a data collator tailored for InternVL3's multimodal inputs.
  • Specific Quantization Recipe: The example utilizes an 8-bit float quantization scheme for both weights and input activations, as well as the KV cache, with specific ignore patterns for lm_head and mlp1 layers (a rough sketch of such a recipe appears after this list).
  • Model Preparation Details: The example highlights the necessity of manually modifying the forward function in the modeling_internvl_chat.py file and placing a chat_template.jinja file in the local model directory for successful quantization.
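
For orientation, here is a rough sketch of what an FP8 weight/activation/KV-cache recipe with those ignore patterns could look like, following the string-recipe and `oneshot` pattern used in other llm-compressor examples. The model path, calibration dataset, and `data_collator` below are placeholders standing in for the ones defined in `internvl3_example.py`, so treat this as an outline rather than the exact code added by this PR.

```python
import torch
from transformers import AutoModel, AutoTokenizer

from llmcompressor import oneshot

# Placeholder: a local copy of OpenGVLab/InternVL3-8B with the patched forward().
MODEL_ID = "./InternVL3-8B"

model = AutoModel.from_pretrained(
    MODEL_ID, torch_dtype=torch.bfloat16, trust_remote_code=True
)
tokenizer = AutoTokenizer.from_pretrained(MODEL_ID, trust_remote_code=True)

# FP8 weights + input activations + FP8 KV cache, skipping lm_head and the
# mlp1 multimodal projector, matching the ignore patterns described above.
recipe = """
quant_stage:
    quant_modifiers:
        QuantizationModifier:
            ignore: ["re:.*lm_head", "re:mlp1.*"]
            config_groups:
                group_0:
                    weights:
                        num_bits: 8
                        type: float
                        strategy: tensor
                        dynamic: false
                        symmetric: true
                    input_activations:
                        num_bits: 8
                        type: float
                        strategy: tensor
                        dynamic: false
                        symmetric: true
                    targets: ["Linear"]
            kv_cache_scheme:
                num_bits: 8
                type: float
                strategy: tensor
                dynamic: false
                symmetric: true
"""

oneshot(
    model=model,
    dataset=calibration_dataset,   # placeholder: built elsewhere in the script
    recipe=recipe,
    max_seq_length=2048,
    num_calibration_samples=512,
    data_collator=data_collator,   # placeholder: the InternVL3-specific collator
)

model.save_pretrained("InternVL3-8B-FP8", save_compressed=True)
tokenizer.save_pretrained("InternVL3-8B-FP8")
```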
Contributor

@gemini-code-assist bot left a comment

Code Review

This pull request introduces a new example for quantizing the InternVL3-8B model, which is a valuable addition. The submission includes a README file with instructions and a corresponding Python script. My review focuses on improving the clarity, correctness, and maintainability of this new example.

The most critical issue is that the provided manual patch for the model's forward method appears to disable vision-processing capabilities. This is not explained and creates contradictions within the example, which is intended for a multimodal model. This could lead to significant confusion for users.

Other feedback includes correcting a typo and informal language in the documentation, removing unused code from the example script, and a small refactoring for efficiency and clarity. Addressing these points will make the example much more robust and easier for the community to use.

Comment on lines +10 to +88
```python
def forward(
    self,
    pixel_values: torch.FloatTensor,
    input_ids: torch.LongTensor = None,
    attention_mask: Optional[torch.Tensor] = None,
    position_ids: Optional[torch.LongTensor] = None,
    image_flags: Optional[torch.LongTensor] = None,
    past_key_values: Optional[List[torch.FloatTensor]] = None,
    labels: Optional[torch.LongTensor] = None,
    use_cache: Optional[bool] = None,
    output_attentions: Optional[bool] = None,
    output_hidden_states: Optional[bool] = None,
    return_dict: Optional[bool] = None,
) -> Union[Tuple, CausalLMOutputWithPast]:
    return_dict = return_dict if return_dict is not None else self.config.use_return_dict

    # image_flags = image_flags.squeeze(-1)
    input_embeds = self.language_model.get_input_embeddings()(input_ids).clone()

    # vit_embeds = self.extract_feature(pixel_values)
    # vit_embeds = vit_embeds[image_flags == 1]
    # vit_batch_size = pixel_values.shape[0]

    # B, N, C = input_embeds.shape
    # input_embeds = input_embeds.reshape(B * N, C)

    # if torch.distributed.is_initialized() and torch.distributed.get_rank() == 0:
    #     print(f'dynamic ViT batch size: {vit_batch_size}, images per sample: {vit_batch_size / B}, dynamic token length: {N}')

    # input_ids = input_ids.reshape(B * N)
    # selected = (input_ids == self.img_context_token_id)
    # try:
    #     input_embeds[selected] = input_embeds[selected] * 0.0 + vit_embeds.reshape(-1, C)
    # except Exception as e:
    #     vit_embeds = vit_embeds.reshape(-1, C)
    #     print(f'warning: {e}, input_embeds[selected].shape={input_embeds[selected].shape}, '
    #           f'vit_embeds.shape={vit_embeds.shape}')
    #     n_token = min(selected.sum(), vit_embeds.size(0))
    #     input_embeds[selected][:n_token] = input_embeds[selected][:n_token] * 0.0 + vit_embeds[:n_token]

    # input_embeds = input_embeds.reshape(B, N, C)

    outputs = self.language_model(
        inputs_embeds=input_embeds,
        attention_mask=attention_mask,
        position_ids=position_ids,
        past_key_values=past_key_values,
        use_cache=use_cache,
        output_attentions=output_attentions,
        output_hidden_states=output_hidden_states,
        return_dict=return_dict,
    )
    logits = outputs.logits

    loss = None
    if labels is not None:
        # Shift so that tokens < n predict n
        shift_logits = logits[..., :-1, :].contiguous()
        shift_labels = labels[..., 1:].contiguous()
        # Flatten the tokens
        loss_fct = CrossEntropyLoss()
        shift_logits = shift_logits.view(-1, self.language_model.config.vocab_size)
        shift_labels = shift_labels.view(-1)
        # Enable model parallelism
        shift_labels = shift_labels.to(shift_logits.device)
        loss = loss_fct(shift_logits, shift_labels)

    if not return_dict:
        output = (logits,) + outputs[1:]
        return (loss,) + output if loss is not None else output

    return CausalLMOutputWithPast(
        loss=loss,
        logits=logits,
        past_key_values=outputs.past_key_values,
        hidden_states=outputs.hidden_states,
        attentions=outputs.attentions,
    )
```
Contributor

high

This modified forward function has the vision processing logic commented out (lines 29-50). This means that during calibration, the model will not use the image data, even though the data loader prepares pixel_values. This is very confusing for an example about quantizing a multimodal model and contradicts later steps that evaluate multimodal performance.

Please add a clear explanation for why this is necessary. For example:

  • Is this a temporary workaround to quantize only the language model part first?
  • If so, how should a user proceed to quantize the vision part?
  • If the vision part is not being quantized, the evaluation steps should be adjusted accordingly.

Without this clarification, the example is misleading and users may not be able to correctly quantize their model.

Comment on lines +145 to +146
- `ignore: ["re:.*lm_head", "re:mlp1.*"]`: quantizing llm and vit.
- `ignore: ["re:.*lm_head", "re:mlp1.*", "re:vision_model.*"]`: quantizing llm only.
Contributor

high

The explanation for the recipe options is contradictory given the modification to the forward function. You state that ignore: ["re:.*lm_head", "re:mlp1.*"] is for 'quantizing llm and vit'. However, the patched forward function seems to skip the vision transformer (ViT) execution entirely.

This inconsistency can confuse users. Please clarify if the ViT is actually being quantized. If it's not, the explanation should be corrected.

- 1. Download OpenGVLab/InternVL3-8B from hf
- 2. Download [chat_template.jinja](https://hf-mirror.com/OpenGVLab/InternVL3_5-8B/blob/main/chat_template.jinja), and place it in the local model dir of OpenGVLab/InternVL3-8B
- 3. Replace the `forward` function in OpenGVLab/InternVL3-8B/modeling_internvl_chat.py with the code below.
PS:It is referred to [#1929](https://github.com/vllm-project/llm-compressor/issues/1929), but I don`t know why. Let me know if anyone knows.
Contributor

medium

There's a typo here ("don`t" should be "don't"). Also, the phrasing 'Let me know if anyone knows' is a bit informal for documentation. It would be better to state that the reason is under investigation or simply omit this sentence.

Suggested change
PS:It is referred to [#1929](https://github.com/vllm-project/llm-compressor/issues/1929), but I don`t know why. Let me know if anyone knows.
PS:It is referred to [#1929](https://github.com/vllm-project/llm-compressor/issues/1929), but I don't know why.

Comment on lines +100 to +106
```python
def load_image(image_file, input_size=448, max_num=12):
    image = Image.open(image_file).convert('RGB')
    transform = build_transform(input_size=input_size)
    images = dynamic_preprocess(image, image_size=input_size, use_thumbnail=True, max_num=max_num)
    pixel_values = [transform(image) for image in images]
    pixel_values = torch.stack(pixel_values)
    return pixel_values
```
Contributor

medium

The function load_image is defined here but it is never used in the script. The data_collator uses load_image_from_PIL instead. To improve code clarity and remove unused code, it's best to remove this function.

Comment on lines +10 to +16
# IMPORTANT: Before running this script, you must manually modify the
# `modeling_internvl_chat.py` file in your local copy of the `OpenGVLab/InternVL3-8B`
# model directory
# Replace the original `forward` method of the `InternVLChatModel` class with the
# version provided in `examples/multimodal_vision/internvl3_README.md`.
# And put the `chat_template.jinja` in your local copy of the `OpenGVLab/InternVL3-8B`
# model directory.
Collaborator

Rather than doing this manually, we might be able to replace modules with custom wrappers that handle it automatically; once that PR is in, we can follow its pattern:
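
(The pull request referenced above is not shown here. As a rough illustration of the wrapper idea only, one could patch the loaded model instance at runtime instead of editing modeling_internvl_chat.py on disk; `patched_forward` below is a hypothetical stand-in for the text-only forward quoted earlier, and `model` is assumed to be the already-loaded InternVLChatModel.)

```python
import types
from typing import Optional

import torch

def patched_forward(
    self,
    pixel_values: torch.FloatTensor = None,
    input_ids: torch.LongTensor = None,
    attention_mask: Optional[torch.Tensor] = None,
    image_flags: Optional[torch.LongTensor] = None,  # accepted but ignored here
    **kwargs,
):
    # Text-only path mirroring the patched forward() quoted earlier: embed the
    # input ids and run the language model directly, skipping the vision tower.
    inputs_embeds = self.language_model.get_input_embeddings()(input_ids).clone()
    return self.language_model(
        inputs_embeds=inputs_embeds,
        attention_mask=attention_mask,
        **kwargs,
    )

# Attach the patch to the loaded instance rather than editing the checkpoint files.
model.forward = types.MethodType(patched_forward, model)
```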

@BigFaceBoy
Author

I found out that there is a problem with this approach, so I'll close this PR.

@BigFaceBoy closed this on Nov 4, 2025