[InternVL3]Add internvl3 quantizing example #1977
Conversation
👋 Hi! Thank you for contributing to llm-compressor. Please add the ready label when the PR is ready for review. Note: this is required to complete the testing suite; please only add the label once the PR is code complete and local testing has been performed.
Summary of Changes

Hello @BigFaceBoy, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed! This pull request introduces a much-needed example for quantizing the InternVL3-8B multimodal model within the LLM Compressor framework. It provides a complete guide, from initial model setup and data preparation to defining the quantization recipe and evaluating the results. The addition aims to simplify the process for other users who wish to apply quantization to this specific model, sharing the author's successful approach to a previously challenging task.
Code Review
This pull request introduces a new example for quantizing the InternVL3-8B model, which is a valuable addition. The submission includes a README file with instructions and a corresponding Python script. My review focuses on improving the clarity, correctness, and maintainability of this new example.
The most critical issue is that the provided manual patch for the model's forward method appears to disable vision-processing capabilities. This is not explained and creates contradictions within the example, which is intended for a multimodal model. This could lead to significant confusion for users.
Other feedback includes correcting a typo and informal language in the documentation, removing unused code from the example script, and a small refactoring for efficiency and clarity. Addressing these points will make the example much more robust and easier for the community to use.
```python
def forward(
    self,
    pixel_values: torch.FloatTensor,
    input_ids: torch.LongTensor = None,
    attention_mask: Optional[torch.Tensor] = None,
    position_ids: Optional[torch.LongTensor] = None,
    image_flags: Optional[torch.LongTensor] = None,
    past_key_values: Optional[List[torch.FloatTensor]] = None,
    labels: Optional[torch.LongTensor] = None,
    use_cache: Optional[bool] = None,
    output_attentions: Optional[bool] = None,
    output_hidden_states: Optional[bool] = None,
    return_dict: Optional[bool] = None,
) -> Union[Tuple, CausalLMOutputWithPast]:
    return_dict = return_dict if return_dict is not None else self.config.use_return_dict

    # image_flags = image_flags.squeeze(-1)
    input_embeds = self.language_model.get_input_embeddings()(input_ids).clone()

    # vit_embeds = self.extract_feature(pixel_values)
    # vit_embeds = vit_embeds[image_flags == 1]
    # vit_batch_size = pixel_values.shape[0]

    # B, N, C = input_embeds.shape
    # input_embeds = input_embeds.reshape(B * N, C)

    # if torch.distributed.is_initialized() and torch.distributed.get_rank() == 0:
    #     print(f'dynamic ViT batch size: {vit_batch_size}, images per sample: {vit_batch_size / B}, dynamic token length: {N}')

    # input_ids = input_ids.reshape(B * N)
    # selected = (input_ids == self.img_context_token_id)
    # try:
    #     input_embeds[selected] = input_embeds[selected] * 0.0 + vit_embeds.reshape(-1, C)
    # except Exception as e:
    #     vit_embeds = vit_embeds.reshape(-1, C)
    #     print(f'warning: {e}, input_embeds[selected].shape={input_embeds[selected].shape}, '
    #           f'vit_embeds.shape={vit_embeds.shape}')
    #     n_token = min(selected.sum(), vit_embeds.size(0))
    #     input_embeds[selected][:n_token] = input_embeds[selected][:n_token] * 0.0 + vit_embeds[:n_token]

    # input_embeds = input_embeds.reshape(B, N, C)

    outputs = self.language_model(
        inputs_embeds=input_embeds,
        attention_mask=attention_mask,
        position_ids=position_ids,
        past_key_values=past_key_values,
        use_cache=use_cache,
        output_attentions=output_attentions,
        output_hidden_states=output_hidden_states,
        return_dict=return_dict,
    )
    logits = outputs.logits

    loss = None
    if labels is not None:
        # Shift so that tokens < n predict n
        shift_logits = logits[..., :-1, :].contiguous()
        shift_labels = labels[..., 1:].contiguous()
        # Flatten the tokens
        loss_fct = CrossEntropyLoss()
        shift_logits = shift_logits.view(-1, self.language_model.config.vocab_size)
        shift_labels = shift_labels.view(-1)
        # Enable model parallelism
        shift_labels = shift_labels.to(shift_logits.device)
        loss = loss_fct(shift_logits, shift_labels)

    if not return_dict:
        output = (logits,) + outputs[1:]
        return (loss,) + output if loss is not None else output

    return CausalLMOutputWithPast(
        loss=loss,
        logits=logits,
        past_key_values=outputs.past_key_values,
        hidden_states=outputs.hidden_states,
        attentions=outputs.attentions,
    )
```
This modified `forward` function has the vision processing logic commented out (lines 29-50). This means that during calibration, the model will not use the image data, even though the data loader prepares `pixel_values`. This is very confusing for an example about quantizing a multimodal model and contradicts later steps that evaluate multimodal performance.
Please add a clear explanation for why this is necessary. For example:
- Is this a temporary workaround to quantize only the language model part first?
- If so, how should a user proceed to quantize the vision part?
- If the vision part is not being quantized, the evaluation steps should be adjusted accordingly.
Without this clarification, the example is misleading and users may not be able to correctly quantize their model.
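As a point of reference, if image features are supposed to reach the language model during calibration, re-enabling the commented-out lines would look roughly like the sketch below. This is only a reconstruction from the commented code in the quoted patch, not something shipped in this PR:

```python
# Sketch: restores the vision path that the patch comments out, using only
# names that appear in the commented-out lines above (extract_feature,
# img_context_token_id, image_flags, ...).
image_flags = image_flags.squeeze(-1)
input_embeds = self.language_model.get_input_embeddings()(input_ids).clone()

vit_embeds = self.extract_feature(pixel_values)   # ViT features for every image tile
vit_embeds = vit_embeds[image_flags == 1]         # keep tiles belonging to real images

B, N, C = input_embeds.shape
input_embeds = input_embeds.reshape(B * N, C)
input_ids = input_ids.reshape(B * N)

# scatter the vision embeddings into the positions of the image context tokens
selected = (input_ids == self.img_context_token_id)
input_embeds[selected] = input_embeds[selected] * 0.0 + vit_embeds.reshape(-1, C)

input_embeds = input_embeds.reshape(B, N, C)
```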
- `ignore: ["re:.*lm_head", "re:mlp1.*"]`: quantizing llm and vit.
- `ignore: ["re:.*lm_head", "re:mlp1.*", "re:vision_model.*"]`: quantizing llm only.
The explanation for the recipe options is contradictory given the modification to the forward function. You state that `ignore: ["re:.*lm_head", "re:mlp1.*"]` is for 'quantizing llm and vit'. However, the patched `forward` function seems to skip the vision transformer (ViT) execution entirely.
This inconsistency can confuse users. Please clarify if the ViT is actually being quantized. If it's not, the explanation should be corrected.
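For concreteness, the "llm only" variant of the recipe would look roughly like the sketch below. `model` and `ds` stand in for the model and calibration dataset prepared elsewhere in the example script (not shown here), and the `W4A16` scheme is illustrative rather than taken from this PR:

```python
from llmcompressor import oneshot
from llmcompressor.modifiers.quantization import GPTQModifier

# Quantize only the language model's Linear layers; the lm_head, the mlp1
# projector, and the vision tower are left in their original precision.
recipe = GPTQModifier(
    targets="Linear",
    scheme="W4A16",
    ignore=["re:.*lm_head", "re:mlp1.*", "re:vision_model.*"],
)

oneshot(
    model=model,                  # patched InternVL3 model loaded earlier (assumed)
    dataset=ds,                   # calibration dataset from the example script (assumed)
    recipe=recipe,
    max_seq_length=2048,
    num_calibration_samples=512,
)
```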
- 1. Download OpenGVLab/InternVL3-8B from hf
- 2. Download [chat_template.jinja](https://hf-mirror.com/OpenGVLab/InternVL3_5-8B/blob/main/chat_template.jinja), and place it in the local model dir of OpenGVLab/InternVL3-8B
- 3. Replace the `forward` function in OpenGVLab/InternVL3-8B/modeling_internvl_chat.py with the code below.

PS:It is referred to [#1929](https://github.com/vllm-project/llm-compressor/issues/1929), but I don`t know why. Let me know if anyone knows.
There's a typo here ("don`t" should be "don't"). Also, the phrasing 'Let me know if anyone knows' is a bit informal for documentation. It would be better to state that the reason is under investigation or simply omit this sentence.
Suggested change:
- Before: PS:It is referred to [#1929](https://github.com/vllm-project/llm-compressor/issues/1929), but I don`t know why. Let me know if anyone knows.
- After: PS:It is referred to [#1929](https://github.com/vllm-project/llm-compressor/issues/1929), but I don't know why.
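As an aside on steps 1-2 of the quoted setup, both downloads can be scripted. A sketch using `huggingface_hub` is below; it pulls from the canonical Hub rather than the mirror linked above, and the local directory name is just a placeholder:

```python
from huggingface_hub import hf_hub_download, snapshot_download

# Step 1: fetch the full OpenGVLab/InternVL3-8B repo into a local directory
# (the "./InternVL3-8B" path is a placeholder).
local_dir = snapshot_download("OpenGVLab/InternVL3-8B", local_dir="./InternVL3-8B")

# Step 2: grab chat_template.jinja from the InternVL3_5-8B repo and place it
# in the same local model directory, as the README step describes.
hf_hub_download(
    repo_id="OpenGVLab/InternVL3_5-8B",
    filename="chat_template.jinja",
    local_dir=local_dir,
)
```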
```python
def load_image(image_file, input_size=448, max_num=12):
    image = Image.open(image_file).convert('RGB')
    transform = build_transform(input_size=input_size)
    images = dynamic_preprocess(image, image_size=input_size, use_thumbnail=True, max_num=max_num)
    pixel_values = [transform(image) for image in images]
    pixel_values = torch.stack(pixel_values)
    return pixel_values
```
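Typical usage of this helper during calibration might look like the snippet below, assuming the helpers it relies on (`build_transform`, `dynamic_preprocess`) are in scope and that the image path, dtype, and device placement are placeholders:

```python
import torch

# "sample.jpg" is a placeholder path; max_num=12 matches the default above.
pixel_values = load_image("sample.jpg", input_size=448, max_num=12)
pixel_values = pixel_values.to(torch.bfloat16).cuda()
print(pixel_values.shape)  # (num_tiles, 3, 448, 448)
```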
```python
# IMPORTANT: Before running this script, you must manually modify the
# `modeling_internvl_chat.py` file in your local copy of the `OpenGVLab/InternVL3-8B`
# model directory.
# Replace the original `forward` method of the `InternVLChatModel` class with the
# version provided in `examples/multimodal_vision/internvl3_README.md`.
# Also put the `chat_template.jinja` file in your local copy of the `OpenGVLab/InternVL3-8B`
# model directory.
```
Rather than manually doing this, we might be able to replace modules with custom wrappers to handle this automatically; once this PR is in, we can follow its pattern:
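Until such a wrapper pattern exists, one alternative to hand-editing the downloaded `modeling_internvl_chat.py` is to bind a replacement `forward` at runtime. The sketch below is hypothetical: `patched_forward` is a stand-in for the modified body quoted in the README, and the load arguments are illustrative:

```python
import types

import torch
from transformers import AutoModel


def patched_forward(self, pixel_values=None, input_ids=None, image_flags=None, **kwargs):
    # Hypothetical stand-in for the README's modified forward: drop the vision
    # inputs and feed text embeddings straight into the language model.
    inputs_embeds = self.language_model.get_input_embeddings()(input_ids).clone()
    return self.language_model(inputs_embeds=inputs_embeds, **kwargs)


# InternVL3 ships custom modeling code, hence trust_remote_code=True (illustrative).
model = AutoModel.from_pretrained(
    "OpenGVLab/InternVL3-8B",
    trust_remote_code=True,
    torch_dtype=torch.bfloat16,
)

# Bind the patched method on this instance instead of editing modeling_internvl_chat.py.
model.forward = types.MethodType(patched_forward, model)
```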
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
Signed-off-by: xuwei fang <977502733@qq.com>
I found that there is a problem with this approach, so I'll close this PR.
SUMMARY:
LLM Compressor doesn't currently have any examples for InternVL3. It truly took me a lot of time to quantize it successfully, so I want to share this example.
TEST PLAN:
"please outline how the changes were tested"