Fix TorchAOConfig skip layers #19147

Closed · wants to merge 9 commits

Conversation

@mobicham (Contributor) commented Jun 4, 2025

Purpose

The goal of this small PR is to fix loading torchao models in which not all of the layers have been quantized. The current implementation doesn't keep track of the skipped layers listed in config["modules_to_not_convert"], so loading a quantized VL model whose vision head is left unquantized crashes.

This PR also includes a quick fix that improves loading speed by avoiding the creation of an nn.Linear with the full tensor shape.
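
As a rough illustration of that speed-up (a hedged sketch only, not the actual vLLM implementation; the real torchao_quantize_param_data signature may differ):

import torch
from torchao.quantization import quantize_

def torchao_quantize_param_data(param: torch.Tensor, torchao_config) -> torch.nn.Parameter:
    # Build a tiny dummy module instead of allocating an nn.Linear with the
    # full (out_features, in_features) shape, then pretend it has the real
    # dimensions and let torchao quantize its weight in place.
    dummy = torch.nn.Linear(1, 1, bias=False)
    dummy.in_features = param.shape[1]
    dummy.out_features = param.shape[0]
    dummy.weight = torch.nn.Parameter(param, requires_grad=False)
    quantize_(dummy, torchao_config)
    return dummy.weight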

Test Plan

Dependencies

# Needs ao from this PR: https://github.com/pytorch/ao/pull/2288
USE_CPP=0 pip install git+https://github.com/mobicham/ao -v --no-build-isolation --use-pep517
pip install git+https://github.com/mobiusml/gemlite/

Code

import torch
from vllm import LLM
from vllm.sampling_params import SamplingParams

# TorchAO-quantized Qwen2.5-VL checkpoint whose vision tower is left
# unquantized (listed in config["modules_to_not_convert"]).
model_id = "mobiuslabsgmbh/Qwen2.5-VL-7B-Instruct_gemlite-ao_a16w4_gs_128_pack_32bit"
processor_args = {
    'limit_mm_per_prompt': {"image": 3},
    'mm_processor_kwargs': {"min_pixels": 28 * 28, "max_pixels": 1280 * 28 * 28},
    'disable_mm_preprocessor_cache': False,
}

llm = LLM(model=model_id, gpu_memory_utilization=0.9, dtype=torch.float16, max_model_len=4096,
          max_num_batched_tokens=4096, **processor_args)

Test Result

The model should load successfully.

github-actions bot commented Jun 4, 2025

👋 Hi! Thank you for contributing to the vLLM project.

💬 Join our developer Slack at https://slack.vllm.ai to discuss your PR in #pr-reviews, coordinate on features in #feat- channels, or join special interest groups in #sig- channels.

Just a reminder: PRs do not trigger a full CI run by default. Instead, only the fastcheck CI runs, covering a small, essential subset of tests to catch errors quickly. You can run other CI tests on top of it by going to your fastcheck build on the Buildkite UI (linked in the PR checks section) and unblocking them. If you do not have permission to unblock, ping simon-mo or khluu to add you to our Buildkite org.

Once the PR is approved and ready to go, your PR reviewer(s) can run CI to test the changes comprehensively before merging.

To run CI, PR reviewers can either add the ready label to the PR or enable auto-merge.

🚀

@gemini-code-assist bot (Contributor) left a comment

Hello @mobicham, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

Summary of Changes

Hello team, gemini-code-assist here to provide a summary of this pull request. The primary goal of this PR, as described by the author @mobicham, is to address an issue when loading TorchAO quantized models where not all layers have been quantized. Specifically, it aims to fix crashes that occur in models like quantized VL models where components like the vision head are intentionally left unquantized, by ensuring that the modules_to_not_convert configuration is correctly respected during the quantization process within vLLM. Additionally, the PR includes a small optimization to improve model loading speed by avoiding the creation of a large dummy torch.nn.Linear layer during quantization setup.

Highlights

  • Fix Partial Quantization Loading: The pull request fixes an issue where vLLM failed to correctly handle TorchAO models that were only partially quantized, leading to crashes when loading models with skipped layers (specified in modules_to_not_convert).
  • Respect modules_to_not_convert: The TorchAOConfig now correctly reads and utilizes the modules_to_not_convert list from the quantization configuration to identify and skip modules that should not be quantized.
  • Loading Speed Optimization: A minor optimization was introduced in the torchao_quantize_param_data function to improve loading speed by creating a small dummy torch.nn.Linear(1, 1) module instead of one with the full tensor dimensions when preparing parameters for quantization.

Changelog

  • vllm/model_executor/layers/quantization/torchao.py
    • Added skip_modules attribute to TorchAOConfig to store the list of modules to exclude from quantization.
    • Modified TorchAOConfig.from_config to extract the modules_to_not_convert list from the input configuration and pass it to the constructor.
    • Updated TorchAOConfig.get_quant_method to check whether the current module's fully qualified name (prefix) matches an entry in skip_modules. If it does, the method now returns UnquantizedLinearMethod, ensuring the module is left unquantized (see the sketch after this list).
    • Refactored torchao_quantize_param_data to create a torch.nn.Linear(1, 1) dummy module and set its in_features and out_features attributes, instead of creating a large torch.nn.Linear with the full parameter shape, which is more efficient.
    • Applied minor formatting changes to function signatures and imports.
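
As a rough sketch of the skip logic described above (hypothetical and simplified; the actual class in vllm/model_executor/layers/quantization/torchao.py and its matching rules may differ):

from vllm.model_executor.layers.linear import UnquantizedLinearMethod

class TorchAOConfigSketch:
    """Hypothetical, trimmed-down illustration; not the actual vLLM class."""

    def __init__(self, torchao_config, skip_modules=None):
        self.torchao_config = torchao_config
        # Fully qualified names taken from config["modules_to_not_convert"].
        self.skip_modules = skip_modules or []

    def get_quant_method(self, layer, prefix: str):
        # A skipped module falls back to the plain, unquantized linear method.
        if any(fqn in prefix for fqn in self.skip_modules):
            return UnquantizedLinearMethod()
        return None  # the real code returns the TorchAO-backed method here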

Layers skipped with care,
Quantization finds its way,
Speed up the load time.


@gemini-code-assist bot (Contributor) left a comment

Code Review

Overall, this pull request effectively addresses the issue of loading TorchAO models with skipped layers and includes a performance optimization for model loading. The changes are well-aligned with the PR description and test plan. The code is generally readable and follows common Python practices (implicitly aligning with principles found in guides like PEP 8).

Summary of Findings

  • Handling skipped modules: The PR correctly adds logic to the TorchAOConfig to store and check for modules that should not be converted, ensuring that these layers are handled as unquantized linear methods.
  • Efficiency improvement in torchao_quantize_param_data: The change to create a smaller dummy nn.Linear layer (1,1) and then manually set in_features and out_features is a clever optimization aimed at improving model loading speed, as noted in the PR description.
  • Potential KeyError in from_config: The from_config method assumes the presence of the "modules_to_not_convert" key in the input configuration dictionary. If this key is missing, it will result in a KeyError (a defensive sketch follows below).
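
A hedged sketch of one defensive way to read the key (a hypothetical helper, not code from the PR):

def read_skip_modules(quant_config: dict) -> list[str]:
    # Using .get() with a default tolerates configs that omit the key entirely,
    # avoiding the KeyError described above.
    return quant_config.get("modules_to_not_convert") or []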

Merge Readiness

The pull request implements the core fixes and optimizations as described. There is one medium severity issue identified regarding the potential KeyError if the "modules_to_not_convert" key is missing from the config. I recommend addressing this issue before merging. I am unable to approve this pull request; please have other reviewers review and approve this code before merging.

@houseroad (Collaborator) commented:

cc: @jerryzh168

@houseroad (Collaborator) left a comment:

Can we add some unit tests?

@mobicham (Contributor Author) commented Jun 5, 2025

Can we add some unit tests?

@houseroad
I added a Qwen-VL loading test: 3690738
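
As a hedged sketch of what such a loading test might look like (the actual test added in commit 3690738 may differ; vllm_runner here refers to vLLM's shared pytest fixture):

import pytest
import torch

MODEL_ID = "mobiuslabsgmbh/Qwen2.5-VL-7B-Instruct_gemlite-ao_a16w4_gs_128_pack_32bit"

@pytest.mark.skipif(not torch.cuda.is_available(), reason="needs a GPU")
def test_torchao_partially_quantized_model_loads(vllm_runner):
    # The vision tower of this checkpoint is unquantized; loading crashed
    # before skip_modules was respected.
    with vllm_runner(MODEL_ID, dtype="float16", max_model_len=4096) as vllm_model:
        outputs = vllm_model.generate_greedy(["Describe quantization in one sentence."], max_tokens=8)
    assert outputs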

self.torchao_config = torchao_config
self.skip_modules = skip_modules or []
@mobicham (Contributor Author):

I can add that too, lemme take a look at the configs

@mobicham (Contributor Author):

Hmm, this is kinda problematic: this example only skips quant for q_proj, but in vLLM QKV are merged, so it's not gonna work. I can still add this logic for layers that are loaded exactly as they are in vLLM (like o_proj).

@mobicham (Contributor Author):

Added here: c4201b6

You can test it with this code; it was breaking before this commit:

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, TorchAoConfig
from torchao.quantization import Int4WeightOnlyConfig, ModuleFqnToConfig

model_id = "meta-llama/Llama-3.1-8B-Instruct"

# Quantize everything with int4 weight-only except two o_proj layers,
# which are explicitly left unquantized (config set to None).
config = Int4WeightOnlyConfig(group_size=128)
quant_config = ModuleFqnToConfig({
    "_default": config,
    "model.layers.0.self_attn.o_proj": None,
    "model.layers.13.self_attn.o_proj": None,
})
quantization_config = TorchAoConfig(quant_type=quant_config)
quantized_model = AutoModelForCausalLM.from_pretrained(
    model_id, device_map="auto", torch_dtype=torch.bfloat16,
    quantization_config=quantization_config)

print("quantized model:", quantized_model)
tokenizer = AutoTokenizer.from_pretrained(model_id)

q_model_id = 'quant_model_test'
quantized_model.save_pretrained(q_model_id, safe_serialization=False)
tokenizer.save_pretrained(q_model_id)
######################################################################################

# Load the partially quantized checkpoint in vLLM.
from vllm import LLM
from vllm.sampling_params import SamplingParams

llm = LLM(model="quant_model_test", dtype=torch.bfloat16)
sampling_params = SamplingParams(max_tokens=1024, temperature=0.5, repetition_penalty=1.1, ignore_eos=False)
messages = [
    {"content": "You are a helpful assistant", "role": "system"},
    {"content": "Solve this equation x^2 + 1 = -1.", "role": "user"},
]
outputs = llm.chat(messages, sampling_params, chat_template=llm.get_tokenizer().chat_template)
print(outputs[0].outputs[0].text)

Contributor:

Hmm, this is kinda problematic: this example only skips quant for q_proj, but in vLLM QKV are merged, so it's not gonna work. I can still add this logic for layers that are loaded exactly as they are in vLLM (like o_proj).

Not sure I follow: are you saying that both skipped_modules and ModuleFqnToConfig can't handle this, or that skipped_modules can handle it but ModuleFqnToConfig can't?

It seems to me that the functionality of skipped_modules is already covered by ModuleFqnToConfig, and that configuration is serializable for transformers models (e.g. https://huggingface.co/pytorch/Phi-4-mini-instruct-8da4w/blob/main/config.json#L102), so users don't need to worry about extra configs. I think we should just use that.

@mobicham (Contributor Author):

That's not the issue; the commit above handles that. The issue is that, in vLLM, q_proj + k_proj + v_proj are merged into a single tensor, so they all need to use the same quant config, or all be None (the same applies to the MLP). Otherwise the logic works, as you can see in the example above with o_proj.
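
For illustration, a hedged sketch of a config that stays consistent with vLLM's merged QKV projection (layer index and module names are arbitrary examples): skip q_proj, k_proj, and v_proj of the same layer together rather than just one of them.

from torchao.quantization import Int4WeightOnlyConfig, ModuleFqnToConfig

base = Int4WeightOnlyConfig(group_size=128)

# vLLM fuses q_proj/k_proj/v_proj into one QKVParallelLinear tensor, so either
# all three share the same config or all three are skipped (set to None).
quant_config = ModuleFqnToConfig({
    "_default": base,
    "model.layers.0.self_attn.q_proj": None,
    "model.layers.0.self_attn.k_proj": None,
    "model.layers.0.self_attn.v_proj": None,
})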

Contributor:

so they all need to use the same quant config, or all be None (the same applies to the MLP).

How is this enforced? Are you saying the torchao config should be aware that q_proj, k_proj, and v_proj are connected and set them to use the same config automatically?

@jerryzh168 (Contributor) left a comment:

Oh, this is loading a model configured with modules_to_not_convert; that makes sense then.

  if c is not None:
-     current_torchao_config = TorchAOConfig(c)
+     current_torchao_config = TorchAOConfig(c, self.skip_modules)
Contributor:

Do we need to pass skip_modules here? It's already been used in L86, right?

mergify bot commented Jun 6, 2025

This pull request has merge conflicts that must be resolved before it can be
merged. Please rebase the PR, @mobicham.

https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/working-with-forks/syncing-a-fork

mergify bot added the needs-rebase label · Jun 6, 2025
@jerryzh168 (Contributor) commented:

Please fix pre-commit and DCO.

mergify bot removed the needs-rebase label · Jun 6, 2025
Signed-off-by: mobicham <hicham@mobiuslabs.com>
mergify bot added the documentation, ci/build, frontend, multi-modality, v1, tpu, and tool-calling labels · Jun 6, 2025
mergify bot commented Jun 6, 2025

This pull request has merge conflicts that must be resolved before it can be
merged. Please rebase the PR, @mobicham.

https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/working-with-forks/syncing-a-fork

@mobicham (Contributor Author) commented Jun 6, 2025

Damn, this DCO rebase screwed things up, sorry folks. I will close this PR and open a new one to avoid rebase issues. Moving this to #19265.

Labels: ci/build, documentation, frontend, multi-modality, tool-calling, v1

Projects: Status: Done

3 participants