Fix TorchAOConfig skip layers #19147
Conversation
👋 Hi! Thank you for contributing to the vLLM project. 💬 Join our developer Slack at https://slack.vllm.ai to discuss your PR in #pr-reviews, coordinate on features in #feat- channels, or join special interest groups in #sig- channels. Just a reminder: PRs do not trigger a full CI run by default; only a limited subset of checks runs automatically. Once the PR is approved and ready to go, your PR reviewer(s) can run CI to test the changes comprehensively before merging. To run CI, PR reviewers can either: Add 🚀
Hello @mobicham, I'm Gemini Code Assist¹! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!
Summary of Changes
Hello team, gemini-code-assist here to provide a summary of this pull request. The primary goal of this PR, as described by the author @mobicham, is to address an issue when loading TorchAO quantized models where not all layers have been quantized. Specifically, it aims to fix crashes that occur in models like quantized VL models where components like the vision head are intentionally left unquantized, by ensuring that the modules_to_not_convert
configuration is correctly respected during the quantization process within vLLM. Additionally, the PR includes a small optimization to improve model loading speed by avoiding the creation of a large dummy torch.nn.Linear
layer during quantization setup.
Highlights
- Fix Partial Quantization Loading: The pull request fixes an issue where vLLM failed to correctly handle TorchAO models that were only partially quantized, leading to crashes when loading models with skipped layers (specified in modules_to_not_convert).
- Respect modules_to_not_convert: The TorchAOConfig now correctly reads and utilizes the modules_to_not_convert list from the quantization configuration to identify and skip modules that should not be quantized (a sketch of this check follows this list).
- Loading Speed Optimization: A minor optimization was introduced in the torchao_quantize_param_data function to improve loading speed by creating a small dummy torch.nn.Linear(1, 1) module instead of one with the full tensor dimensions when preparing parameters for quantization.
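To make the skip check concrete, here is a minimal, self-contained sketch of the behavior described above. The stub class names and the exact matching rule are assumptions for illustration only, not the actual vLLM implementation in vllm/model_executor/layers/quantization/torchao.py:

# Minimal, self-contained sketch; stub classes stand in for vLLM's real
# quantization-method classes and are not the actual implementation.
class UnquantizedLinearMethodStub:
    """Stands in for vLLM's UnquantizedLinearMethod (layer left unquantized)."""

class TorchAOLinearMethodStub:
    """Stands in for the torchao-backed quantized linear method."""
    def __init__(self, config):
        self.config = config

class TorchAOConfigSketch:
    def __init__(self, torchao_config, skip_modules=None):
        self.torchao_config = torchao_config
        # Populated from config["modules_to_not_convert"] in from_config.
        self.skip_modules = skip_modules or []

    def get_quant_method(self, layer, prefix: str):
        # `prefix` is the module's fully qualified name; a skipped module
        # falls back to the unquantized path instead of being quantized.
        if prefix in self.skip_modules:
            return UnquantizedLinearMethodStub()
        return TorchAOLinearMethodStub(self)

cfg = TorchAOConfigSketch(torchao_config=None,
                          skip_modules=["model.layers.0.self_attn.o_proj"])
print(type(cfg.get_quant_method(None, "model.layers.0.self_attn.o_proj")).__name__)  # UnquantizedLinearMethodStub
print(type(cfg.get_quant_method(None, "model.layers.1.self_attn.o_proj")).__name__)  # TorchAOLinearMethodStub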
Changelog
- vllm/model_executor/layers/quantization/torchao.py
  - Added skip_modules attribute to TorchAOConfig to store the list of modules to exclude from quantization.
  - Modified TorchAOConfig.from_config to extract the modules_to_not_convert list from the input configuration and pass it to the constructor.
  - Updated TorchAOConfig.get_quant_method to check if the current module's fully qualified name (prefix) is in the skip_modules list. If a match is found, it now returns UnquantizedLinearMethod, ensuring the module is not quantized.
  - Refactored torchao_quantize_param_data to create a torch.nn.Linear(1, 1) dummy module and set its in_features and out_features attributes, instead of creating a large torch.nn.Linear with the full parameter shape, which is more efficient (see the sketch after this changelog).
  - Applied minor formatting changes to function signatures and imports.
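A rough, hedged sketch of the torchao_quantize_param_data refactor described in the last bullet; the function name, signature, and quantize_fn callback below are illustrative stand-ins, not the real helper:

import torch

def quantize_param_sketch(param: torch.Tensor, quantize_fn):
    # Build a tiny 1x1 Linear instead of allocating one with the full
    # parameter shape, then patch its metadata and weight before quantizing.
    dummy = torch.nn.Linear(1, 1, bias=False)
    dummy.in_features = param.shape[1]
    dummy.out_features = param.shape[0]
    dummy.weight = torch.nn.Parameter(param, requires_grad=False)
    quantize_fn(dummy)  # e.g. torchao's quantize_() applied to the dummy module
    return dummy.weight

# No-op stand-in for the real torchao quantize step, just to show the flow.
w = torch.randn(256, 128, dtype=torch.bfloat16)
print(quantize_param_sketch(w, quantize_fn=lambda m: None).shape)  # torch.Size([256, 128])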
Using Gemini Code Assist
The full guide for Gemini Code Assist can be found on our documentation page, here are some quick tips.
Invoking Gemini
You can request assistance from Gemini at any point in your pull request by creating an issue comment (i.e. a comment on the pull request page) using either /gemini <command> or @gemini-code-assist <command>. Below is a summary of the supported commands.
| Feature | Command | Description |
| --- | --- | --- |
| Code Review | /gemini review | Performs a code review for the current pull request in its current state. |
| Pull Request Summary | /gemini summary | Provides a summary of the current pull request in its current state. |
| Comment | @gemini-code-assist | Responds in comments when explicitly tagged, both in issue comments and review comments. |
| Help | /gemini help | Displays a list of available commands. |
Customization
To customize Gemini Code Assist for GitHub experience, repository maintainers can create a configuration file and/or provide a custom code review style guide (such as PEP-8 for Python) by creating and adding files to a .gemini/
folder in the base of the repository. Detailed instructions can be found here.
Limitations & Feedback
Gemini Code Assist is currently in preview and may make mistakes. Please leave feedback on any instances where its feedback is incorrect or counterproductive. You can react with 👍 and 👎 on @gemini-code-assist comments to provide feedback.
You can also get AI-powered code generation, chat, as well as code reviews directly in the IDE at no cost with the Gemini Code Assist IDE Extension.
Layers skipped with care,
Quantization finds its way,
Speed up the load time.
Footnotes
1. Review the Privacy Notices, Generative AI Prohibited Use Policy, Terms of Service, and learn how to configure Gemini Code Assist in GitHub here. Gemini can make mistakes, so double-check it and use code with caution.
Code Review
Overall, this pull request effectively addresses the issue of loading TorchAO models with skipped layers and includes a performance optimization for model loading. The changes are well-aligned with the PR description and test plan. The code is generally readable and follows common Python practices (implicitly aligning with principles found in guides like PEP 8).
Summary of Findings
- Handling skipped modules: The PR correctly adds logic to the TorchAOConfig to store and check for modules that should not be converted, ensuring that these layers are handled as unquantized linear methods.
- Efficiency improvement in torchao_quantize_param_data: The change to create a smaller dummy nn.Linear(1, 1) layer and then manually set in_features and out_features is a clever optimization aimed at improving model loading speed, as noted in the PR description.
- Potential KeyError in from_config: The from_config method assumes the presence of the "modules_to_not_convert" key in the input configuration dictionary. If this key is missing, it will result in a KeyError (a defensive lookup is sketched below).
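A hedged sketch of the defensive lookup this finding suggests; the function name is illustrative and not the actual from_config code:

def read_skip_modules(hf_quant_config: dict) -> list:
    # Default to an empty list when "modules_to_not_convert" is absent or null,
    # instead of raising a KeyError on fully quantized checkpoints.
    return hf_quant_config.get("modules_to_not_convert") or []

print(read_skip_modules({}))                                      # []
print(read_skip_modules({"modules_to_not_convert": ["visual"]}))  # ['visual']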
Merge Readiness
The pull request implements the core fixes and optimizations as described. There is one medium-severity issue identified regarding the potential KeyError if the "modules_to_not_convert" key is missing from the config. I recommend addressing this issue before merging. I am unable to approve this pull request; please have other reviewers review and approve this code before merging.
cc: @jerryzh168
Can we add some unit tests?
@houseroad
self.torchao_config = torchao_config
self.skip_modules = skip_modules or []
can we use https://huggingface.co/docs/transformers/main/en/quantization/torchao#per-module-quantization to skip modules?
I can add that too, lemme take a look at the configs
Hmm this is kinda problematic, this example only skips quant for q_proj but in vLLM QKV are merged, so it's not gonna work. I can still add this logic for layers that are loaded exactly as they are in vLLM (like o_proj).
Added here: c4201b6
You can test it with this code, it was breaking before this commit:
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, TorchAoConfig
from torchao.quantization import Int4WeightOnlyConfig, ModuleFqnToConfig

model_id = "meta-llama/Llama-3.1-8B-Instruct"

# Quantize every linear layer except two o_proj layers, which are skipped (None).
config = Int4WeightOnlyConfig(group_size=128)
quant_config = ModuleFqnToConfig({"_default": config, "model.layers.0.self_attn.o_proj": None, "model.layers.13.self_attn.o_proj": None})
quantization_config = TorchAoConfig(quant_type=quant_config)
quantized_model = AutoModelForCausalLM.from_pretrained(model_id, device_map="auto", torch_dtype=torch.bfloat16, quantization_config=quantization_config)
print("quantized model:", quantized_model)
tokenizer = AutoTokenizer.from_pretrained(model_id)

# Save the partially quantized checkpoint so vLLM can load it.
q_model_id = 'quant_model_test'
quantized_model.save_pretrained(q_model_id, safe_serialization=False)
tokenizer.save_pretrained(q_model_id)
######################################################################################
# Load the saved checkpoint with vLLM; this step crashed before the commit above.
import torch
from vllm import LLM
from vllm.sampling_params import SamplingParams

llm = LLM(model="quant_model_test", dtype=torch.bfloat16)
sampling_params = SamplingParams(max_tokens=1024, temperature=0.5, repetition_penalty=1.1, ignore_eos=False)
messages = [{"content": "You are a helpful assistant", "role": "system"}, {"content": "Solve this equation x^2 + 1 = -1.", "role": "user"}]
outputs = llm.chat(messages, sampling_params, chat_template=llm.get_tokenizer().chat_template)
print(outputs[0].outputs[0].text)
Hmm this is kinda problematic, this example only skips quant for q_proj but in vLLM QKV are merged, so it's not gonna work. I can still add this logic for layers that are loaded exactly as they are in vLLM (like o_proj).
Not sure I follow: are you saying that neither skip_modules nor ModuleFqnToConfig can handle this, or that skip_modules can handle it but ModuleFqnToConfig can't?
It seems to me the functionality of skip_modules is already covered by ModuleFqnToConfig, and that configuration is serializable in transformers models (e.g. https://huggingface.co/pytorch/Phi-4-mini-instruct-8da4w/blob/main/config.json#L102), so users don't need to worry about extra configs. I think we should just use that.
That's not the issue; the commit above handles that. The issue is that, in vLLM, q_proj + k_proj + v_proj are merged into a single tensor, so they need to use the same quant config or all be None (same for the MLP). Otherwise the logic works, as you can see in the example above with o_proj.
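For illustration, a hedged sketch of a ModuleFqnToConfig that respects this constraint (module names assume a Llama-style layout and are not taken from the PR itself): all of q/k/v in a layer get the same setting, so the merged QKV tensor in vLLM sees one consistent choice.

from torchao.quantization import Int4WeightOnlyConfig, ModuleFqnToConfig

base = Int4WeightOnlyConfig(group_size=128)
# Skip q/k/v of layer 0 together; skipping only q_proj would conflict with the
# merged qkv_proj tensor in vLLM.
quant_config = ModuleFqnToConfig({
    "_default": base,
    "model.layers.0.self_attn.q_proj": None,
    "model.layers.0.self_attn.k_proj": None,
    "model.layers.0.self_attn.v_proj": None,
})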
so they need to use the same quant config or all None (same for mlp).
How is this enforced? Are you saying the torchao config should be aware that q_proj, k_proj, and v_proj are connected, and set them to use the same config automatically?
Oh, this is loading a model configured with modules_to_not_convert, then it makes sense.
    if c is not None:
-        current_torchao_config = TorchAOConfig(c)
+        current_torchao_config = TorchAOConfig(c, self.skip_modules)
Do we need to pass skip_modules here? It's already been used in L86, right?
This pull request has merge conflicts that must be resolved before it can be merged.
Please fix pre-commit and DCO.
Signed-off-by: mobicham <hicham@mobiuslabs.com>
This pull request has merge conflicts that must be resolved before it can be merged.
Damn, this DCO rebase screwed up stuff, sorry folks. I will close this PR and open a new one to avoid rebase issues.
Purpose
The goal of this small PR is to fix loading torchao models where not all the layers have been quantized. The current implementation doesn't keep track of the skipped layers defined in config["modules_to_not_convert"], so quantized VL models whose vision head is not quantized crash on load. It also includes a quick fix to improve loading speed by avoiding the creation of an nn.Linear with the full tensor shape.
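For reference, a hedged illustration of the config field involved; the values below are made up, and only the modules_to_not_convert key is the point:

# Illustrative only: the shape of the serialized quantization config vLLM reads;
# module names are examples, not from a specific checkpoint.
hf_quant_config = {
    "quant_method": "torchao",
    "modules_to_not_convert": ["visual", "lm_head"],
}
# The fix carries this list into TorchAOConfig so those layers stay unquantized.
print(hf_quant_config.get("modules_to_not_convert") or [])  # ['visual', 'lm_head']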
Test Plan
Dependencies
Code
Test Result
The model should load successfully.