Fix TorchAOConfig skip layers #19147

Closed · wants to merge 9 commits

Conversation

@mobicham (Contributor) commented Jun 4, 2025

Purpose

The goal of this small PR is to fix loading torchao models in which not all of the layers have been quantized. The current implementation doesn't keep track of the skipped layers listed in config["modules_to_not_convert"], so loading a quantized VL model whose vision head is left unquantized crashes.

This PR also includes a quick fix that improves loading speed by avoiding the creation of an nn.Linear with the full tensor shape.
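
As a rough illustration of that speed-up (a hedged sketch only, not the actual vLLM implementation; the real torchao_quantize_param_data signature may differ):

import torch
from torchao.quantization import quantize_

def torchao_quantize_param_data(param: torch.Tensor, torchao_config) -> torch.nn.Parameter:
    # Build a tiny dummy module instead of allocating an nn.Linear with the
    # full (out_features, in_features) shape, then pretend it has the real
    # dimensions and let torchao quantize its weight in place.
    dummy = torch.nn.Linear(1, 1, bias=False)
    dummy.in_features = param.shape[1]
    dummy.out_features = param.shape[0]
    dummy.weight = torch.nn.Parameter(param, requires_grad=False)
    quantize_(dummy, torchao_config)
    return dummy.weight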

Test Plan

Dependencies

# Needs ao from this PR: https://github.com/pytorch/ao/pull/2288
USE_CPP=0 pip install git+https://github.com/mobicham/ao -v --no-build-isolation --use-pep517
pip install git+https://github.com/mobiusml/gemlite/

Code

import torch
from vllm import LLM
from vllm.sampling_params import SamplingParams

# TorchAO-quantized Qwen2.5-VL checkpoint whose vision tower is left
# unquantized (listed in config["modules_to_not_convert"]).
model_id = "mobiuslabsgmbh/Qwen2.5-VL-7B-Instruct_gemlite-ao_a16w4_gs_128_pack_32bit"
processor_args = {
    'limit_mm_per_prompt': {"image": 3},
    'mm_processor_kwargs': {"min_pixels": 28 * 28, "max_pixels": 1280 * 28 * 28},
    'disable_mm_preprocessor_cache': False,
}

llm = LLM(model=model_id, gpu_memory_utilization=0.9, dtype=torch.float16, max_model_len=4096,
          max_num_batched_tokens=4096, **processor_args)

Test Result

The model should load successfully.

github-actions bot commented Jun 4, 2025

👋 Hi! Thank you for contributing to the vLLM project.

💬 Join our developer Slack at https://slack.vllm.ai to discuss your PR in #pr-reviews, coordinate on features in #feat- channels, or join special interest groups in #sig- channels.

Just a reminder: PRs do not trigger a full CI run by default. Instead, only the fastcheck CI runs, covering a small, essential subset of tests to catch errors quickly. You can run other CI tests on top of it by going to your fastcheck build on the Buildkite UI (linked in the PR checks section) and unblocking them. If you do not have permission to unblock, ping simon-mo or khluu to add you to our Buildkite org.

Once the PR is approved and ready to go, your PR reviewer(s) can run CI to test the changes comprehensively before merging.

To run CI, PR reviewers can either add the ready label to the PR or enable auto-merge.

🚀

@gemini-code-assist bot (Contributor) left a comment

Hello @mobicham, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

Summary of Changes

Hello team, gemini-code-assist here to provide a summary of this pull request. The primary goal of this PR, as described by the author @mobicham, is to address an issue when loading TorchAO quantized models where not all layers have been quantized. Specifically, it aims to fix crashes that occur in models like quantized VL models where components like the vision head are intentionally left unquantized, by ensuring that the modules_to_not_convert configuration is correctly respected during the quantization process within vLLM. Additionally, the PR includes a small optimization to improve model loading speed by avoiding the creation of a large dummy torch.nn.Linear layer during quantization setup.

Highlights

  • Fix Partial Quantization Loading: The pull request fixes an issue where vLLM failed to correctly handle TorchAO models that were only partially quantized, leading to crashes when loading models with skipped layers (specified in modules_to_not_convert).
  • Respect modules_to_not_convert: The TorchAOConfig now correctly reads and utilizes the modules_to_not_convert list from the quantization configuration to identify and skip modules that should not be quantized.
  • Loading Speed Optimization: A minor optimization was introduced in the torchao_quantize_param_data function to improve loading speed by creating a small dummy torch.nn.Linear(1, 1) module instead of one with the full tensor dimensions when preparing parameters for quantization.

Changelog

  • vllm/model_executor/layers/quantization/torchao.py
    • Added skip_modules attribute to TorchAOConfig to store the list of modules to exclude from quantization.
    • Modified TorchAOConfig.from_config to extract the modules_to_not_convert list from the input configuration and pass it to the constructor.
    • Updated TorchAOConfig.get_quant_method to check whether the current module's fully qualified name (prefix) matches an entry in skip_modules. If it does, the method now returns UnquantizedLinearMethod, ensuring the module is left unquantized (see the sketch after this list).
    • Refactored torchao_quantize_param_data to create a torch.nn.Linear(1, 1) dummy module and set its in_features and out_features attributes, instead of creating a large torch.nn.Linear with the full parameter shape, which is more efficient.
    • Applied minor formatting changes to function signatures and imports.
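
As a rough sketch of the skip logic described above (hypothetical and simplified; the actual class in vllm/model_executor/layers/quantization/torchao.py and its matching rules may differ):

from vllm.model_executor.layers.linear import UnquantizedLinearMethod

class TorchAOConfigSketch:
    """Hypothetical, trimmed-down illustration; not the actual vLLM class."""

    def __init__(self, torchao_config, skip_modules=None):
        self.torchao_config = torchao_config
        # Fully qualified names taken from config["modules_to_not_convert"].
        self.skip_modules = skip_modules or []

    def get_quant_method(self, layer, prefix: str):
        # A skipped module falls back to the plain, unquantized linear method.
        if any(fqn in prefix for fqn in self.skip_modules):
            return UnquantizedLinearMethod()
        return None  # the real code returns the TorchAO-backed method here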

Layers skipped with care,
Quantization finds its way,
Speed up the load time.


@gemini-code-assist bot (Contributor) left a comment

Code Review

Overall, this pull request effectively addresses the issue of loading TorchAO models with skipped layers and includes a performance optimization for model loading. The changes are well-aligned with the PR description and test plan. The code is generally readable and follows common Python practices (implicitly aligning with principles found in guides like PEP 8).

Summary of Findings

  • Handling skipped modules: The PR correctly adds logic to the TorchAOConfig to store and check for modules that should not be converted, ensuring that these layers are handled as unquantized linear methods.
  • Efficiency improvement in torchao_quantize_param_data: The change to create a smaller dummy nn.Linear layer (1,1) and then manually set in_features and out_features is a clever optimization aimed at improving model loading speed, as noted in the PR description.
  • Potential KeyError in from_config: The from_config method assumes the presence of the "modules_to_not_convert" key in the input configuration dictionary. If this key is missing, it will result in a KeyError (a defensive sketch follows below).
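
A hedged sketch of one defensive way to read the key (a hypothetical helper, not code from the PR):

def read_skip_modules(quant_config: dict) -> list[str]:
    # Using .get() with a default tolerates configs that omit the key entirely,
    # avoiding the KeyError described above.
    return quant_config.get("modules_to_not_convert") or []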

Merge Readiness

The pull request implements the core fixes and optimizations as described. There is one medium severity issue identified regarding the potential KeyError if the "modules_to_not_convert" key is missing from the config. I recommend addressing this issue before merging. I am unable to approve this pull request; please have other reviewers review and approve this code before merging.

@houseroad (Collaborator) commented:

cc: @jerryzh168

@houseroad (Collaborator) left a comment:

Can we add some unit tests?

@mobicham (Contributor Author) commented Jun 5, 2025

Can we add some unit tests?

@houseroad
I added a Qwen-VL loading test: 3690738
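
As a hedged sketch of what such a loading test might look like (the actual test added in commit 3690738 may differ; vllm_runner here refers to vLLM's shared pytest fixture):

import pytest
import torch

MODEL_ID = "mobiuslabsgmbh/Qwen2.5-VL-7B-Instruct_gemlite-ao_a16w4_gs_128_pack_32bit"

@pytest.mark.skipif(not torch.cuda.is_available(), reason="needs a GPU")
def test_torchao_partially_quantized_model_loads(vllm_runner):
    # The vision tower of this checkpoint is unquantized; loading crashed
    # before skip_modules was respected.
    with vllm_runner(MODEL_ID, dtype="float16", max_model_len=4096) as vllm_model:
        outputs = vllm_model.generate_greedy(["Describe quantization in one sentence."], max_tokens=8)
    assert outputs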

self.torchao_config = torchao_config
self.skip_modules = skip_modules or []
@mobicham (Contributor Author):

I can add that too, lemme take a look at the configs

@mobicham (Contributor Author):

Hmm, this is kinda problematic: this example only skips quant for q_proj, but in vLLM QKV are merged, so it's not gonna work. I can still add this logic for layers that are loaded exactly as they are in vLLM (like o_proj).

@mobicham (Contributor Author):

Added here: c4201b6

You can test it with this code; it was breaking before this commit:

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, TorchAoConfig
from torchao.quantization import Int4WeightOnlyConfig, ModuleFqnToConfig

model_id = "meta-llama/Llama-3.1-8B-Instruct"

# Quantize everything with int4 weight-only except two o_proj layers,
# which are explicitly left unquantized (config set to None).
config = Int4WeightOnlyConfig(group_size=128)
quant_config = ModuleFqnToConfig({
    "_default": config,
    "model.layers.0.self_attn.o_proj": None,
    "model.layers.13.self_attn.o_proj": None,
})
quantization_config = TorchAoConfig(quant_type=quant_config)
quantized_model = AutoModelForCausalLM.from_pretrained(
    model_id, device_map="auto", torch_dtype=torch.bfloat16,
    quantization_config=quantization_config)

print("quantized model:", quantized_model)
tokenizer = AutoTokenizer.from_pretrained(model_id)

q_model_id = 'quant_model_test'
quantized_model.save_pretrained(q_model_id, safe_serialization=False)
tokenizer.save_pretrained(q_model_id)
######################################################################################

# Load the partially quantized checkpoint in vLLM.
from vllm import LLM
from vllm.sampling_params import SamplingParams

llm = LLM(model="quant_model_test", dtype=torch.bfloat16)
sampling_params = SamplingParams(max_tokens=1024, temperature=0.5, repetition_penalty=1.1, ignore_eos=False)
messages = [
    {"content": "You are a helpful assistant", "role": "system"},
    {"content": "Solve this equation x^2 + 1 = -1.", "role": "user"},
]
outputs = llm.chat(messages, sampling_params, chat_template=llm.get_tokenizer().chat_template)
print(outputs[0].outputs[0].text)

Contributor:

Hmm, this is kinda problematic: this example only skips quant for q_proj, but in vLLM QKV are merged, so it's not gonna work. I can still add this logic for layers that are loaded exactly as they are in vLLM (like o_proj).

Not sure I follow: are you saying that both skipped_modules and ModuleFqnToConfig can't handle this, or that skipped_modules can handle it but ModuleFqnToConfig can't?

It seems to me that the functionality of skipped_modules is already covered by ModuleFqnToConfig, and that configuration is serializable for transformers models (e.g. https://huggingface.co/pytorch/Phi-4-mini-instruct-8da4w/blob/main/config.json#L102), so users don't need to worry about extra configs. I think we should just use that.

@mobicham (Contributor Author):

That's not the issue; the commit above handles that. The issue is that, in vLLM, q_proj + k_proj + v_proj are merged into a single tensor, so they all need to use the same quant config, or all be None (the same applies to the MLP). Otherwise the logic works, as you can see in the example above with o_proj.
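
For illustration, a hedged sketch of a config that stays consistent with vLLM's merged QKV projection (layer index and module names are arbitrary examples): skip q_proj, k_proj, and v_proj of the same layer together rather than just one of them.

from torchao.quantization import Int4WeightOnlyConfig, ModuleFqnToConfig

base = Int4WeightOnlyConfig(group_size=128)

# vLLM fuses q_proj/k_proj/v_proj into one QKVParallelLinear tensor, so either
# all three share the same config or all three are skipped (set to None).
quant_config = ModuleFqnToConfig({
    "_default": base,
    "model.layers.0.self_attn.q_proj": None,
    "model.layers.0.self_attn.k_proj": None,
    "model.layers.0.self_attn.v_proj": None,
})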

Contributor:

so they all need to use the same quant config, or all be None (the same applies to the MLP).

How is this enforced? Are you saying the torchao config should be aware that q_proj, k_proj, and v_proj are connected and set them to use the same config automatically?

@jerryzh168 (Contributor) left a comment:

Oh, this is loading a model configured with modules_to_not_convert; that makes sense then.

  if c is not None:
-     current_torchao_config = TorchAOConfig(c)
+     current_torchao_config = TorchAOConfig(c, self.skip_modules)
Contributor:

Do we need to pass skip_modules here? It's already been used in L86, right?

mergify bot commented Jun 6, 2025

This pull request has merge conflicts that must be resolved before it can be
merged. Please rebase the PR, @mobicham.

https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/working-with-forks/syncing-a-fork

mergify bot added the needs-rebase label · Jun 6, 2025
@jerryzh168 (Contributor) commented:

Please fix pre-commit and DCO.

mergify bot removed the needs-rebase label · Jun 6, 2025
Signed-off-by: mobicham <hicham@mobiuslabs.com>
mergify bot added the documentation, ci/build, frontend, multi-modality, v1, tpu, and tool-calling labels · Jun 6, 2025
mergify bot commented Jun 6, 2025

This pull request has merge conflicts that must be resolved before it can be
merged. Please rebase the PR, @mobicham.

https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/working-with-forks/syncing-a-fork

@mobicham (Contributor Author) commented Jun 6, 2025

Damn, this DCO rebase screwed things up, sorry folks. I will close this PR and open a new one to avoid rebase issues. Moving this to #19265.

Labels: ci/build, documentation, frontend, multi-modality, tool-calling, v1

Projects: Status: Done

3 participants