Intern s2 preview lite awq fix bug #4600
Open
43758726 wants to merge 10 commits into
Pull request overview
This PR updates LMDeploy Lite quantization/calibration paths to better support Qwen3.5 / InternS2Preview architectures and to improve AWQ usability (including a “data-free” mode), alongside a small VLM utility update and a batch-splitting fix.
Changes:
- Add InternS2Preview/Qwen3.5 model build support in the VLM wrapper and fix batch splitting for Qwen3.5 position_embeddings.
- Introduce lmdeploy.lite.model registry-based per-architecture helpers to drive skip patterns (and some MoE parameter rewrites), and propagate skip lists into quantization_config.
- Refactor calibration loading to return the resolved HF architecture and add a calib_samples=0 flow for data-free AWQ.
Reviewed changes
Copilot reviewed 12 out of 12 changed files in this pull request and generated 7 comments.
| File | Description |
|---|---|
| lmdeploy/vl/model/qwen3_5.py | Adds build_model() handling for Qwen3.5 and InternS2Preview VLM variants. |
| lmdeploy/lite/utils/batch_split.py | Adjusts splitting logic for Qwen3.5 position_embeddings tuple layout. |
| lmdeploy/lite/quantization/awq.py | Adds new skip-pattern plumbing and changes skip logic; extends layernorm mapping for Qwen3 MoE. |
| lmdeploy/lite/model/base.py | Introduces MODELS registry base helper for model-specific quantization support. |
| lmdeploy/lite/model/qwen.py | Registers Qwen3/Qwen3.5/InternS2Preview skip patterns and MoE conversion helper. |
| lmdeploy/lite/model/mixtral.py | Registers Mixtral helper and version-dependent skip patterns. |
| lmdeploy/lite/model/__init__.py | Initializes Lite model registry and imports registered helpers. |
| lmdeploy/lite/apis/smooth_quant.py | Threads trust_remote_code, consumes new calibrate return shape, and writes modules_to_not_convert. |
| lmdeploy/lite/apis/calibrate.py | Refactors model/tokenizer loading, expands supported model maps, and returns arch. |
| lmdeploy/lite/apis/auto_awq.py | Adds calib_samples=0 data-free mode and uses per-arch helpers/skip list propagation. |
| lmdeploy/cli/utils.py | Updates CLI help text to document --calib-samples 0. |
| lmdeploy/archs.py | Removes workspace (TurboMind converted model) shortcut from get_task(). |
Comments suppressed due to low confidence (1)
lmdeploy/archs.py:146
get_task() no longer handles local TurboMind converted/workspace model directories (typically containing triton_models/weights). Without this short-circuit, calling get_task() on a converted TurboMind model path will fall through to get_model_arch() and likely fail because there is no HF config to load. Please restore the workspace detection (or add equivalent handling in get_model_arch()) so converted TurboMind models continue to be recognized correctly.
    def get_task(backend: str, model_path: str, trust_remote_code: bool = False):
        """Get pipeline type and pipeline class from model config."""
        from lmdeploy.serve.core import AsyncEngine
        _, config = get_model_arch(model_path, trust_remote_code=trust_remote_code)
        if check_vl_llm(backend, config.to_dict()):
            from lmdeploy.serve.core import VLAsyncEngine
            return 'vlm', VLAsyncEngine
        # default task, pipeline_class
        return 'llm', AsyncEngine
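The short-circuit the comment asks for could look roughly like the sketch below. It assumes, as the comment states, that a converted workspace is a local directory containing triton_models/weights. The names `is_turbomind_workspace` and `get_task_sketch` are hypothetical, and the HF-config branch is replaced by a placeholder string so the example runs without LMDeploy installed.

```python
import os


def is_turbomind_workspace(model_path: str) -> bool:
    """Return True if model_path looks like a converted TurboMind workspace.

    Assumption: a workspace is a local directory that contains
    triton_models/weights, as described in the review comment.
    """
    return os.path.isdir(os.path.join(model_path, 'triton_models', 'weights'))


def get_task_sketch(model_path: str) -> str:
    """Hypothetical stand-in for get_task() that only returns the task name."""
    # Check for a workspace first, before anything tries to load an HF config.
    if is_turbomind_workspace(model_path):
        return 'llm'
    # Placeholder for the real path, which calls get_model_arch() and
    # check_vl_llm() on the loaded HF config.
    return 'needs-hf-config'
```

Placing the check ahead of the config lookup keeps the failure mode described in the comment from being reached for workspace paths at all.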
Comment on lines 19 to +23

    'Qwen2ForCausalLM': 'Qwen2DecoderLayer',
    'Qwen3ForCausalLM': 'Qwen3DecoderLayer',
    'Qwen3MoeForCausalLM': 'Qwen3MoeDecoderLayer',
    'Qwen3_5ForConditionalGeneration': 'Qwen3_5DecoderLayer',
    'Qwen3_5MoeForConditionalGeneration': 'Qwen3_5MoeDecoderLayer',

    'LlavaLlamaForCausalLM': 'LlamaDecoderLayer',
    'MGMLlamaForCausalLM': 'LlamaDecoderLayer',  # mini gemini
    'InternLMXComposer2ForCausalLM': 'InternLM2DecoderLayer',
    'InternS2PreviewForConditionalGeneration': 'InternS2PreviewDecoderLayer',

    'Qwen3MoeDecoderLayer': {
        'input_layernorm': ['self_attn.k_proj', 'self_attn.q_proj', 'self_attn.v_proj'],
        'post_attention_layernorm': ['mlp.gate_proj', 'mlp.up_proj']
    },
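To make the role of these two tables concrete, here is a minimal sketch of how an architecture name might be resolved to its decoder-layer class and then to the linear layers that consume each norm's output (the layers whose scales AWQ-style smoothing folds into that norm). The entries are copied from the snippets above; the dictionary names `LAYER_TYPE_MAP` and `NORM_FCS_MAP` and the function `linears_for_norm` are assumptions for illustration, not necessarily the names used in the PR.

```python
from typing import Dict, List

# Architecture name -> decoder layer class name (entry taken from the diff).
LAYER_TYPE_MAP: Dict[str, str] = {
    'Qwen3MoeForCausalLM': 'Qwen3MoeDecoderLayer',
}

# Decoder layer class name -> {norm name: linear layers fed by that norm}.
NORM_FCS_MAP: Dict[str, Dict[str, List[str]]] = {
    'Qwen3MoeDecoderLayer': {
        'input_layernorm': ['self_attn.k_proj', 'self_attn.q_proj', 'self_attn.v_proj'],
        'post_attention_layernorm': ['mlp.gate_proj', 'mlp.up_proj'],
    },
}


def linears_for_norm(arch: str, norm_name: str) -> List[str]:
    """Return the linear layers paired with norm_name for the given architecture."""
    if arch not in LAYER_TYPE_MAP:
        raise KeyError(f'unsupported architecture: {arch}')
    layer_type = LAYER_TYPE_MAP[arch]
    return NORM_FCS_MAP[layer_type][norm_name]
```

Keeping both lookups keyed by plain strings means a new architecture only needs two table entries, which matches the shape of the additions in this diff.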
Comment on lines +137 to +141

    """

    patterns.extend(SKIPPED_MODULE)

    def skipped_module(name: str):
        """Whether the module should be skipped from quantization."""
        for m in SKIPPED_MODULE:
            if m in name:
                return True
        return False
        return next(((True, pattern) for pattern in patterns if pattern in name), (False, None))
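The behaviour of the new `next(...)` return line can be checked in isolation. The sketch below wraps it in a hypothetical factory, `make_skip_checker`, and uses illustrative patterns rather than the real contents of SKIPPED_MODULE.

```python
from typing import Callable, List, Optional, Tuple


def make_skip_checker(patterns: List[str]) -> Callable[[str], Tuple[bool, Optional[str]]]:
    """Build a checker that reports whether a module name matches any pattern."""

    def skipped_module(name: str) -> Tuple[bool, Optional[str]]:
        # First pattern that occurs as a substring of name wins;
        # (False, None) is the default when nothing matches.
        return next(((True, p) for p in patterns if p in name), (False, None))

    return skipped_module


# Illustrative patterns only; the real list comes from the PR's helpers.
check = make_skip_checker(['lora', 'mlp.gate'])

check('model.layers.0.mlp.gate')          # (True, 'mlp.gate')
check('model.layers.0.mlp.gate_proj')     # (True, 'mlp.gate'), substring match
check('model.layers.0.self_attn.q_proj')  # (False, None)
```

Two things are worth noting under these assumptions. Matching is by substring, so a pattern such as 'mlp.gate' also matches 'mlp.gate_proj'. And the return type changes from bool to a tuple: a non-empty tuple is always truthy in Python, so any caller that still writes `if skipped_module(name):` would treat every module as skipped and needs to unpack the result instead.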
Comment on lines +183 to +186

    def get_task(backend: str, model_path: str):
        """Get pipeline type and pipeline class from model config."""

        _, config = get_model_arch(model_path)

    torch.cuda.empty_cache()
    patterns = []
    skipped_modules = []
    arch = model.config.architectures[0]

    @classmethod
    def skipped_modules(cls):
        pass
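A registry-driven version of this flow might be sketched as follows. Everything here is hypothetical: `Registry`, `BaseHelper`, `Qwen3MoeHelper`, `resolve_skip_patterns`, and the 'mlp.gate' pattern are stand-ins, and the dict-based registry is a simplification rather than the registry class the PR actually uses. One deliberate difference from the snippet above: the base `skipped_modules()` returns an empty list instead of falling through `pass`, because `pass` yields None and `patterns.extend(None)` raises a TypeError.

```python
from typing import Callable, Dict, List, Optional, Type


class Registry:
    """Minimal name -> class registry (a simplified, hypothetical stand-in)."""

    def __init__(self) -> None:
        self._modules: Dict[str, Type] = {}

    def register(self, name: str) -> Callable[[Type], Type]:
        def decorator(cls: Type) -> Type:
            self._modules[name] = cls
            return cls
        return decorator

    def get(self, name: str, default: Optional[Type] = None) -> Optional[Type]:
        return self._modules.get(name, default)


MODELS = Registry()


class BaseHelper:
    """Default helper: no architecture-specific skip patterns."""

    @classmethod
    def skipped_modules(cls) -> List[str]:
        # Return an empty list rather than None so callers can always extend().
        return []


@MODELS.register('Qwen3MoeForCausalLM')
class Qwen3MoeHelper(BaseHelper):
    @classmethod
    def skipped_modules(cls) -> List[str]:
        return ['mlp.gate']  # illustrative pattern, not taken from the PR


def resolve_skip_patterns(arch: str) -> List[str]:
    """Collect skip patterns for arch, falling back to the base helper."""
    helper = MODELS.get(arch, BaseHelper)
    patterns: List[str] = []
    patterns.extend(helper.skipped_modules())
    return patterns
```

With this shape, an unregistered architecture resolves to an empty pattern list instead of failing, and adding support for a new model is a matter of registering one more helper class.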
Thanks for your contribution; we appreciate it a lot. The following instructions will make your pull request healthier and help it receive feedback more easily. If you do not understand some items, don't worry: just open the pull request and ask the maintainers for help.
Motivation
Please describe the motivation of this PR and the goal you want to achieve through this PR.
Modification
Please briefly describe what modification is made in this PR.
BC-breaking (Optional)
Does the modification introduce changes that break the backward-compatibility of the downstream repositories?
If so, please describe how it breaks the compatibility and how the downstream projects should modify their code to keep compatibility with this PR.
Use cases (Optional)
If this PR introduces a new feature, it is better to list some use cases here, and update the documentation.
Checklist