[CORE] [QUANT] Support for GPTQModel's `dynamic` quantization per module override/control #7086

Qubitium · 2024-08-02T18:12:03Z

GPTQModel v0.9.10-dev0 main branch has merged dynamic, per layer/module support of different gptq bits, sym, desc_act using a regex style definition. This is a work in process and we are awaiting feedback before release. We are targeting both vllm and sglang compat with the quant so would like to work with vllm to see if what is the best way forward.

Previously a gptq model has a single config that applies to all layers and all modules within nested layers. This change allows pin-point targeting of different gptq quantization config for specific layers and/or specific modules within specific layers for better optimization.

Sample model: https://huggingface.co/ModelCloud/TinyLlama-1.1B-Chat-v1.0-dynamic-GPTQ-2024-8-3

full quant config for sample:

{
  "bits": 4,
  "dynamic": {
    ".*\\.(?:1[0-5])\\..*": {
      "bits": 8
    },
    ".*\\.(?:1[6-9]|20|21)\\..*": {
      "bits": 8,
      "group_size": 64
    }
  },
  "group_size": 128,
  "desc_act": true,
  "static_groups": false,
  "sym": true,
  "lm_head": false,
  "damp_percent": 0.005,
  "damp_auto_increment": 0.0015,
  "true_sequential": true,
  "model_name_or_path": "./test_dynamic_model",
  "model_file_base_name": "model",
  "quant_method": "gptq",
  "checkpoint_format": "gptq",
  "meta": {
    "quantizer": "gptqmodel:0.9.10-dev0"
  }
}

Dynamic config explained:

# sample tinyllama 1.1B model has 22 layers
# default is 4bit, group_size 128
# layer index start at 0

# last 1/2 of the layers 10-21 has 8bit vs 4bit for 0-9
# last 1/4 of the layers 16-21 has 8bit and group_size 64
dynamic = {
  # `.*\.` matches the layers_node prefix
  r".*\.(?:1[0-5])\..*": {"bits": 8,}, # match layer 10-15
  r".*\.(?:1[6-9]|20|21)\..*": {"bits": 8, "group_size": 64,}, # match layer 16-21
}

Same code to quantize using dynamic control: https://github.com/ModelCloud/GPTQModel/blob/main/tests/test_dynamic.py

Design choices:

Need a def table to notify quantizer (GPTQModel) and infer engine (vllm) which layers has dynamic (override) quant config.
Possible to generate a static all inclusive per layer/module def/table in json but content would not be human friendly as each nested layer with each nested module would need an entry. If a model has 44 layers and each layer has 6-8 modules, we are looking a t a 44x8 lines of json minimum.
GPTQModel decided on a design where a simple regex: str key mapped to dict[str, int or bool] for both quantization and model inference/loading. Multiple regex/dynamic pairs can be defined and for matching, the rules are looped and first one that match, is applied.
Upload loading, and looping over each layer/module, we check for dynamic (override) match and if matches, override the static quant config files for that layer/module.

Compat Notes:

dynamic config require that the model inference does not remerge the layers with different dyanmic/quant param values. MergedColumnParallel in Llama model in vllm for example merges mlp.gate and mlp.up. Dynamic override works but in this case, because they are fused/merged, these two layers must have exact same quant config values. Can't have one with 4bit and the other with 8bits.

TODO:

unit test
finalize design of loading so GPTQModel and vllm can agree, how to best pass/share dynamic layer/module quant override config via quantize_config json

cleanup cleanup cleanup

…_dynamic_bits

github-actions · 2024-08-02T18:12:16Z

👋 Hi! Thank you for contributing to the vLLM project.
Just a reminder: PRs would not trigger full CI run by default. Instead, it would only run fastcheck CI which consists a small and essential subset of CI tests to quickly catch errors. You can run other CI tests on top of default ones by unblocking the steps in your fast-check build on Buildkite UI.

Once the PR is approved and ready to go, please make sure to run full CI as it is required to merge (or just use auto-merge).

To run full CI, you can do one of these:

Comment /ready on the PR
Add ready label to the PR
Enable auto-merge.

🚀

mgoin · 2024-08-05T16:02:00Z

Hi @Qubitium thanks for sharing your interesting work!

We have the notion of variable quantization already in vLLM through our compressed-tensors integration. With this we can blend integer and float quantization of weights and/or activations within a single model config in a similar explicit target or regex manner. I recommend digging into the non-uniform support we already have for compressed-tensors and fp8 methods.
It would be interesting to see if your library could export into compressed-tensors format so it would work out-of-the-box in vLLM and Transformers!

Regarding merged layers, I think the performance and complexity cost of needing to support possibly unmerging layers like QKV or GateUp is too high. I want to recommend keeping the quantization level of merged layers the same so we (and several other inference engines) don't run into this issue.

If you are still open to editing your format, I also think dynamic isn't a clear term here since there is already the notion of static or dynamic quantization, which means something else. Also, the quantization isn't changing in any dynamic way. I would recommend using a name like non-uniform quantization, since we are not performing uniform quantization anymore but have settled on a non-uniform scheme.

Qubitium · 2024-08-05T17:07:03Z

@mgoin Wow, I totally missed this PR. After cursory check of the https://github.com/vllm-project/vllm/pull/6515/files PR, our pr is entirely redundant. The core concept is similar including re matching. The only little advantage of this pr, and very little at this point, is minimal code-change to bootstrap gptq flexible layer/module quant.

I will need to digest the vllm pr/unit tests to test with gptqmodel export. If gptq model can integrate with compressed_config protocol, then there is zero reason for this pr.

Regarding merged layers, I think the performance and complexity cost of needing to support possibly unmerging layers like >QKV or GateUp is too high. I want to recommend keeping the quantization level of merged layers the same so we (and >several other inference engines) don't run into this issue.

Yes, this our finding as well. Merged layers should retain the same scheme.

If you are still open to editing your format, I also think dynamic isn't a clear term here since there is already the notion of >static or dynamic quantization, which means something else. Also, the quantization isn't changing in any dynamic way. I >would recommend using a name like non-uniform quantization, since we are not performing uniform quantization ?>anymore but have settled on a non-uniform scheme.

I want the config to to be compatible to vllm/sglang, and since sglang for the most part re-uses/import vllm model weight/model layers. Do not want another protocol parser so if vllm compressed_config protocol works like I think it does, then this is good base moving forward for gptqmodel as well.

mergify · 2024-12-24T05:50:40Z

This pull request has merge conflicts that must be resolved before it can be
merged. Please rebase the PR, @Qubitium.

https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/working-with-forks/syncing-a-fork

Qubitium · 2025-02-11T03:55:53Z

@mgoin Fixed. Turned out we had to add # noqa: to both lines if a logical line was split into two lines.

Qubitium · 2025-02-11T05:53:47Z

@mgoin Not ready. Found bug during our internal testing. For non-Marlin kernel code path,dynamic is not correctly applied.

Signed-off-by: ZX-ModelCloud <zx@modelcloud.ai>

Qubitium · 2025-02-11T09:18:27Z

@mgoin Ready for re-view. Sorry this is taking too long. We want to get it right.

Changes since your last review:

Dynamic override get/logic moved to gptq_utils.py
Fixed gptq cuda kernel not applying dynamic correctly: previously only Marlin kernel was fully tested
Cleaned up related ci tests
Lint/pre-commit CI passing

mgoin

Thanks for fixing the issues, I think this should be good to go if the CI is green with the GPTQ changes.

vllm/model_executor/layers/quantization/utils/gptq_utils.py

Signed-off-by: ZX-ModelCloud <zx@modelcloud.ai>

mgoin · 2025-02-12T17:16:12Z

Thank you! The failing tests are unrelated so we will merge shortly

* [ROCM][AMD][TRITON] Halving warps number for fw_prefill to reduce spilling (vllm-project#12713) Signed-off-by: Aleksandr Malyshev <maleksan@amd.com> Co-authored-by: Aleksandr Malyshev <maleksan@amd.com> * Refactor `Linear` handling in `TransformersModel` (vllm-project#12727) Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com> * [VLM] Add MLA with pure RoPE support for deepseek-vl2 models (vllm-project#12729) * [Misc] Bump the compressed-tensors version (vllm-project#12736) * [Model][Quant] Fix GLM, Fix fused module mappings for quantization (vllm-project#12634) Signed-off-by: mgoin <michael@neuralmagic.com> Signed-off-by: Kyle Sayers <kylesayrs@gmail.com> Co-authored-by: mgoin <michael@neuralmagic.com> * [Doc] Update PR Reminder with link to Developer Slack (vllm-project#12748) * [Bugfix] Fix OpenVINO model runner (vllm-project#12750) * [V1][Misc] Shorten `FinishReason` enum and use constant strings (vllm-project#12760) * [Doc] Remove performance warning for auto_awq.md (vllm-project#12743) * [Bugfix] Fix 'ModuleNotFoundError: No module named 'intel_extension_for_pytorch'' for --tensor-parallel-size more than 1 (vllm-project#12546) * [core][distributed] exact ray placement control (vllm-project#12732) Signed-off-by: youkaichao <youkaichao@gmail.com> * The code assumes WARP_SIZE to be equal to 32, which is not the case on ROCm (#406) Signed-off-by: Gregory Shtrasberg <Gregory.Shtrasberg@amd.com> * Merging PR vllm-project#12536 Merged via CLI script * [Hardware][Intel-Gaudi] Enable FusedSDPA support for Intel Gaudi (HPU) * Add: Support for Sparse24Bitmask Compressed Models * [VLM] Use shared field to pass token ids to model * [Docs] Drop duplicate [source] links * [VLM] Qwen2.5-VL * [VLM] Update compatibility with transformers 4.49 * [ROCm][Kernel] Using the correct warp_size value * [Bugfix] Better FP8 supported defaults * [Misc][Easy] Remove the space from the file name * [Model] LoRA Support for Ultravox model (vllm-project#11253) * [Bugfix] Fix the test_ultravox.py's license (vllm-project#12806) Signed-off-by: Lu Fang <lufang@fb.com> * Improve `TransformersModel` UX (vllm-project#12785) * [Misc] Remove duplicated DeepSeek V2/V3 model definition (vllm-project#12793) * [Misc] Improve error message for incorrect pynvml (vllm-project#12809) Signed-off-by: youkaichao <youkaichao@gmail.com> * [Misc] Update w2 scale loading for GPTQMarlinMoE (vllm-project#12757) * [Docs] Add Google Cloud Slides (vllm-project#12814) * [Attention] Use FA3 for MLA on Hopper (vllm-project#12807) Signed-off-by: Lucas Wilkinson <lwilkinson@neuralmagic.com> * [misc] Reduce number of config file requests to HuggingFace (vllm-project#12797) Signed-off-by: EC2 Default User <ec2-user@ip-172-31-20-117.us-west-2.compute.internal> Signed-off-by: <> Co-authored-by: EC2 Default User <ec2-user@ip-172-31-20-117.us-west-2.compute.internal> * Update README.md 20250205_aiter (#407) * Update README.md 20250205_aiter * whitespace * adding VLLM_USE_AITER=0 advice * [Misc] Remove unnecessary decode call (vllm-project#12833) * [Kernel] Make rotary_embedding ops more flexible with input shape (vllm-project#12777) * [torch.compile] PyTorch 2.6 and nightly compatibility (vllm-project#12393) Signed-off-by: youkaichao <youkaichao@gmail.com> * [Doc] double quote cmake package in build.inc.md (vllm-project#12840) * [Bugfix] Fix unsupported FA version check for Turing GPU (vllm-project#12828) * [V1] LoRA Support (vllm-project#10957) Signed-off-by: Varun Sundar Rabindranath <varun@neuralmagic.com> Co-authored-by: Varun Sundar Rabindranath <varun@neuralmagic.com> * Add Bamba Model (vllm-project#10909) Signed-off-by: Yu Chin Fabian Lim <flim@sg.ibm.com> Signed-off-by: Tyler Michael Smith <tyler@neuralmagic.com> Co-authored-by: Tyler Michael Smith <tyler@neuralmagic.com> * [MISC] Check space in the file names in the pre commit checks (vllm-project#12804) Signed-off-by: Lu Fang <lufang@fb.com> * [misc] Revert # 12833 (vllm-project#12857) Signed-off-by: <> Co-authored-by: EC2 Default User <ec2-user@ip-172-31-20-117.us-west-2.compute.internal> * [Bugfix] FA2 illegal memory access (vllm-project#12848) * Make vllm compatible with verl (vllm-project#12824) Co-authored-by: zhangshulai <zhangshulai@bytedance.com> * [Bugfix] Missing quant_config in deepseek embedding layer (vllm-project#12836) * Prevent unecessary requests to huggingface hub (vllm-project#12837) * [MISC][EASY] Break check file names into entry and args in the pre-commit hooks (vllm-project#12880) Signed-off-by: Lu Fang <lufang@fb.com> * [Misc] Remove unnecessary detokenization in multimodal processing (vllm-project#12868) * PR vllm-project#12718 (vllm-project#12718) * [V1] Logprobs and prompt logprobs support (vllm-project#9880) This PR is adding support for sample logprobs & prompt logprobs to vLLM v1. New behavior: - During model execution, model runner computes sample logprobs (if user-provided logprobs setting is not None) and prompt logprobs (if user-provided prompt_logprobs setting is not None). For both sample and prompt logprobs, the engine core returns 3 vectors: token ids, token logprob values, token ranks. Ranks reflect tokens' 1-indexed positions in the vocabulary vector after sorting the vocabulary by log probability in descending order. - In scheduler.update_from_output(), sample and prompt logprobs are incorporated into the EngineCoreOutput data structure which is transferred to the engine client. If multiprocessing is enabled, then sample and prompt logprobs will be (de)serialized when the EngineCoreOutput data structure is (de)serialized. - During output processing, the LogprobsProcessor transforms the triplet of token ids, token logprobs values, and token ranks into the OpenAI-compatible List[Dict[token id,Logprob]] format (for sample and prompt logprobs respectively.) - Each Logprob instance (whether sample- or prompt-) consists of a token's log-probability, rank, and detokenized string representation. Note that logprob detokenization is handled by the LogprobsProcessor not the detokenizer. Signed-off-by: Andrew Feldman <afeldman@neuralmagic.com> Signed-off-by: Nick Hill <nhill@redhat.com> Signed-off-by: rshaw@neuralmagic.com <rshaw@neuralmagic.com> Co-authored-by: rshaw@neuralmagic.com <rshaw@neuralmagic.com> Co-authored-by: Nick Hill <nhill@redhat.com> * [ROCm] [Feature] [Doc] [Dockerfile] [BugFix] Support Per-Token-Activation Per-Channel-Weight FP8 Quantization Inferencing (vllm-project#12501) * fix rocm get_device name for moe configs (#359) * fix rocm get_device name use 'market_name' hard-code names for mi308 & mi300 * use gfx and num_CU for device name * using market_name * rename MI325_OAM to MI325X * rm (duplicate) MI300X_OAM * rename mi308 * [V1] LM Eval With Streaming Integration Tests (vllm-project#11590) * [Bugfix] Fix disagg hang caused by the prefill and decode communication issues (vllm-project#12723) Signed-off-by: Lu Fang <lufang@fb.com> * [V1][Minor] Remove outdated comment (vllm-project#12928) Signed-off-by: Woosuk Kwon <woosuk.kwon@berkeley.edu> * [V1] Move KV block hashes from Request to KVCacheManager (vllm-project#12922) Signed-off-by: Woosuk Kwon <woosuk.kwon@berkeley.edu> * [Bugfix] Fix Qwen2_5_VLForConditionalGeneration packed_modules_mapping (vllm-project#12905) * [Misc] Fix typo in the example file (vllm-project#12896) Signed-off-by: Zhao Ke <yingxiongraomingzk@gmail.com> * [Bugfix] Fix multi-round chat error when mistral tokenizer is used (vllm-project#12859) Signed-off-by: Zifei Tong <zifeitong@gmail.com> Co-authored-by: Cyrus Leung <cyrus.tl.leung@gmail.com> * [bugfix] respect distributed_executor_backend in world_size=1 (vllm-project#12934) Signed-off-by: youkaichao <youkaichao@gmail.com> * [Misc] Add offline test for disaggregated prefill (vllm-project#12418) * [V1][Minor] Move cascade attn logic outside _prepare_inputs (vllm-project#12943) Signed-off-by: Woosuk Kwon <woosuk.kwon@berkeley.edu> * [Build] Make pypi install work on CPU platform (vllm-project#12874) * [Hardware][Intel-Gaudi] Enable long-contexts + LoRA support for Intel Gaudi (vllm-project#12812) Signed-off-by: Sanju C Sudhakaran <scsudhakaran@habana.ai> * [misc] Add LoRA to benchmark_serving (vllm-project#12898) Signed-off-by: Varun Sundar Rabindranath <varun@neuralmagic.com> Co-authored-by: Varun Sundar Rabindranath <varun@neuralmagic.com> * [Misc] Log time consumption on weight downloading (vllm-project#12926) * [CI] Resolve transformers-neuronx version conflict (vllm-project#12925) * [Doc] Correct HF repository for TeleChat2 models (vllm-project#12949) * [Misc] Add qwen2.5-vl BNB support (vllm-project#12944) * [CI/Build] Auto-fix Markdown files (vllm-project#12941) * [Bugfix] Remove unused seq_group_metadata_list from ModelInputForGPU (vllm-project#12935) Signed-off-by: Shangming Cai <caishangming@linux.alibaba.com> * [bugfix] fix early import of flash attention (vllm-project#12959) Signed-off-by: youkaichao <youkaichao@gmail.com> * [VLM] Merged multi-modal processor for GLM4V (vllm-project#12449) Signed-off-by: Jee Jee Li <pandaleefree@gmail.com> * [V1][Minor] Remove outdated comment (vllm-project#12968) Signed-off-by: Woosuk Kwon <woosuk.kwon@berkeley.edu> * [RFC] [Mistral] FP8 format (vllm-project#10130) Signed-off-by: mgoin <mgoin64@gmail.com> Co-authored-by: mgoin <mgoin64@gmail.com> * [V1] Cache `uses_mrope` in GPUModelRunner (vllm-project#12969) * [core] port pynvml into vllm codebase (vllm-project#12963) Signed-off-by: youkaichao <youkaichao@gmail.com> * [MISC] Always import version library first in the vllm package (vllm-project#12979) Signed-off-by: Lu Fang <lufang@fb.com> * [core] improve error handling when wake up from sleep mode (vllm-project#12981) Signed-off-by: youkaichao <youkaichao@gmail.com> * [core][rlhf] add colocate example for RLHF (vllm-project#12984) Signed-off-by: youkaichao <youkaichao@gmail.com> * [V1] Use msgpack for core request serialization (vllm-project#12918) Signed-off-by: Nick Hill <nhill@redhat.com> * Check if selected backend is None in get_attn_backend_cls() (vllm-project#12975) Signed-off-by: Yuan Tang <terrytangyuan@gmail.com> * [core] fix sleep mode and pytorch checkpoint compatibility (vllm-project#13001) Signed-off-by: youkaichao <youkaichao@gmail.com> * [Doc] Add link to tool_choice tracking issue in tool_calling.md (vllm-project#13003) Signed-off-by: Yuan Tang <terrytangyuan@gmail.com> * [misc] Add retries with exponential backoff for HF file existence check (vllm-project#13008) * [Bugfix] Clean up and fix multi-modal processors (vllm-project#13012) Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk> * Fix seed parameter behavior in vLLM (vllm-project#13007) Signed-off-by: மனோஜ்குமார் பழனிச்சாமி <smartmanoj42857@gmail.com> * Fixing the output formatting (#414) * [Model] Ultravox Model: Support v0.5 Release (vllm-project#12912) Signed-off-by: Farzad Abdolhosseini <farzad@fixie.ai> * [misc] Fix setup.py condition to avoid AMD from being mistaken with CPU (vllm-project#13022) Signed-off-by: kevin <kevin@anyscale.com> * [V1][Minor] Move scheduler outputs to a separate file (vllm-project#13062) Signed-off-by: Woosuk Kwon <woosuk.kwon@berkeley.edu> * [Docs] Annouce Meta Meetup (vllm-project#13065) Signed-off-by: simon-mo <simon.mo@hey.com> * [Bugfix] Support missing tool parameters in mistral tokenizer (vllm-project#12884) Signed-off-by: Florian Greinacher <florian.greinacher@siemens.com> * [Benchmark] Add BurstGPT to benchmark_serving (vllm-project#13063) Signed-off-by: Woosuk Kwon <woosuk.kwon@berkeley.edu> Co-authored-by: Roger Wang <136131678+ywang96@users.noreply.github.com> * [Core] Don't do platform detection at import time (vllm-project#12933) Signed-off-by: Russell Bryant <rbryant@redhat.com> * [Misc] LoRA - Refactor Punica ops tests (vllm-project#12970) Signed-off-by: Varun Sundar Rabindranath <varun@neuralmagic.com> Co-authored-by: Varun Sundar Rabindranath <varun@neuralmagic.com> * [Bugfix]: Reasoning output bug according to the chat template change (vllm-project#13025) Signed-off-by: Ce Gao <cegao@tensorchord.ai> * [V1][Metrics] Add GPU prefix cache hit rate % gauge (vllm-project#12592) * [executor] init `local_rank` as device index (vllm-project#13027) Signed-off-by: Mengqing Cao <cmq0113@163.com> * [ROCm] Using a more precise memory profiling (vllm-project#12624) Signed-off-by: Gregory Shtrasberg <Gregory.Shtrasberg@amd.com> * [Build] Fix cuda link target of cumem_allocator in CPU env (vllm-project#12863) Signed-off-by: YuhongGuo <yuhong.gyh@antgroup.com> Co-authored-by: Tyler Michael Smith <tyler@neuralmagic.com> * [Platform] add pre_register_and_update function (vllm-project#12432) Signed-off-by: wangxiyuan <wangxiyuan1007@gmail.com> * [Bugfix] fix flaky test (vllm-project#13089) Signed-off-by: மனோஜ்குமார் பழனிச்சாமி <smartmanoj42857@gmail.com> * [V1][Metrics] Add several request timing histograms (vllm-project#12644) Signed-off-by: Mark McLoughlin <markmc@redhat.com> * Set `torch_dtype` in `TransformersModel` (vllm-project#13088) Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com> * [Misc] Fix typo at comments at metrics.py (vllm-project#13024) * [Bugfix] Do not use resource module on Windows (vllm-project#12858) (vllm-project#13029) * [BugFix] Pop instead of del CUDA_VISIBLE_DEVICES (vllm-project#12962) Signed-off-by: Hollow Man <hollowman@opensuse.org> * Fix initializing GGUF weights for ColumnParallelLinear when using tensor parallel > 1 (vllm-project#13023) * Add tuned moe config for qwen1.5_moe_A2.7B (#398) * Add tuned moe config for qwen1.5_moe_A2.7B * Add more sweep parameters on qwen2_moe * Add tp = 1,2,4,8 after applying PR12838 * Rename config name by deleting "_OAM" --------- Co-authored-by: Gregory Shtrasberg <156009573+gshtras@users.noreply.github.com> Co-authored-by: Divakar Verma <137818590+divakar-amd@users.noreply.github.com> * [CI/Build][Bugfix] Fix CPU backend default threads num (vllm-project#13077) * Removing non-existent parameter * [Doc] Improve OpenVINO installation doc (vllm-project#13102) Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com> * [Bugfix] Guided decoding falls back to outlines when fails to import xgrammar (vllm-project#12976) Signed-off-by: Yuan Tang <terrytangyuan@gmail.com> * [Misc] Move pre-commit suggestion back to the end (vllm-project#13114) Signed-off-by: Russell Bryant <rbryant@redhat.com> * [RFC][vllm-API] Support tokenizer registry for customized tokenizer in vLLM (vllm-project#12518) Signed-off-by: Keyun Tong <tongkeyun@gmail.com> * [Model] IBM/NASA Prithvi Geospatial model (vllm-project#12830) * [ci] Add more source file dependencies for some tests (vllm-project#13123) Signed-off-by: <> Co-authored-by: EC2 Default User <ec2-user@ip-172-31-20-117.us-west-2.compute.internal> * [Neuron][Kernel] Support Longer Sequences in NKI-based Flash PagedAttention and Improve Efficiency (vllm-project#12921) Signed-off-by: Lingfan Yu <lingfany@amazon.com> * Bump helm/kind-action from 1.10.0 to 1.12.0 (vllm-project#11612) * Bump actions/stale from 9.0.0 to 9.1.0 (vllm-project#12462) * Bump helm/chart-testing-action from 2.6.1 to 2.7.0 (vllm-project#12463) * Bump actions/setup-python from 5.3.0 to 5.4.0 (vllm-project#12672) * Further reduce the HTTP calls to huggingface.co (vllm-project#13107) * [Misc] AMD Build Improvements (vllm-project#12923) * [Bug] [V1] Try fetching stop_reason from EngineOutput before checking the request (vllm-project#13108) * [Bugfix] Fix num video tokens calculation for Qwen2-VL (vllm-project#13148) Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk> * [Frontend] Generate valid tool call IDs when using `tokenizer-mode=mistral` (vllm-project#12332) * [Misc] Delete unused LoRA modules (vllm-project#13151) * Introduce VLLM_CUDART_SO_PATH to allow users specify the .so path (vllm-project#12998) Signed-off-by: Lu Fang <lufang@fb.com> * [CI/Build] Use mypy matcher for pre-commit CI job (vllm-project#13162) Signed-off-by: Russell Bryant <rbryant@redhat.com> * Update Benchmark Profiling Scripts (#417) * Update profiling benchmarks * Fix linter errors --------- Co-authored-by: AdrianAbeyta <Adrian.Abeyta@amd.com> * [CORE] [QUANT] Support for GPTQModel's `dynamic` quantization per module override/control (vllm-project#7086) * [Bugfix] Allow fallback to AWQ from AWQMarlin at per-layer granularity (vllm-project#13119) * DS V2V3 fix for same file * Lint * updating manfiest (#416) * [CI] Fix failing FP8 cpu offload test (vllm-project#13170) Signed-off-by: mgoin <mgoin64@gmail.com> * Aiter base (#419) * Using upstream FA repo. Building aiter in the base docker image * Renaming the file to match upstream naming * [V1][Bugfix] Copy encoder input ids to fix set iteration issue during VLM abort (vllm-project#13173) Signed-off-by: andoorve <37849411+andoorve@users.noreply.github.com> * [CI/Build] Ignore ruff warning up007 (vllm-project#13182) Signed-off-by: Russell Bryant <rbryant@redhat.com> * [perf-benchmark] cleanup unused Docker images and volumes in H100 benchmark instance (vllm-project#12706) * [NVIDIA] Support nvfp4 quantization (vllm-project#12784) * [Bugfix][Example] Fix GCed profiling server for TPU (vllm-project#12792) Signed-off-by: mgoin <michael@neuralmagic.com> * [VLM] Implement merged multimodal processor for Mllama (vllm-project#11427) * Simplify logic of locating CUDART so file path (vllm-project#13203) Signed-off-by: Lu Fang <lufang@fb.com> * [Build] Automatically use the wheel of the base commit with Python-only build (vllm-project#13178) * [Bugfix] deepseek_r1_reasoning_parser put reason content in wrong field in certain edge case (vllm-project#13097) * [Frontend] Move CLI code into vllm.cmd package (vllm-project#12971) * Allow Unsloth Dynamic 4bit BnB quants to work (vllm-project#12974) * [CI/Build] Allow ruff to auto-fix some issues (vllm-project#13180) Signed-off-by: Russell Bryant <rbryant@redhat.com> * [V1][core] Implement pipeline parallel on Ray (vllm-project#12996) * [VLM] Remove input processor from clip and siglip (vllm-project#13165) * [Frontend] Pass pre-created socket to uvicorn (vllm-project#13113) * [V1] Clarify input processing and multimodal feature caching logic (vllm-project#13211) * [VLM] Merged multi-modal processor for Molmo (vllm-project#12966) * [V1][Core] Add worker_base for v1 worker (vllm-project#12816) Signed-off-by: Aoyu <aoyuzhan@amazon.com> Signed-off-by: youkaichao <youkaichao@gmail.com> Co-authored-by: Aoyu <aoyuzhan@amazon.com> Co-authored-by: youkaichao <youkaichao@gmail.com> * [Misc] Qwen2.5-VL Optimization (vllm-project#13155) * [VLM] Separate text-only and vision variants of the same model architecture (vllm-project#13157) * [Bugfix] Missing Content Type returns 500 Internal Server Error (vllm-project#13193) * [Frontend] Add `/v1/audio/transcriptions` OpenAI API endpoint (vllm-project#12909) * Initial attempt to adjust codeowners to the ROCm fork (#420) * Applying weight padding to deepseek (#421) * Add label if pre-commit passes (vllm-project#12527) Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com> * [Model] DeepSeek Tunings (#423) * fused_moe config for DSv3 on MI300X updated * Add tuning script and post processing script Signed-off-by: Randall Smith <Randall.Smith@amd.com> * Add modification to fp8_utils for tuning Signed-off-by: Randall Smith <Randall.Smith@amd.com> * update tuning script and add the configs Signed-off-by: Randall Smith <Randall.Smith@amd.com> * slightly better tunings Signed-off-by: Randall Smith <Randall.Smith@amd.com> * benchmark_moe.py is updated to generate more accurate MoE configs and a specific MoE config for DSv3 is added * Bug in sgl_moe_align_block_size() is fixed by Greg * Generate fp8_w8a8 config for MI300XHF * tunings that don't give garbage output Signed-off-by: Randall Smith <Randall.Smith@amd.com> * More accurate tunings Signed-off-by: Randall Smith <Randall.Smith@amd.com> * More accurate tunings and reject inaccurate configs Signed-off-by: Randall Smith <Randall.Smith@amd.com> * add new tunings Signed-off-by: Randall Smith <Randall.Smith@amd.com> * rename tuning script and add benchmark script to use for optimizing blockwise quant Signed-off-by: Randall Smith <Randall.Smith@amd.com> * remove white space from file names Signed-off-by: Randall Smith <Randall.Smith@amd.com> * remove white space from file names Signed-off-by: Randall Smith <Randall.Smith@amd.com> * Remove some unnecessary changes Signed-off-by: Randall Smith <Randall.Smith@amd.com> * don't use space in file names Signed-off-by: Randall Smith <Randall.Smith@amd.com> * remove XHF tunings Signed-off-by: Randall Smith <Randall.Smith@amd.com> * remove OAM from file name Signed-off-by: Randall Smith <Randall.Smith@amd.com> * rmeove OAM from file names Signed-off-by: Randall Smith <Randall.Smith@amd.com> * yapf Signed-off-by: Randall Smith <Randall.Smith@amd.com> * update config name Signed-off-by: Randall Smith <Randall.Smith@amd.com> * remove benchmark_moe.py changes Signed-off-by: Randall Smith <Randall.Smith@amd.com> * remove is_contiguous Signed-off-by: Randall Smith <Randall.Smith@amd.com> * use more recent fp8_utils.py Signed-off-by: Randall Smith <Randall.Smith@amd.com> * remove is_contiguous Signed-off-by: Randall Smith <Randall.Smith@amd.com> --------- Signed-off-by: Randall Smith <Randall.Smith@amd.com> Co-authored-by: qli88 <qiang.li2@amd.com> * Optimize moe_align_block_size for deepseek_v3 (vllm-project#12850) Signed-off-by: mgoin <mgoin64@gmail.com> * [Kernel][Bugfix] Refactor and Fix CUTLASS 2:4 Sparse Kernels (vllm-project#13198) Signed-off-by: Tyler Michael Smith <tyler@neuralmagic.com> * Revert "Add label if pre-commit passes" (vllm-project#13242) * [ROCm] Avoid using the default stream on ROCm (vllm-project#13238) Signed-off-by: Gregory Shtrasberg <Gregory.Shtrasberg@amd.com> * [Kernel] Fix awq error when n is not divisable by 128 (vllm-project#13227) * [V1] Consolidate MM cache size to vllm.envs (vllm-project#13239) * [Bugfix/CI] Turn test_compressed_tensors_2of4_sparse back on (vllm-project#13250) * [Bugfix][CI] Inherit codespell settings from pyproject.toml in the pre-commit-config (vllm-project#13237) * [Bugfix] Offline example of disaggregated prefill (vllm-project#13214) * [Misc] Remove redundant statements in scheduler.py (vllm-project#13229) * Consolidate Llama model usage in tests (vllm-project#13094) * Expand MLA to support most types of quantization (vllm-project#13181) * [V1] LoRA - Enable Serving Usecase (vllm-project#12883) Signed-off-by: Varun Sundar Rabindranath <varun@neuralmagic.com> Co-authored-by: Varun Sundar Rabindranath <varun@neuralmagic.com> * [ROCm][V1] Add intial ROCm support to V1 (vllm-project#12790) * [Bugfix][V1] GPUModelRunner._update_states should return True when there is a finished request in batch (vllm-project#13126) * [WIP] TPU V1 Support Refactored (vllm-project#13049) * [Frontend] Optionally remove memory buffer used for uploading to URLs in run_batch (vllm-project#12927) Signed-off-by: Pooya Davoodi <pooya.davoodi@parasail.io> * [Bugfix] Fix missing parentheses (vllm-project#13263) * [Misc] Log time consumption of sleep and wake-up (vllm-project#13115) Signed-off-by: Jun Duan <jun.duan.phd@outlook.com> * [VLM] Keep track of whether prompt replacements have been applied (vllm-project#13215) * [V1] Simplify GPUModelRunner._update_states check (vllm-project#13265) * Support logit_bias in v1 Sampler (vllm-project#13079) * [Core] choice-based structured output with xgrammar (vllm-project#12632) * [Hardware][Gaudi][Bugfix] Fix error for guided decoding (vllm-project#12317) * Removing bad config (#425) * The order in the file is important. One needs to be explicitly be added to each following path for their ownership to apply (#427) * [Quant][Perf] Use moe_wna16 kernel by default for MoEs with many experts (vllm-project#13236) Signed-off-by: mgoin <mgoin64@gmail.com> * [Core] Reduce TTFT with concurrent partial prefills (vllm-project#10235) Signed-off-by: Joe Runde <Joseph.Runde@ibm.com> Signed-off-by: Prashant Gupta <prashantgupta@us.ibm.com> Co-authored-by: Prashant Gupta <prashantgupta@us.ibm.com> Co-authored-by: Cody Yu <hao.yu.cody@gmail.com> * [V1][Core] min_p sampling support (vllm-project#13191) Signed-off-by: Aoyu <aoyuzhan@amazon.com> Co-authored-by: Aoyu <aoyuzhan@amazon.com> * [V1][CI] Fix failed v1-test because of min_p (vllm-project#13316) Signed-off-by: Woosuk Kwon <woosuk.kwon@berkeley.edu> * [V1][Sampler] Don't apply temp for greedy-only (vllm-project#13311) Signed-off-by: Nick Hill <nhill@redhat.com> * [V1][PP] Fix memory profiling in PP (vllm-project#13315) Signed-off-by: Woosuk Kwon <woosuk.kwon@berkeley.edu> * [Bugfix][AMD] Update torch_bindings so that scaled_fp4_quant isn't build on ROCm (vllm-project#13235) * [Bugfix][Docs] Fix offline Whisper (vllm-project#13274) * [Bugfix] Massage MLA's usage of flash attn for RoCM (vllm-project#13310) * [BugFix] Don't scan entire cache dir when loading model (vllm-project#13302) * [Bugfix]Fix search start_index of stop_checker (vllm-project#13280) * [Bugfix] Fix qwen2.5-vl image processor (vllm-project#13286) * [V1][Metrics] Add iteration_tokens_total histogram from V0 (vllm-project#13288) * [AMD] [Model] DeepSeek tunings (vllm-project#13199) * [V1][PP] Run engine busy loop with batch queue (vllm-project#13064) * [ci/build] update flashinfer (vllm-project#13323) * [Doc] [2/N] Add Fuyu E2E example for multimodal processor (vllm-project#13331) * [V1][Spec Decode] Ngram Spec Decode (vllm-project#12193) Signed-off-by: LiuXiaoxuanPKU <lilyliupku@gmail.com> * [Quant] Add `SupportsQuant` to phi3 and clip (vllm-project#13104) * [Bugfix] Pin xgrammar to 0.1.11 (vllm-project#13338) * avoid calling hf_list_repo_files for local model Signed-off-by: isotr0py <2037008807@qq.com> * annotation Signed-off-by: isotr0py <2037008807@qq.com> * [BugFix] Enhance test_pos_encoding to support execution on multi-devices (vllm-project#13187) Signed-off-by: wchen61 <wchen61@foxmail.com> * [V1] Update doc and examples for H2O-VL (vllm-project#13349) Signed-off-by: Roger Wang <ywang@roblox.com> * [ci] skip failed tests for flashinfer (vllm-project#13352) Signed-off-by: youkaichao <youkaichao@gmail.com> * [platform] add base class for communicators (vllm-project#13208) Signed-off-by: youkaichao <youkaichao@gmail.com> * [Bugfix] Fix 2 Node and Spec Decode tests (vllm-project#13341) Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk> * [Docs] Change myenv to vllm. Update python_env_setup.inc.md (vllm-project#13325) * [V1][BugFix] Add __init__.py to v1/spec_decode/ (vllm-project#13359) Signed-off-by: Woosuk Kwon <woosuk.kwon@berkeley.edu> * [V1][PP] Cache Intermediate Tensors (vllm-project#13353) Signed-off-by: Woosuk Kwon <woosuk.kwon@berkeley.edu> * [Bugfix][Platform][CPU] Fix cuda platform detection on CPU backend edge case (vllm-project#13358) Signed-off-by: Isotr0py <2037008807@qq.com> * [V1][BugFix] Clean up rejection sampler & Fix warning msg (vllm-project#13362) Signed-off-by: Woosuk Kwon <woosuk.kwon@berkeley.edu> * [V1][Misc] Avoid unnecessary log output (vllm-project#13289) * [Feature][Spec Decode] Simplify the use of Eagle Spec Decode (vllm-project#12304) Signed-off-by: Shangming Cai <caishangming@linux.alibaba.com> * Fix spelling error in index.md (vllm-project#13369) * Run v1 benchmark and integrate with PyTorch OSS benchmark database (vllm-project#13068) Signed-off-by: Huy Do <huydhn@gmail.com> * [MISC] tiny fixes (vllm-project#13378) * [VLM] Check required fields before initializing field config in `DictEmbeddingItems` (vllm-project#13380) * [Model] Support Mamba2 (Codestral Mamba) (vllm-project#9292) Signed-off-by: Tyler Michael Smith <tyler@neuralmagic.com> Co-authored-by: Yu Chin Fabian Lim <flim@sg.ibm.com> * [Bugfix] fix xpu communicator (vllm-project#13368) Signed-off-by: yan ma <yan.ma@intel.com> * [Bugfix] Fix VLLM_USE_MODELSCOPE issue (vllm-project#13384) * Updating PR template to point people to the upstream repo. Updating codeowners (#431) * Enabling the ROCm-vLLM CI on MI250 machines (#432) * Enabling ROCm CI on MI250 machines: - correct build target - correct queue Signed-off-by: Alexei V. Ivanov <alexei.ivanov@amd.com> --------- Signed-off-by: Alexei V. Ivanov <alexei.ivanov@amd.com> * Optimization for quantized gemm skinny sizes (#411) * Optimization for quantized gemm skinny sizes * lint fix * Add support for bf16/fp16 * code cleanup * code cleanup * lint fix2 * cleanup * Moved the logic into tuned gemm to preserve API compatibility --------- Co-authored-by: Gregory Shtrasberg <156009573+gshtras@users.noreply.github.com> Co-authored-by: Gregory Shtrasberg <Gregory.Shtrasberg@amd.com> * Restricting FP8 wvSplitk to MI300x (#439) * Remove mi300a (#440) * Removing gfx940 and gfx941 targets. These have been deprecated in favor of gfx942 for MI300X Signed-off-by: Gregory Shtrasberg <Gregory.Shtrasberg@amd.com> * Remove from custom kernels as well --------- Signed-off-by: Gregory Shtrasberg <Gregory.Shtrasberg@amd.com> * resolve diff for mixtral8x7B configs (#437) Signed-off-by: Divakar Verma <divakar.verma@amd.com> * Torch version bump to fix tunable ops (#442) * Advance torch commit to be past pytorch/pytorch#144942 to fix tunable ops * Make sure to use the submodule commit compatible with the main aiter commit * bugfix: remove unused argument passed to the forward pass of ReplicatedLinear layer Signed-off-by: vllmellm <vllm.ellm@embeddedllm.com> --------- Signed-off-by: Aleksandr Malyshev <maleksan@amd.com> Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com> Signed-off-by: mgoin <michael@neuralmagic.com> Signed-off-by: Kyle Sayers <kylesayrs@gmail.com> Signed-off-by: youkaichao <youkaichao@gmail.com> Signed-off-by: Gregory Shtrasberg <Gregory.Shtrasberg@amd.com> Signed-off-by: Lu Fang <lufang@fb.com> Signed-off-by: Lucas Wilkinson <lwilkinson@neuralmagic.com> Signed-off-by: EC2 Default User <ec2-user@ip-172-31-20-117.us-west-2.compute.internal> Signed-off-by: <> Signed-off-by: Varun Sundar Rabindranath <varun@neuralmagic.com> Signed-off-by: Yu Chin Fabian Lim <flim@sg.ibm.com> Signed-off-by: Tyler Michael Smith <tyler@neuralmagic.com> Signed-off-by: Woosuk Kwon <woosuk.kwon@berkeley.edu> Signed-off-by: Zhao Ke <yingxiongraomingzk@gmail.com> Signed-off-by: Zifei Tong <zifeitong@gmail.com> Signed-off-by: Sanju C Sudhakaran <scsudhakaran@habana.ai> Signed-off-by: Shangming Cai <caishangming@linux.alibaba.com> Signed-off-by: Jee Jee Li <pandaleefree@gmail.com> Signed-off-by: mgoin <mgoin64@gmail.com> Signed-off-by: Nick Hill <nhill@redhat.com> Signed-off-by: Yuan Tang <terrytangyuan@gmail.com> Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk> Signed-off-by: மனோஜ்குமார் பழனிச்சாமி <smartmanoj42857@gmail.com> Signed-off-by: Farzad Abdolhosseini <farzad@fixie.ai> Signed-off-by: kevin <kevin@anyscale.com> Signed-off-by: simon-mo <simon.mo@hey.com> Signed-off-by: Florian Greinacher <florian.greinacher@siemens.com> Signed-off-by: Russell Bryant <rbryant@redhat.com> Signed-off-by: Ce Gao <cegao@tensorchord.ai> Signed-off-by: Mengqing Cao <cmq0113@163.com> Signed-off-by: YuhongGuo <yuhong.gyh@antgroup.com> Signed-off-by: wangxiyuan <wangxiyuan1007@gmail.com> Signed-off-by: Mark McLoughlin <markmc@redhat.com> Signed-off-by: Hollow Man <hollowman@opensuse.org> Signed-off-by: Keyun Tong <tongkeyun@gmail.com> Signed-off-by: Lingfan Yu <lingfany@amazon.com> Signed-off-by: andoorve <37849411+andoorve@users.noreply.github.com> Signed-off-by: Aoyu <aoyuzhan@amazon.com> Signed-off-by: Randall Smith <Randall.Smith@amd.com> Signed-off-by: Pooya Davoodi <pooya.davoodi@parasail.io> Signed-off-by: Jun Duan <jun.duan.phd@outlook.com> Signed-off-by: Joe Runde <Joseph.Runde@ibm.com> Signed-off-by: Prashant Gupta <prashantgupta@us.ibm.com> Signed-off-by: LiuXiaoxuanPKU <lilyliupku@gmail.com> Signed-off-by: isotr0py <2037008807@qq.com> Signed-off-by: wchen61 <wchen61@foxmail.com> Signed-off-by: Roger Wang <ywang@roblox.com> Signed-off-by: Isotr0py <2037008807@qq.com> Signed-off-by: Huy Do <huydhn@gmail.com> Signed-off-by: yan ma <yan.ma@intel.com> Signed-off-by: Alexei V. Ivanov <alexei.ivanov@amd.com> Signed-off-by: Divakar Verma <divakar.verma@amd.com> Signed-off-by: vllmellm <vllm.ellm@embeddedllm.com> Co-authored-by: Aleksandr Malyshev <164964928+maleksan85@users.noreply.github.com> Co-authored-by: Aleksandr Malyshev <maleksan@amd.com> Co-authored-by: Harry Mellor <19981378+hmellor@users.noreply.github.com> Co-authored-by: Isotr0py <mozf@mail2.sysu.edu.cn> Co-authored-by: Dipika Sikka <dipikasikka1@gmail.com> Co-authored-by: Kyle Sayers <kylesayrs@gmail.com> Co-authored-by: mgoin <michael@neuralmagic.com> Co-authored-by: Nick Hill <nickhill@us.ibm.com> Co-authored-by: Akash kaothalkar <61960177+Akashcodes732@users.noreply.github.com> Co-authored-by: youkaichao <youkaichao@gmail.com> Co-authored-by: Gregory Shtrasberg <156009573+gshtras@users.noreply.github.com> Co-authored-by: Chen Zhang <zhangch99@outlook.com> Co-authored-by: Sanju C Sudhakaran <scsudhakaran@habana.ai> Co-authored-by: Rahul Tuli <rahul@neuralmagic.com> Co-authored-by: Cyrus Leung <tlleungac@connect.ust.hk> Co-authored-by: Russell Bryant <rbryant@redhat.com> Co-authored-by: Roger Wang <136131678+ywang96@users.noreply.github.com> Co-authored-by: Lucas Wilkinson <LucasWilkinson@users.noreply.github.com> Co-authored-by: Lu Fang <30275821+houseroad@users.noreply.github.com> Co-authored-by: Sumit Vij <sumitvij11+github@gmail.com> Co-authored-by: Simon Mo <simon.mo@hey.com> Co-authored-by: Kevin H. Luu <kevin@anyscale.com> Co-authored-by: EC2 Default User <ec2-user@ip-172-31-20-117.us-west-2.compute.internal> Co-authored-by: arakowsk-amd <182798202+arakowsk-amd@users.noreply.github.com> Co-authored-by: Jitse Klomp <jitse@jitseklomp.nl> Co-authored-by: Varun Sundar Rabindranath <varunsundar08@gmail.com> Co-authored-by: Varun Sundar Rabindranath <varun@neuralmagic.com> Co-authored-by: Yu Chin Fabian Lim <fabianlim@users.noreply.github.com> Co-authored-by: Tyler Michael Smith <tyler@neuralmagic.com> Co-authored-by: ZSL98 <36250440+ZSL98@users.noreply.github.com> Co-authored-by: zhangshulai <zhangshulai@bytedance.com> Co-authored-by: Szymon Ożóg <58388001+SzymonOzog@users.noreply.github.com> Co-authored-by: Maximilien de Bayser <mbayser@br.ibm.com> Co-authored-by: Amit Garg <mitgarg17495@gmail.com> Co-authored-by: afeldman-nm <156691304+afeldman-nm@users.noreply.github.com> Co-authored-by: rshaw@neuralmagic.com <rshaw@neuralmagic.com> Co-authored-by: Nick Hill <nhill@redhat.com> Co-authored-by: Divakar Verma <137818590+divakar-amd@users.noreply.github.com> Co-authored-by: Robert Shaw <114415538+robertgshaw2-redhat@users.noreply.github.com> Co-authored-by: Woosuk Kwon <woosuk.kwon@berkeley.edu> Co-authored-by: Jee Jee Li <pandaleefree@gmail.com> Co-authored-by: Ke Zhao <yingxiongraomingzk@gmail.com> Co-authored-by: zifeitong <zifeitong@gmail.com> Co-authored-by: Cyrus Leung <cyrus.tl.leung@gmail.com> Co-authored-by: Shaoting <shaotingf@uchicago.edu> Co-authored-by: wangxiyuan <wangxiyuan1007@gmail.com> Co-authored-by: Jun Duan <jun.duan.phd@outlook.com> Co-authored-by: Liangfu Chen <liangfc@amazon.com> Co-authored-by: shangmingc <caishangming@linux.alibaba.com> Co-authored-by: Patrick von Platen <patrick.v.platen@gmail.com> Co-authored-by: mgoin <mgoin64@gmail.com> Co-authored-by: Yuan Tang <terrytangyuan@gmail.com> Co-authored-by: மனோஜ்குமார் பழனிச்சாமி <smartmanoj42857@gmail.com> Co-authored-by: Farzad Abdolhosseini <farzad.abdolhosseini@gmail.com> Co-authored-by: Gregory Shtrasberg <Gregory.Shtrasberg@amd.com> Co-authored-by: Florian Greinacher <florian.greinacher@siemens.com> Co-authored-by: Ce Gao <cegao@tensorchord.ai> Co-authored-by: Cody Yu <hao.yu.cody@gmail.com> Co-authored-by: Mengqing Cao <cmq0113@163.com> Co-authored-by: Yuhong Guo <yuhong.gyh@antgroup.com> Co-authored-by: Mark McLoughlin <markmc@redhat.com> Co-authored-by: Jewon Lee <105219284+je1lee@users.noreply.github.com> Co-authored-by: MoonRide303 <130458190+MoonRide303@users.noreply.github.com> Co-authored-by: ℍ𝕠𝕝𝕝𝕠𝕨 𝕄𝕒𝕟 <hollowman@opensuse.org> Co-authored-by: sky0530 <weiching0530@gmail.com> Co-authored-by: Li, Jiang <jiang1.li@intel.com> Co-authored-by: Keyun Tong <tongkeyun@gmail.com> Co-authored-by: Christian Pinto <chrpinto@gmail.com> Co-authored-by: Lingfan Yu <lingfany@amazon.com> Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com> Co-authored-by: Shiyan Deng <842974287@qq.com> Co-authored-by: bnellnm <49004751+bnellnm@users.noreply.github.com> Co-authored-by: Rafael Vasquez <rafvasq21@gmail.com> Co-authored-by: Adrian Abeyta <adabeyta@amd.com> Co-authored-by: AdrianAbeyta <Adrian.Abeyta@amd.com> Co-authored-by: Qubitium-ModelCloud <qubitium@modelcloud.ai> Co-authored-by: Yida Wu <yida.wu@amd.com> Co-authored-by: Murali Andoorveedu <37849411+andoorve@users.noreply.github.com> Co-authored-by: Kaixi Hou <kaixih@nvidia.com> Co-authored-by: LikeSundayLikeRain <monsoon1013@gmail.com> Co-authored-by: Daniel Han <danielhanchen@gmail.com> Co-authored-by: Rui Qiao <161574667+ruisearch42@users.noreply.github.com> Co-authored-by: Aoyu <aoyuzhang1989@gmail.com> Co-authored-by: Aoyu <aoyuzhan@amazon.com> Co-authored-by: 燃 <wulipc@163.com> Co-authored-by: Vaibhav Jain <vajain@redhat.com> Co-authored-by: Nicolò Lucchesi <nlucches@redhat.com> Co-authored-by: rasmith <Randall.Smith@amd.com> Co-authored-by: qli88 <qiang.li2@amd.com> Co-authored-by: Jinzhen Lin <linjinzhen@hotmail.com> Co-authored-by: XiaobingZhang <xiaobingzhangupc@gmail.com> Co-authored-by: Wang Ran (汪然) <wrran@outlook.com> Co-authored-by: Sage Moore <sage@neuralmagic.com> Co-authored-by: Kero Liang <kerorek@outlook.com> Co-authored-by: Alexander Matveev <59768536+alexm-redhat@users.noreply.github.com> Co-authored-by: Pooya Davoodi <pooya.davoodi@parasail.io> Co-authored-by: Xu Song <xusong.vip@gmail.com> Co-authored-by: Yu-Zhou <yu.zhou@intel.com> Co-authored-by: Joe Runde <Joseph.Runde@ibm.com> Co-authored-by: Prashant Gupta <prashantgupta@us.ibm.com> Co-authored-by: Lily Liu <lilyliupku@gmail.com> Co-authored-by: isotr0py <2037008807@qq.com> Co-authored-by: wchen61 <wchen61@foxmail.com> Co-authored-by: 凌 <i@ioioi.cn> Co-authored-by: yankooo <948162199@qq.com> Co-authored-by: Huy Do <huydhn@gmail.com> Co-authored-by: Yu Chin Fabian Lim <flim@sg.ibm.com> Co-authored-by: Yan Ma <yan.ma@intel.com> Co-authored-by: r.4ntix <antix.blue@gmail.com> Co-authored-by: Alexei-V-Ivanov-AMD <156011006+Alexei-V-Ivanov-AMD@users.noreply.github.com> Co-authored-by: Hashem Hashemi <159079214+amd-hhashemi@users.noreply.github.com> Co-authored-by: vllmellm <vllm.ellm@embeddedllm.com>

liweiqing1997 · 2025-03-13T02:15:42Z

Can this method support MOE models such as DeepSeekv3?

Qubitium · 2025-03-13T02:18:03Z

@liweiqing1997 There is no limitation to this feature. If you use GPTQModel for quantization, you can switch any module to any bit rate/group_size. It is not limited to specific model.

liweiqing1997 · 2025-03-13T08:22:48Z

@Qubitium

The issue described involves a KeyError when attempting to deploy a quantized DeepSeek-V2-Lite-Chat model using the vllm framework. The error occurs during the weight-loading process, where the key names in the model's named_parameters() dictionary (params_dict) do not match the key names in the quantized weight file. Specifically:

Key Mismatch:

The original key in self.named_parameters() is 'model.layers.21.mlp.experts.w2_qweight'.
The processed key in the quantized weight file is 'model.layers.21.mlp.experts.w2_weight'.

Error Cause:

When loading the weights, the code attempts to access param = params_dict[name], but the name from the weight file does not exist in params_dict, resulting in a KeyError.

How can this issue be resolved?

Error stack trace:


File "/mnt/lwq/lwq/quant/vllm/vllm-main-debd6bb/vllm/executor/mp_distributed_executor.py", line 185, in _run_workers
    driver_worker_output = run_method(self.driver_worker, sent_method,
  File "/mnt/lwq/lwq/quant/vllm/vllm-main-debd6bb/vllm/utils.py", line 2238, in run_method
    return func(*args, **kwargs)
  File "/mnt/lwq/lwq/quant/vllm/vllm-main-debd6bb/vllm/worker/worker.py", line 183, in load_model
    self.model_runner.load_model()
  File "/mnt/lwq/lwq/quant/vllm/vllm-main-debd6bb/vllm/worker/model_runner.py", line 1113, in load_model
    self.model = get_model(vllm_config=self.vllm_config)
  File "/mnt/lwq/lwq/quant/vllm/vllm-main-debd6bb/vllm/model_executor/model_loader/__init__.py", line 14, in get_model
    return loader.load_model(vllm_config=vllm_config)
  File "/mnt/lwq/lwq/quant/vllm/vllm-main-debd6bb/vllm/model_executor/model_loader/loader.py", line 426, in load_model
    loaded_weights = model.load_weights(
  File "/mnt/lwq/lwq/quant/vllm/vllm-main-debd6bb/vllm/model_executor/models/deepseek_v2.py", line 790, in load_weights
    param = params_dict[name]
KeyError: 'model.layers.10.mlp.experts.w2_weight

My dynamic quantization settings are set to the default configuration from the gptqmodel homepage:

python

dynamic = {
# .*\. matches the layers_node prefix
# layer index starts at 0

# positive match: layer 19, gate module 
r"+:.*\.18\..*gate.*": {"bits": 4, "group_size": 32},  

# positive match: layer 20, gate module (prefix defaults to positive if missing)
r".*\.19\..*gate.*": {"bits": 8, "group_size": 64},  

# negative match: skip layer 21, gate module
r"-:.*\.20\..*gate.*": {}, 

# negative match: skip all down modules for all layers
r"-:.*down.*": {},

}

The config after quantization is:


{
  "_name_or_path": "/mnt/models/deepseek-ai/DeepSeek-V2-Lite-Chat/",
  "architectures": [
    "DeepseekV2ForCausalLM"
  ],
  "attention_bias": false,
  "attention_dropout": 0.0,
  "auto_map": {
    "AutoConfig": "configuration_deepseek.DeepseekV2Config",
    "AutoModel": "modeling_deepseek.DeepseekV2Model",
    "AutoModelForCausalLM": "modeling_deepseek.DeepseekV2ForCausalLM"
  },
  "aux_loss_alpha": 0.001,
  "bos_token_id": 100000,
  "eos_token_id": 100001,
  "ep_size": 1,
  "first_k_dense_replace": 1,
  "hidden_act": "silu",
  "hidden_size": 2048,
  "initializer_range": 0.02,
  "intermediate_size": 10944,
  "kv_lora_rank": 512,
  "max_position_embeddings": 163840,
  "model_type": "deepseek_v2",
  "moe_intermediate_size": 1408,
  "moe_layer_freq": 1,
  "n_group": 1,
  "n_routed_experts": 64,
  "n_shared_experts": 2,
  "norm_topk_prob": false,
  "num_attention_heads": 16,
  "num_experts_per_tok": 6,
  "num_hidden_layers": 27,
  "num_key_value_heads": 16,
  "pretraining_tp": 1,
  "q_lora_rank": null,
  "qk_nope_head_dim": 128,
  "qk_rope_head_dim": 64,
  "quantization_config": {
    "bits": 8,
    "checkpoint_format": "gptq",
    "desc_act": false,
    "dynamic": {
      "+:.*\\.18\\..*gate.*": {
        "bits": 4,
        "group_size": 32
      },
      "-:.*\\.20\\..*gate.*": {},
      "-:.*down.*": {},
      ".*\\.19\\..*gate.*": {
        "bits": 8,
        "group_size": 64
      }
    },
    "group_size": 64,
    "lm_head": false,
    "meta": {
      "damp_auto_increment": 0.0025,
      "damp_percent": 0.01,
      "mse": 0.0,
      "quantizer": [
        "gptqmodel:2.0.0-dev"
      ],
      "static_groups": false,
      "true_sequential": true,
      "uri": "https://github.com/modelcloud/gptqmodel"
    },
    "pack_dtype": "int32",
    "quant_method": "gptq",
    "sym": true
  },
  "rms_norm_eps": 1e-06,
  "rope_scaling": {
    "beta_fast": 32,
    "beta_slow": 1,
    "factor": 40,
    "mscale": 0.707,
    "mscale_all_dim": 0.707,
    "original_max_position_embeddings": 4096,
    "type": "yarn"
  },
  "rope_theta": 10000,
  "routed_scaling_factor": 1.0,
  "scoring_func": "softmax",
  "seq_aux": true,
  "tie_word_embeddings": false,
  "topk_group": 1,
  "topk_method": "greedy",
  "torch_dtype": "bfloat16",
  "transformers_version": "4.48.3",
  "use_cache": true,
  "v_head_dim": 128,
  "vocab_size": 102400
}

Qubitium · 2025-03-13T08:29:01Z

@liweiqing1997 Please open an issue at GPTQmodel.

please use latest gptqmodel buid from main
first test loading using GPTQModel.load api

I need to isolate issue to gptqmodel first before checking is vllm compat issue

liweiqing1997 · 2025-03-13T08:52:46Z

@Qubitium

GPTQModel.load is ok

…ule override/control (vllm-project#7086) Signed-off-by: Louis Ulmer <ulmerlouis@gmail.com>

…ule override/control (vllm-project#7086)

ZX-ModelCloud and others added 11 commits August 1, 2024 14:20

gptq_marlin compat dynamic_bits quantize config

f470b26

Merge branch 'main' into compat_dynamic_bits

c56e3de

Update gptq_marlin.py

502edb3

cleanup

18064cd

cleanup

1b132c3

cleanup

4b63754

cleanup

90258d2

cleanup

a5d3c8b

cleanup cleanup cleanup

Merge remote-tracking branch 'origin/compat_dynamic_bits' into compat…

c84793f

…_dynamic_bits

load "dynamic" field from config

5682124

fix key error: change "is_sym" to "sym"

d651668

Qubitium mentioned this pull request Aug 2, 2024

[FEATURE] Add dynamic suppor for AutoRound quantiztion ModelCloud/GPTQModel#329

Closed

ZX-ModelCloud added 4 commits August 6, 2024 02:56

Merge branch 'main' into compat_dynamic_bits

9a36694

Merge branch 'main' into compat_dynamic_bits

fbc594f

update quant_type

e9ae8f5

update

19d7772

mergify bot added the needs-rebase label Dec 24, 2024

Merge branch 'main' into compat_dynamic_bits

7057dbb

mergify bot removed the needs-rebase label Dec 24, 2024

ZX-ModelCloud and others added 7 commits December 24, 2024 16:15

fix judgment error

8565328

cleanup

84ada54

cleanup

e81a7da

cleanup

68291ce

cleanup

7867405

cleanup

c63ba51

Update gptq_marlin.py

5f9b712

ZX-ModelCloud added 3 commits February 11, 2025 08:28

Extract code to gptq_utils.get_linear_quant_method()

e3084e3

Signed-off-by: ZX-ModelCloud <zx@modelcloud.ai>

cleanup

25dbd5a

Signed-off-by: ZX-ModelCloud <zx@modelcloud.ai>

cleanup

874076c

Signed-off-by: ZX-ModelCloud <zx@modelcloud.ai>

Merge branch 'main' into compat_dynamic_bits

17704df

mgoin approved these changes Feb 11, 2025

View reviewed changes

vllm/model_executor/layers/quantization/utils/gptq_utils.py Outdated Show resolved Hide resolved

mgoin added quantization ready ONLY add when PR is ready to merge/full CI is needed labels Feb 11, 2025

do not use Fraction

c7f10be

Signed-off-by: ZX-ModelCloud <zx@modelcloud.ai>

mgoin added the force-merge label Feb 12, 2025

DarkLight1337 enabled auto-merge (squash) February 12, 2025 17:19

simon-mo merged commit 36a0863 into vllm-project:main Feb 12, 2025
37 of 40 checks passed

Qubitium deleted the compat_dynamic_bits branch February 12, 2025 17:25

lulmer pushed a commit to lulmer/vllm that referenced this pull request Apr 7, 2025

[CORE] [QUANT] Support for GPTQModel's dynamic quantization per mod…

7e85d57

…ule override/control (vllm-project#7086) Signed-off-by: Louis Ulmer <ulmerlouis@gmail.com>

shreyankg pushed a commit to shreyankg/vllm that referenced this pull request May 3, 2025

[CORE] [QUANT] Support for GPTQModel's dynamic quantization per mod…

63cda29

…ule override/control (vllm-project#7086)

Uh oh!

[CORE] [QUANT] Support for GPTQModel's dynamic quantization per module override/control #7086

[CORE] [QUANT] Support for GPTQModel's dynamic quantization per module override/control #7086

Uh oh!

Conversation

Qubitium commented Aug 2, 2024 • edited by github-actions bot Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

github-actions bot commented Aug 2, 2024

Uh oh!

mgoin commented Aug 5, 2024

Uh oh!

Qubitium commented Aug 5, 2024 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

mergify bot commented Dec 24, 2024

Uh oh!

Qubitium commented Feb 11, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

Qubitium commented Feb 11, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

Qubitium commented Feb 11, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

mgoin left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

mgoin commented Feb 12, 2025

Uh oh!

Uh oh!

liweiqing1997 commented Mar 13, 2025

Uh oh!

Qubitium commented Mar 13, 2025

Uh oh!

liweiqing1997 commented Mar 13, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

Qubitium commented Mar 13, 2025

Uh oh!

liweiqing1997 commented Mar 13, 2025

Uh oh!

Uh oh!

[CORE] [QUANT] Support for GPTQModel's `dynamic` quantization per module override/control #7086

[CORE] [QUANT] Support for GPTQModel's `dynamic` quantization per module override/control #7086

Qubitium commented Aug 2, 2024 •

edited by github-actions bot

Loading

Qubitium commented Aug 5, 2024 •

edited

Loading

Qubitium commented Feb 11, 2025 •

edited

Loading

Qubitium commented Feb 11, 2025 •

edited

Loading

Qubitium commented Feb 11, 2025 •

edited

Loading

liweiqing1997 commented Mar 13, 2025 •

edited

Loading