[CORE] [QUANT] Support for GPTQModel's dynamic quantization per module override/control #7086
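Context for this feature: GPTQModel's quantize config may carry a "dynamic" field that maps regex patterns over module names to per-module overrides of the base quantization settings, and a "-:"-prefixed pattern excludes the matched modules from quantization altogether. The override keys read in the diff further down are "bits", "group_size", "desc_act", and "sym". A minimal sketch of such a config follows (the module-name patterns and override values are illustrative, not taken from this PR):

# Hypothetical GPTQModel-style quantize config with a per-module "dynamic"
# section. The keys mirror what dynamic_get() reads in this PR; the module
# patterns and override values are made-up examples.
quantize_config = {
    "bits": 4,
    "group_size": 128,
    "desc_act": True,
    "sym": True,
    "dynamic": {
        # Positive match: quantize MLP down-projections at 8 bit / group 64.
        r".*\.mlp\.down_proj.*": {"bits": 8, "group_size": 64},
        # Negative match ("-:" prefix): leave matching modules unquantized;
        # vLLM then falls back to UnquantizedLinearMethod for them.
        r"-:.*lm_head.*": {},
    },
}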

Merged: 69 commits, merged Feb 12, 2025. Showing changes from 1 commit.

Commits (69)
f470b26  gptq_marlin compat dynamic_bits quantize config (ZX-ModelCloud, Aug 1, 2024)
c56e3de  Merge branch 'main' into compat_dynamic_bits (ZX-ModelCloud, Aug 1, 2024)
502edb3  Update gptq_marlin.py (Qubitium, Aug 2, 2024)
18064cd  cleanup (ZX-ModelCloud, Aug 2, 2024)
1b132c3  cleanup (ZX-ModelCloud, Aug 2, 2024)
4b63754  cleanup (ZX-ModelCloud, Aug 2, 2024)
90258d2  cleanup (ZX-ModelCloud, Aug 2, 2024)
a5d3c8b  cleanup (ZX-ModelCloud, Aug 2, 2024)
c84793f  Merge remote-tracking branch 'origin/compat_dynamic_bits' into compat… (ZX-ModelCloud, Aug 2, 2024)
5682124  load "dynamic" field from config (ZX-ModelCloud, Aug 2, 2024)
d651668  fix key error: change "is_sym" to "sym" (ZX-ModelCloud, Aug 2, 2024)
9a36694  Merge branch 'main' into compat_dynamic_bits (ZX-ModelCloud, Aug 6, 2024)
fbc594f  Merge branch 'main' into compat_dynamic_bits (ZX-ModelCloud, Aug 6, 2024)
e9ae8f5  update quant_type (ZX-ModelCloud, Aug 6, 2024)
19d7772  update (ZX-ModelCloud, Dec 24, 2024)
7057dbb  Merge branch 'main' into compat_dynamic_bits (ZX-ModelCloud, Dec 24, 2024)
8565328  fix judgment error (ZX-ModelCloud, Dec 24, 2024)
84ada54  cleanup (ZX-ModelCloud, Dec 24, 2024)
e81a7da  cleanup (ZX-ModelCloud, Dec 24, 2024)
68291ce  cleanup (ZX-ModelCloud, Dec 24, 2024)
7867405  cleanup (ZX-ModelCloud, Dec 24, 2024)
c63ba51  cleanup (ZX-ModelCloud, Dec 24, 2024)
5f9b712  Update gptq_marlin.py (Qubitium, Dec 24, 2024)
3692578  Update gptq_marlin.py (Qubitium, Dec 24, 2024)
f902b2d  cleanup (ZX-ModelCloud, Dec 24, 2024)
a570509  Merge remote-tracking branch 'origin/compat_dynamic_bits' into compat… (ZX-ModelCloud, Dec 24, 2024)
9b9d7e3  Update gptq_marlin.py (Qubitium, Dec 24, 2024)
0559137  cleanup (ZX-ModelCloud, Dec 24, 2024)
b29a094  Merge remote-tracking branch 'origin/compat_dynamic_bits' into compat… (ZX-ModelCloud, Dec 24, 2024)
3a2bb94  cleanup (ZX-ModelCloud, Dec 24, 2024)
3c0d45a  cleanup (ZX-ModelCloud, Dec 24, 2024)
74b1d42  add test_gptq_dynamic_cfg.py (ZX-ModelCloud, Dec 24, 2024)
b0672ae  cleanup (ZX-ModelCloud, Dec 24, 2024)
066f489  Update test_gptq_dynamic_cfg.py (Qubitium, Dec 24, 2024)
6dc56a6  Update test_gptq_dynamic_cfg.py (Qubitium, Dec 24, 2024)
98a198e  cleanup (ZX-ModelCloud, Dec 24, 2024)
b2861d8  Merge remote-tracking branch 'origin/compat_dynamic_bits' into compat… (ZX-ModelCloud, Dec 24, 2024)
c4a29eb  use PROMPT variable (ZX-ModelCloud, Dec 24, 2024)
25703e3  cleanup (ZX-ModelCloud, Dec 24, 2024)
1fd690e  Merge branch 'main' into compat_dynamic_bits (ZX-ModelCloud, Jan 8, 2025)
4f48d1b  Merge branch 'main' into compat_dynamic_bits (ZX-ModelCloud, Feb 6, 2025)
070ae3c  rename method and add detailed comments (Qubitium, Feb 6, 2025)
13b2b7b  Changed VocabParallelEmbedding.linear_method to quant_method to be co… (ZX-ModelCloud, Feb 7, 2025)
6850e6d  Merge remote-tracking branch 'origin/compat_dynamic_bits' into compat… (ZX-ModelCloud, Feb 7, 2025)
40562d1  fix unittest (ZX-ModelCloud, Feb 7, 2025)
7b774bb  cleanup (ZX-ModelCloud, Feb 7, 2025)
c72125a  cleanup (ZX-ModelCloud, Feb 7, 2025)
c298195  cleanup (ZX-ModelCloud, Feb 7, 2025)
bbc049d  Update gptq_marlin.py (Qubitium, Feb 7, 2025)
78f8818  format (ZX-ModelCloud, Feb 7, 2025)
2cfec63  Merge branch 'main' into compat_dynamic_bits (ZX-ModelCloud, Feb 7, 2025)
93ee576  Update gptq_marlin.py (Qubitium, Feb 7, 2025)
6ebf85c  rename to parallel_lm_head_quantized for clarity (Qubitium, Feb 7, 2025)
59bdf54  simplify (Qubitium, Feb 7, 2025)
9de0382  shorten code (Qubitium, Feb 7, 2025)
67d0882  cleanup (ZX-ModelCloud, Feb 7, 2025)
5623936  cleanup (ZX-ModelCloud, Feb 7, 2025)
e41bdd7  make lint pass (Qubitium, Feb 7, 2025)
965d7da  change model_id (ZX-ModelCloud, Feb 11, 2025)
1a34027  format (ZX-ModelCloud, Feb 11, 2025)
0b249a1  format code (ZX-ModelCloud, Feb 11, 2025)
4de04ae  format code (ZX-ModelCloud, Feb 11, 2025)
4c0608b  format code (ZX-ModelCloud, Feb 11, 2025)
8f21375  disable E712 ruff check (ZX-ModelCloud, Feb 11, 2025)
e3084e3  Extract code to gptq_utils.get_linear_quant_method() (ZX-ModelCloud, Feb 11, 2025)
25dbd5a  cleanup (ZX-ModelCloud, Feb 11, 2025)
874076c  cleanup (ZX-ModelCloud, Feb 11, 2025)
17704df  Merge branch 'main' into compat_dynamic_bits (ZX-ModelCloud, Feb 11, 2025)
c7f10be  do not use Fraction (ZX-ModelCloud, Feb 12, 2025)
Commit 856532804685b47661708d352c4850979f047ee7: fix judgment error
ZX-ModelCloud committed Dec 24, 2024
6 changes: 3 additions & 3 deletions in vllm/model_executor/layers/quantization/gptq_marlin.py
@@ -10,7 +10,7 @@
from vllm.model_executor.layers.fused_moe.layer import (
    FusedMoE, FusedMoEMethodBase, FusedMoeWeightScaleSupported)
from vllm.model_executor.layers.linear import (LinearBase, LinearMethodBase,
                                               set_weight_attrs, UnquantizedLinearMethod)
from vllm.model_executor.layers.quantization.base_config import (
    QuantizationConfig)
from vllm.model_executor.layers.quantization.kernels import (

CI annotations on this range:
- line 13: Ruff (E501): Line too long (89 > 80)
@@ -73,10 +73,10 @@
        bits = self.weight_bits
        # check for variable/dynamic config
        if self.dynamic and len(self.dynamic) > 0 and prefix:
            bits = self.dynamic_get(prefix, "bits", bits)
            self.group_size = self.dynamic_get(prefix, "group_size", self.group_size)
            self.desc_act = self.dynamic_get(prefix, "desc_act", self.desc_act)
            self.is_sym = self.dynamic_get(prefix, "sym", self.is_sym)

        self.pack_factor = 32 // bits  # packed into int32
        if (bits, self.is_sym) not in self.TYPE_MAP:

CI annotations on this range:
- line 76: mypy (3.9-3.12): Incompatible types in assignment (expression has type "dict[Any, Any] | int", variable has type "int") [assignment]
- line 77: mypy (3.9-3.12): same incompatible-assignment error as line 76; Ruff (E501): Line too long (85 > 80)
- line 78: mypy (3.9-3.12): Incompatible types in assignment (expression has type "dict[Any, Any] | int", variable has type "bool") [assignment]
- line 79: mypy (3.9-3.12): Incompatible types in assignment (expression has type "dict[Any, Any] | int", variable has type "bool") [assignment]
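To make the per-module override above concrete, here is a small self-contained sketch (not vLLM code; the pattern and prefixes are invented) of how an overridden bit width changes the int32 pack factor computed by pack_factor = 32 // bits:

import re
from typing import Any, Dict

# Toy stand-ins mirroring the hunk above; not the vLLM implementation.
DYNAMIC: Dict[str, Dict[str, Any]] = {
    r".*down_proj.*": {"bits": 8},   # override: 8-bit for down_proj layers
}

def resolve_bits(prefix: str, default_bits: int = 4) -> int:
    """Return the per-module bit width, falling back to the global default."""
    for pattern, overrides in DYNAMIC.items():
        if re.match(pattern, prefix):
            return overrides.get("bits", default_bits)
    return default_bits

for prefix in ("model.layers.0.mlp.down_proj", "model.layers.0.self_attn.q_proj"):
    bits = resolve_bits(prefix)
    pack_factor = 32 // bits  # how many quantized weights fit in one int32
    print(prefix, bits, pack_factor)  # down_proj -> 8, 4; q_proj -> 4, 8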
@@ -141,7 +141,7 @@
                            " faster inference")
        return None

    def dynamic_get(self, layer_name: str, key: str = None, default_value: Union[int, bool] = None) -> Union[Dict, int, bool]:
        for pattern, pattern_dict in self.dynamic.items():
            if pattern.startswith("-:"):
                if re.match(pattern.removeprefix("-:"), layer_name):

CI annotations on this range:
- line 144: mypy (3.9-3.12): Incompatible default for argument "key" (default has type "None", argument has type "str") [assignment]; Incompatible default for argument "default_value" (default has type "None", argument has type "int | bool") [assignment]; Ruff (E501): Line too long (126 > 80)
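The hunk above cuts off inside dynamic_get(); below is a minimal sketch of the lookup behavior it implies, assuming that a "-:" match returns False (which get_quant_method later compares against), that a positive match returns either the whole override dict or the requested key, and that default_value is the fallback. This is an illustration of the logic, not the exact vLLM/GPTQModel source:

import re
from typing import Dict, Optional, Union

def dynamic_get_sketch(
    dynamic: Dict[str, Dict[str, Union[int, bool]]],
    layer_name: str,
    key: Optional[str] = None,
    default_value: Union[int, bool, None] = None,
) -> Union[Dict[str, Union[int, bool]], int, bool, None]:
    """Illustrative re-implementation of the per-module lookup.

    Returns False when the layer is excluded by a "-:" pattern, the whole
    override dict when key is None, or the override value for `key`
    (falling back to `default_value`).
    """
    for pattern, pattern_dict in dynamic.items():
        if pattern.startswith("-:"):
            if re.match(pattern.removeprefix("-:"), layer_name):
                return False  # explicitly excluded from quantization
        elif re.match(pattern, layer_name):
            if key is None:
                return pattern_dict
            return pattern_dict.get(key, default_value)
    return default_value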
@@ -156,12 +156,12 @@
    def get_quant_method(
            self, layer: torch.nn.Module, prefix: str
    ) -> Optional[Union["GPTQMarlinLinearMethod", "GPTQMarlinMoEMethod", UnquantizedLinearMethod]]:
-       if self.dynamic and self.dynamic_get(layer_name=prefix) == False:  # noqa: E712
-           return UnquantizedLinearMethod()
-
        if isinstance(layer, LinearBase) or (isinstance(layer, ParallelLMHead)
                                             and self.lm_head_quantized):
+           if self.dynamic and self.dynamic_get(layer_name=prefix) == False:  # noqa: E712
+               return UnquantizedLinearMethod()
+
            return GPTQMarlinLinearMethod(self, prefix=prefix)
        elif isinstance(layer, FusedMoE):
            return GPTQMarlinMoEMethod(self)

CI annotations on this range:
- line 159: Ruff (E501): Line too long (99 > 80)
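The change itself: the excluded-module check used to run before the layer-type dispatch and therefore applied to every layer; this commit moves it inside the LinearBase / quantized ParallelLMHead branch, so the FusedMoE path is no longer short-circuited. A toy dispatch sketch of the post-fix control flow (the Layer classes and the exclusion pattern are stand-ins, not vLLM types):

import re

class LinearLayer: ...
class LMHead: ...
class MoELayer: ...

EXCLUDE = [r"-:.*lm_head.*"]  # hypothetical "-:" exclusion patterns

def excluded(prefix: str) -> bool:
    return any(re.match(p.removeprefix("-:"), prefix) for p in EXCLUDE)

def pick_method(layer: object, prefix: str) -> str:
    if isinstance(layer, (LinearLayer, LMHead)):
        if excluded(prefix):
            return "UnquantizedLinearMethod"  # skip quantization for this module
        return "GPTQMarlinLinearMethod"
    if isinstance(layer, MoELayer):
        return "GPTQMarlinMoEMethod"  # MoE path unaffected by the exclusion check
    return "no quant method"

print(pick_method(LMHead(), "lm_head"))                     # UnquantizedLinearMethod
print(pick_method(LinearLayer(), "model.layers.0.q_proj"))  # GPTQMarlinLinearMethod
print(pick_method(MoELayer(), "model.layers.0.mlp"))        # GPTQMarlinMoEMethod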