[Misc][Kernel]: Add GPTQAllSpark Quantization #12931
Conversation
Force-pushed from 7438d1a to c18d24b.
Looks pretty straightforward, thanks for the nice work! My main questions are about supported hardware beyond Ampere, and whether we could move this in as a mixed-precision kernel backend under vllm/model_executor/layers/quantization/kernels/mixed_precision/ rather than as a new quantization method.
Could you also run a full model eval with this method to check e2e accuracy? Here is an example with a GPTQ model on gsm8k:
pip install "lm_eval[api]==0.4.4"
lm_eval --model vllm --model_args pretrained=Qwen/Qwen2.5-1.5B-Instruct-GPTQ-Int8,quantization=gptq_marlin --trust_remote_code --tasks gsm8k --num_fewshot 5 --batch_size auto
...
Processed prompts: 100%|████████████████████████████████████████████████████████████████████| 1319/1319 [00:20<00:00, 64.37it/s, est. speed input: 63907.10 toks/s, output: 7886.44 toks/s]
Running generate_until requests: 100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 1319/1319 [00:21<00:00, 62.64it/s]
2025-02-14:19:10:51,817 INFO [evaluation_tracker.py:269] Output path not provided, skipping saving results aggregated
vllm (pretrained=Qwen/Qwen2.5-1.5B-Instruct-GPTQ-Int8,quantization=gptq_marlin,trust_remote_code=True), gen_kwargs: (None), limit: None, num_fewshot: 5, batch_size: auto
|Tasks|Version| Filter |n-shot| Metric | |Value | |Stderr|
|-----|------:|----------------|-----:|-----------|---|-----:|---|-----:|
|gsm8k| 3|flexible-extract| 5|exact_match|↑ |0.5936|± |0.0135|
| | |strict-match | 5|exact_match|↑ |0.5512|± |0.0137|
CMakeLists.txt (outdated diff)
@@ -297,6 +297,22 @@ if(VLLM_GPU_LANG STREQUAL "CUDA")
      " in CUDA target architectures")
  endif()

  # AllSpark kernels
  cuda_archs_loose_intersection(ALLSPARK_ARCHS "8.0;8.6;8.7;8.9;9.0" "${CUDA_ARCHS}")
Do we need to support all of these arches? For instance are SM89 and SM90 realistic targets?
    capability_tuple.to_int())

# For Ampere GPU
if device_capability >= 80 and device_capability < 90:
This still considers SM89 which is Ada Lovelace, is this intentional?
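For illustration, if the intent really is Ampere only, the check could exclude SM89 along these lines. This is a minimal sketch: the helper name is hypothetical, and the capability encoding (major * 10 + minor) follows the quoted diff.

def is_allspark_ampere_target(device_capability: int) -> bool:
    """Hypothetical helper: True only for Ampere parts (SM80/SM86/SM87),
    excluding Ada Lovelace (SM89) and Hopper (SM90)."""
    return device_capability in (80, 86, 87)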
Since you are targeting the GPTQ format, you could likely leverage the existing MPLinearKernel
abstraction to plug in GPTQAllSpark as a new possible kernel (see the original RFC for this design)
It could slot into the priority list of available kernels here:
vllm/vllm/model_executor/layers/quantization/kernels/mixed_precision/__init__.py
Lines 16 to 30 in c9e2d64
# in priority/performance order (when available)
_POSSIBLE_KERNELS: List[Type[MPLinearKernel]] = [
    MacheteLinearKernel,
    MarlinLinearKernel,
    ExllamaLinearKernel,
]


def choose_mp_linear_kernel(
        config: MPLinearLayerConfig,
        compute_capability: Optional[int] = None) -> Type[MPLinearKernel]:
    """
    Choose an MPLinearKernel that can implement the given config for the given
    compute capability. Attempts to choose the best kernel in terms of
    performance.
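As a sketch of how the new kernel might slot into that list; the class name AllSparkLinearKernel and its position in the priority order are assumptions, not the PR's final code:

# Hypothetical fragment of kernels/mixed_precision/__init__.py with the
# AllSpark kernel added. Imports of the kernel classes are as in the
# existing file; AllSparkLinearKernel is an assumed name.
_POSSIBLE_KERNELS: List[Type[MPLinearKernel]] = [
    MacheteLinearKernel,
    AllSparkLinearKernel,  # assumed to be preferred over Marlin where it applies
    MarlinLinearKernel,
    ExllamaLinearKernel,
]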
Which would then be used in gptq_marlin.py here to choose the kernel method:
vllm/vllm/model_executor/layers/quantization/gptq_marlin.py
Lines 223 to 234 in c9e2d64
mp_linear_kernel_config = MPLinearLayerConfig(
    full_weight_shape=(input_size, output_size),
    partition_weight_shape=\
        (input_size_per_partition, output_size_per_partition),
    weight_type=self.quant_config.quant_type,
    act_type=params_dtype,
    group_size=self.quant_config.group_size,
    zero_points=False,
    has_g_idx=self.quant_config.desc_act
)

kernel_type = choose_mp_linear_kernel(mp_linear_kernel_config)
A side benefit of this is it will also be enabled for 8bit models in the compressed-tensors format:
vllm/vllm/model_executor/layers/quantization/compressed_tensors/schemes/compressed_tensors_wNa16.py
Lines 71 to 82 in c9e2d64
mp_linear_kernel_config = MPLinearLayerConfig(
    full_weight_shape=(input_size, output_size),
    partition_weight_shape=\
        (input_size_per_partition, output_size_per_partition),
    weight_type=self.quant_type,
    act_type=params_dtype,
    group_size=self.group_size,
    zero_points=False,
    has_g_idx=self.has_g_idx
)

kernel_type = choose_mp_linear_kernel(mp_linear_kernel_config)
I realize this could be a bit too much work up front, but if you would be interested in moving to the new interface going forward I think it would help for the longevity of this kernel!
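For reference, a rough skeleton of what an AllSpark kernel behind the MPLinearKernel abstraction could look like. This is only a sketch: the base-class method names and import path are assumed from the existing mixed-precision kernels quoted above, the supported-config checks follow the limitations discussed in this thread (group_size=-1, no act-order), and the method bodies are stubs rather than the PR's actual implementation.

from typing import Optional, Tuple

import torch

# Import path assumed to mirror the existing mixed_precision kernels.
from vllm.model_executor.layers.quantization.kernels.mixed_precision.MPLinearKernel import (
    MPLinearKernel, MPLinearLayerConfig)


class AllSparkLinearKernel(MPLinearKernel):
    """Hypothetical sketch, not the PR's actual code."""

    @classmethod
    def get_min_capability(cls) -> int:
        return 80  # Ampere and newer

    @classmethod
    def can_implement(cls,
                      c: MPLinearLayerConfig) -> Tuple[bool, Optional[str]]:
        # Restrictions discussed in this thread: per-channel quantization
        # only, and no activation reordering (g_idx / desc_act).
        if c.group_size != -1:
            return False, "AllSpark only supports group_size == -1"
        if c.has_g_idx:
            return False, "AllSpark does not support act-order (g_idx)"
        return True, None

    def process_weights_after_loading(self, layer: torch.nn.Module) -> None:
        # Repack the GPTQ weights/scales into the layout expected by the
        # AllSpark A16W8 GEMM (details omitted).
        ...

    def apply_weights(self,
                      layer: torch.nn.Module,
                      x: torch.Tensor,
                      bias: Optional[torch.Tensor] = None) -> torch.Tensor:
        # Dispatch to the AllSpark A16W8 GEMM custom op (omitted here).
        ...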
Thanks for your detailed reply! Here are some additional points of explanation.
Qwen/Qwen2.5-1.5B-Instruct-GPTQ-Int8 is a sub-channel (grouped) quantization model, and the newly added AllSpark quantization kernel only supports group_size=-1, so the test above uses the Qwen2-7B-Instruct-quantized.w8a16 per-channel quantization model to check e2e accuracy instead.
Thank you for the response @wyajieha !
Thank you very much @mgoin! I will modify the code according to all the above comments and submit a new commit later.
Force-pushed from c8eae49 to 8fbff9f.
Signed-off-by: wyj371990 <wyj371990@alibaba-inc.com>
Force-pushed from 8fbff9f to 4007cd8.
Nicely integrated! I've enabled the full CI to run. I'm just a bit curious why you need the weight name for the kernel, if you could explain.
Signed-off-by: wyj371990 <wyj371990@alibaba-inc.com>
Signed-off-by: mgoin <mgoin64@gmail.com>
Hey @wyajieha, it seems this commit broke the CUDA 11.8 build due to lack of support for type conversions: https://buildkite.com/vllm/release/builds/3378/canvas?jid=01955067-903b-4dc6-84fb-878c7e3fb5ea#01955067-903b-4dc6-84fb-878c7e3fb5ea/114-2875. Is there a way to support this on 11.8, or should we just not build the kernels for CUDA < 12.0?
Apologies for the delay in noticing this. I will fix the issue as soon as possible.
Hi @mgoin, the compilation errors related to type conversions on CUDA 11.8 appear to stem from the activation of __CUDA_NO_HALF_OPERATORS__ and similar flags in cmake/utils.cmake (see https://github.com/vllm-project/vllm/blob/main/cmake/utils.cmake#L104). I observe that both the CUDA 11.8 and CUDA 12+ versions of cuda_fp16.hpp / cuda_bf16.hpp contain definitions for type conversions and arithmetic operations involving the half and nv_bfloat16 types. Could you clarify the rationale behind vLLM's decision to disable these conversion operators specifically for CUDA versions prior to 12.0? In experimental modifications, undefining these flags in allspark_qgemm_w8a16.cu and allspark_utils.cuh compiles successfully under CUDA 11.8. Would this approach be considered safe and acceptable for production use? I'm particularly interested in understanding any compatibility risks or functional limitations this modification might introduce in CUDA 11.8 environments.
Signed-off-by: Louis Ulmer <ulmerlouis@gmail.com>
This PR mainly adds Ampere-specific optimizations for A16W8 quantization, supporting GPTQ-quantized models in the scenario where group_size=-1 and desc_act is False; in this scenario its performance is better than Marlin.
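As a usage sketch (not taken from the PR itself), the new path can then be selected from the offline Python API roughly as follows; the local model path and prompt are placeholders, and the gptq_allspark method name follows the benchmark command further below.

from vllm import LLM, SamplingParams

# Sketch: run a per-channel (group_size=-1) GPTQ W8A16 model with the
# AllSpark kernels. The model path is a placeholder.
llm = LLM(model="Qwen2-7B-Instruct-quantized.w8a16",
          quantization="gptq_allspark",
          dtype="float16")

outputs = llm.generate(["Explain W8A16 quantization in one sentence."],
                       SamplingParams(temperature=0.0, max_tokens=64))
print(outputs[0].outputs[0].text)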
An operator-level performance comparison (Marlin vs. AllSpark) can be run with the following command:
python3 benchmarks/kernels/benchmark_marlin.py --limit-num-bits 8 --limit-act-order 0 --limit-k-full 1 --limit-group-size -1
The following figure shows the performance comparison of Marlin vs. AllSpark for common GEMM shapes in the model under different M settings on an A100 GPU. The blue line shows the speedup of Marlin A16W8 GEMM over Torch FP16 GEMM, and the orange line shows the speedup of AllSpark A16W8 GEMM over Torch FP16 GEMM. When N and K are small and M is large, AllSpark performs significantly better than Marlin; in other scenarios the performance is basically the same.
Use the following command to run a throughput test on the Qwen2-7B-Instruct-quantized.w8a16 model on a single A100 card:
CUDA_VISIBLE_DEVICES=1 python3 benchmarks/benchmark_throughput.py --backend=vllm --model Qwen2-7B-Instruct-quantized.w8a16/ --quantization gptq_allspark (or gptq_marlin) --input-len 2048 --output-len 256 --num-prompts=1000 --trust-remote-code --dtype=float16 --kv-cache-dtype=auto --device=cuda
The end-to-end performance results for the full model are as follows: