
[Misc][Kernel]: Add GPTQAllSpark Quantization #12931


Merged: 5 commits merged into vllm-project:main on Mar 1, 2025

Conversation

Contributor

@wyajieha commented Feb 8, 2025

This PR adds Ampere-specific optimizations for A16W8 quantization, supporting GPTQ-quantized models in the scenario where group_size=-1 and desc_act is False; in this scenario its performance is better than Marlin.

Kernel performance (Marlin vs. AllSpark) can be compared with the following command:
python3 benchmarks/kernels/benchmark_marlin.py --limit-num-bits 8 --limit-act-order 0 --limit-k-full 1 --limit-group-size -1

The following figure compares Marlin and AllSpark under different M settings for common GEMM shapes of the model on an A100 GPU. The blue line shows the speedup of Marlin A16W8 GEMM over Torch FP16 GEMM, and the orange line shows the speedup of AllSpark A16W8 GEMM over Torch FP16 GEMM. When N and K are small and M is large, AllSpark performs significantly better than Marlin; in other scenarios the performance is essentially the same.

[Figure: speedup of Marlin A16W8 GEMM (blue) and AllSpark A16W8 GEMM (orange) over Torch FP16 GEMM for different M on A100]
Use the following command to run a throughput test with the Qwen2-7B-Instruct-quantized.w8a16 model on a single A100:
CUDA_VISIBLE_DEVICES=1 python3 benchmarks/benchmark_throughput.py --backend=vllm --model Qwen2-7B-Instruct-quantized.w8a16/ --quantization gptq_allspark(or gptq_marlin) --input-len 2048 --output-len 256 --num-prompts=1000 --trust-remote-code --dtype=float16 --kv-cache-dtype=auto --device=cuda

The end-to-end (whole-model) performance results are as follows:

| Metric     | Marlin A16W8 | AllSpark A16W8    |
|------------|--------------|-------------------|
| QPS        | 4.25         | 5.01 (+17.8%)     |
| TPS        | 9797.44      | 11551.60 (+17.9%) |
| Output TPS | 1088.60      | 1283.51 (+17.9%)  |
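
For reference, here is a minimal offline-inference sketch that enables the new method through vLLM's standard quantization argument (the local model path mirrors the benchmark command above and is an assumption about your environment):

```python
from vllm import LLM, SamplingParams

# Assumed local checkpoint path (same as in the benchmark command above);
# any GPTQ W8A16 model with group_size=-1 and desc_act=False should exercise
# the AllSpark path on Ampere GPUs.
llm = LLM(model="Qwen2-7B-Instruct-quantized.w8a16/",
          quantization="gptq_allspark",
          dtype="float16")

outputs = llm.generate(["What is A16W8 quantization?"],
                       SamplingParams(temperature=0.0, max_tokens=64))
print(outputs[0].outputs[0].text)
```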


github-actions bot commented Feb 8, 2025

👋 Hi! Thank you for contributing to the vLLM project.

💬 Join our developer Slack at https://slack.vllm.ai to discuss your PR in #pr-reviews, coordinate on features in #feat- channels, or join special interest groups in #sig- channels.

Just a reminder: PRs do not trigger a full CI run by default. Instead, only the fastcheck CI runs, which covers a small, essential subset of CI tests to quickly catch errors. You can run other CI tests on top of those by going to your fastcheck build on the Buildkite UI (linked in the PR checks section) and unblocking them. If you do not have permission to unblock, ping simon-mo or khluu to add you to our Buildkite org.

Once the PR is approved and ready to go, your PR reviewer(s) can run CI to test the changes comprehensively before merging.

To run CI, PR reviewers can either add the ready label to the PR or enable auto-merge.

🚀

@mergify bot added the ci/build label Feb 8, 2025
@wyajieha force-pushed the github-yajie-as branch 4 times, most recently from 7438d1a to c18d24b on February 11, 2025 09:34
Member

@mgoin left a comment

Looks pretty straightforward, thanks for the nice work! My main questions are about hardware support for non-Ampere GPUs, and whether we could move this into a mixed-precision kernel backend in vllm/model_executor/layers/quantization/kernels/mixed_precision/ rather than adding a new quantization method.

Could you also run a full model eval with this method to check e2e accuracy? Here is an example with a gptq model on gsm8k

pip install "lm_eval[api]==0.4.4"
lm_eval --model vllm --model_args pretrained=Qwen/Qwen2.5-1.5B-Instruct-GPTQ-Int8,quantization=gptq_marlin --trust_remote_code --tasks gsm8k --num_fewshot 5 --batch_size auto
...
Processed prompts: 100%|████████████████████████████████████████████████████████████████████| 1319/1319 [00:20<00:00, 64.37it/s, est. speed input: 63907.10 toks/s, output: 7886.44 toks/s]
Running generate_until requests: 100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 1319/1319 [00:21<00:00, 62.64it/s]
2025-02-14:19:10:51,817 INFO     [evaluation_tracker.py:269] Output path not provided, skipping saving results aggregated
vllm (pretrained=Qwen/Qwen2.5-1.5B-Instruct-GPTQ-Int8,quantization=gptq_marlin,trust_remote_code=True), gen_kwargs: (None), limit: None, num_fewshot: 5, batch_size: auto
|Tasks|Version|     Filter     |n-shot|  Metric   |   |Value |   |Stderr|
|-----|------:|----------------|-----:|-----------|---|-----:|---|-----:|
|gsm8k|      3|flexible-extract|     5|exact_match|↑  |0.5936|±  |0.0135|
|     |       |strict-match    |     5|exact_match|↑  |0.5512|±  |0.0137|

CMakeLists.txt Outdated
@@ -297,6 +297,22 @@ if(VLLM_GPU_LANG STREQUAL "CUDA")
" in CUDA target architectures")
endif()

# AllSpark kernels
cuda_archs_loose_intersection(ALLSPARK_ARCHS "8.0;8.6;8.7;8.9;9.0" "${CUDA_ARCHS}")
Member

Do we need to support all of these arches? For instance, are SM89 and SM90 realistic targets?

capability_tuple.to_int())

# For Ampere GPU
if device_capability >= 80 and device_capability < 90:
Member

This still includes SM89, which is Ada Lovelace; is this intentional?

Member

Since you are targeting the GPTQ format, you could likely leverage the existing MPLinearKernel abstraction to plug in GPTQAllSpark as a new possible kernel (see the original RFC for this design)

It could slot into the priority list of available kernels here:

# in priority/performance order (when available)
_POSSIBLE_KERNELS: List[Type[MPLinearKernel]] = [
    MacheteLinearKernel,
    MarlinLinearKernel,
    ExllamaLinearKernel,
]

def choose_mp_linear_kernel(
        config: MPLinearLayerConfig,
        compute_capability: Optional[int] = None) -> Type[MPLinearKernel]:
    """
    Choose an MPLinearKernel that can implement the given config for the given
    compute capability. Attempts to choose the best kernel in terms of
    performance.
    """

This config would then be used in gptq_marlin.py to choose the kernel:

mp_linear_kernel_config = MPLinearLayerConfig(
    full_weight_shape=(input_size, output_size),
    partition_weight_shape=\
        (input_size_per_partition, output_size_per_partition),
    weight_type=self.quant_config.quant_type,
    act_type=params_dtype,
    group_size=self.quant_config.group_size,
    zero_points=False,
    has_g_idx=self.quant_config.desc_act
)
kernel_type = choose_mp_linear_kernel(mp_linear_kernel_config)

A side benefit is that it will also be enabled for 8-bit models in the compressed-tensors format:

mp_linear_kernel_config = MPLinearLayerConfig(
    full_weight_shape=(input_size, output_size),
    partition_weight_shape=\
        (input_size_per_partition, output_size_per_partition),
    weight_type=self.quant_type,
    act_type=params_dtype,
    group_size=self.group_size,
    zero_points=False,
    has_g_idx=self.has_g_idx
)
kernel_type = choose_mp_linear_kernel(mp_linear_kernel_config)

I realize this could be a bit too much work up front, but if you would be interested in moving to the new interface going forward, I think it would help the longevity of this kernel!
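
As a rough illustration of the suggestion above, the new kernel could advertise its narrow support window through the same can_implement-style check the existing kernels use. This is a sketch only: the class name, import path, and config field names are assumptions based on the snippets quoted above and may differ from the eventual implementation.

```python
from typing import Optional, Tuple

# Import path assumed from the mixed_precision directory referenced above.
from vllm.model_executor.layers.quantization.kernels.mixed_precision import (
    MPLinearKernel, MPLinearLayerConfig)


class AllSparkLinearKernel(MPLinearKernel):  # hypothetical class name

    @classmethod
    def get_min_capability(cls) -> int:
        # Requires at least Ampere (SM 80).
        return 80

    @classmethod
    def can_implement(
            cls, c: MPLinearLayerConfig) -> Tuple[bool, Optional[str]]:
        # AllSpark only covers 8-bit weights, per-channel scales
        # (group_size == -1), no act-order reordering, and no zero points.
        if c.weight_type.size_bits != 8:
            return False, f"unsupported weight bits: {c.weight_type.size_bits}"
        if c.group_size != -1:
            return False, f"unsupported group_size: {c.group_size}"
        if c.has_g_idx:
            return False, "desc_act / g_idx is not supported"
        if c.zero_points:
            return False, "zero points are not supported"
        return True, None
```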

Contributor Author

@wyajieha commented Feb 17, 2025

Thanks for your detailed reply! Here are some additional points of explanation.

  1. Here are full-model evaluation results with the Qwen2-7B-Instruct-quantized.w8a16 model, using the newly added gptq_allspark method and the original gptq_marlin method on the gsm8k dataset.
lm_eval --model vllm --model_args pretrained=Qwen2-7B-Instruct-quantized.w8a16,quantization=gptq_allspark --trust_remote_code --tasks gsm8k --num_fewshot 5 --batch_size auto
Processed prompts: 100%|███████████████████████████████████| 1319/1319 [02:07<00:00, 10.35it/s, est. speed input: 10271.53 toks/s, output: 1478.67 toks/s]
Running generate_until requests: 100%|██████████████████████| 1319/1319 [02:07<00:00, 10.32it/s]
2025-02-17:17:06:03,925 INFO     [evaluation_tracker.py:269] Output path not provided, skipping saving results aggregated
vllm (pretrained=Qwen2-7B-Instruct-quantized.w8a16,quantization=gptq_allspark,trust_remote_code=True), gen_kwargs: (None), limit: None, num_fewshot: 5, batch_size: auto
|Tasks|Version|     Filter     |n-shot|  Metric   |   |Value |   |Stderr|
|-----|------:|----------------|-----:|-----------|---|-----:|---|-----:|
|gsm8k|      3|flexible-extract|     5|exact_match|↑  |0.7589|±  |0.0118|
|     |       |strict-match    |     5|exact_match|↑  |0.6823|±  |0.0128|
lm_eval --model vllm --model_args pretrained=Qwen2-7B-Instruct-quantized.w8a16,quantization=gptq_marlin --trust_remote_code --tasks gsm8k --num_fewshot 5 --batch_size auto
Processed prompts: 100%|████████████████████████████████████| 1319/1319 [02:25<00:00,  9.10it/s, est. speed input: 9030.47 toks/s, output: 1304.16 toks/s]
Running generate_until requests: 100%|████████████████████████| 1319/1319 [02:25<00:00,  9.08it/s]
2025-02-17:17:15:38,819 INFO     [evaluation_tracker.py:269] Output path not provided, skipping saving results aggregated
vllm (pretrained=Qwen2-7B-Instruct-quantized.w8a16,quantization=gptq_marlin,trust_remote_code=True), gen_kwargs: (None), limit: None, num_fewshot: 5, batch_size: auto
|Tasks|Version|     Filter     |n-shot|  Metric   |   |Value |   |Stderr|
|-----|------:|----------------|-----:|-----------|---|-----:|---|-----:|
|gsm8k|      3|flexible-extract|     5|exact_match|↑  |0.7597|±  |0.0118|
|     |       |strict-match    |     5|exact_match|↑  |0.6823|±  |0.0128|

Qwen/Qwen2.5-1.5B-Instruct-GPTQ-Int8 is a sub-channel (grouped) quantization model, while the newly added AllSpark quantization kernel only supports group_size=-1, so the test above uses the per-channel Qwen2-7B-Instruct-quantized.w8a16 model to check e2e accuracy (a toy per-channel sketch follows at the end of this comment).

  2. The hardware support of the newly added quantization method is as follows: it is fully optimized for Ampere-class GPUs (SM80 through SM89, inclusive), it can run on Hopper GPUs (SM90) but with very poor performance, and devices below SM80 are not supported. Thank you for the reminder; the cuda_archs_loose_intersection restriction does need the 9.0 option removed.

  3. Your suggestion is indeed very reasonable; we could move this to a mixed-precision kernel backend in vllm/model_executor/layers/quantization/kernels/mixed_precision/ rather than a new quantization method. However, there is a question here. The AllSpark implementation introduced by this PR and the MarlinLinearKernel implementation both target the Ampere architecture; the difference is that the AllSpark kernel only supports bits=8, group_size=-1, and desc_act=False. If we leverage the existing MPLinearKernel abstraction to plug in GPTQAllSpark as a new kernel, AllSparkLinearKernel, could its priority be placed before MarlinLinearKernel so that it serves as a further-optimized implementation for the bits=8, group_size=-1, desc_act=False case? Otherwise, AllSparkLinearKernel would never be called.
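
To make the group_size=-1 constraint from point 1 concrete, here is a toy NumPy illustration of per-channel symmetric INT8 weight quantization; it is illustrative only and not the kernel's actual quantization code.

```python
import numpy as np

# Toy per-channel (group_size = -1) symmetric INT8 weight quantization:
# a single scale per output channel covers the entire K dimension.
rng = np.random.default_rng(0)
W = rng.standard_normal((128, 64)).astype(np.float32)   # (K, N)

scales = np.abs(W).max(axis=0) / 127.0                   # one scale per column
W_q = np.clip(np.round(W / scales), -127, 127).astype(np.int8)
W_dq = W_q.astype(np.float32) * scales                   # A16W8 dequant at runtime

print("max abs reconstruction error:", np.abs(W - W_dq).max())

# With a sub-channel scheme (e.g. group_size = 128), there would instead be one
# scale per 128-row group per column, i.e. scales of shape (K // 128, N).
```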

Member

@mgoin commented Feb 17, 2025

Thank you for the response @wyajieha !

  1. Your evals look good, and we see the throughput improvement! ✅
  2. Appreciate the clarity on support. I think it is fine to support SM 80-89 then. We should be cognizant of binary size, but since you don't have many permutations of kernel configs I think it should be small for this kernel.
  3. Yes, what you are suggesting was my assumption when proposing the idea. The priority order will be Machete, AllSpark, Marlin, Exllama, where AllSpark will be selected in the supported case of bits=8, group_size=-1, and desc_act=False (except on SM 90, where Machete will be chosen). I would greatly appreciate it if you could take the path of using the existing MPLinearKernel abstraction, as we are investing in it going forward and more users will see the benefit of your kernel by default (a quick sketch of the resulting priority list follows below).
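
For concreteness, a minimal sketch of the resulting priority list, extending the snippet quoted earlier in the review (AllSparkLinearKernel is an assumed class name; imports of the kernel classes are omitted):

```python
from typing import List, Type

# in priority/performance order (when available)
_POSSIBLE_KERNELS: List[Type[MPLinearKernel]] = [
    MacheteLinearKernel,    # preferred on SM90 (Hopper)
    AllSparkLinearKernel,   # assumed name; selected for bits=8, group_size=-1,
                            # desc_act=False on SM 80-89
    MarlinLinearKernel,
    ExllamaLinearKernel,
]
```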

Contributor Author

@wyajieha

Thank you very much @mgoin! I will modify the code according to all the above comments and submit a new commit later.

@wyajieha force-pushed the github-yajie-as branch 4 times, most recently from c8eae49 to 8fbff9f on February 23, 2025 17:07
Signed-off-by: wyj371990 <wyj371990@alibaba-inc.com>
@mgoin added the quantization and ready labels Feb 26, 2025
Member

@mgoin left a comment

Nicely integrated! I've enabled the full CI to run. I'm just a bit curious why you need the weight name for the kernel, if you could explain.

wyajieha and others added 2 commits February 26, 2025 20:08
Signed-off-by: mgoin <mgoin64@gmail.com>
@mgoin enabled auto-merge (squash) February 28, 2025 02:27
@vllm-bot merged commit 6a92ff9 into vllm-project:main Mar 1, 2025
59 of 61 checks passed
Member

@mgoin commented Mar 2, 2025

Hey @wyajieha, it seems this commit broke the CUDA 11.8 build due to lack of support for type conversions: https://buildkite.com/vllm/release/builds/3378/canvas?jid=01955067-903b-4dc6-84fb-878c7e3fb5ea#01955067-903b-4dc6-84fb-878c7e3fb5ea/114-2875

Is there a way to support this on 11.8 or should we just not build the kernels for < CUDA 12.0?

Contributor Author

@wyajieha commented Mar 5, 2025

Apologies for the delay in noticing this. I will fix the issue as soon as possible.

Contributor Author

@wyajieha commented Mar 6, 2025

Hi @mgoin, the compilation error related to type conversions in CUDA 11.8 appears to stem from the activation of CUDA_NO_HALF_OPERATORS and similar flags in cmake/utils.cmake (please see https://github.com/vllm-project/vllm/blob/main/cmake/utils.cmake#L104). I observe that both the CUDA 11.8 and CUDA 12+ versions of cuda_fp16.hpp / cuda_bf16.hpp contain definitions for type conversions and arithmetic operations involving half and nv_bfloat16 types. Could you clarify the rationale behind vLLM's decision to disable these conversion operators specifically for CUDA versions prior to 12.0?

In experimental modifications, undefining these flags in allspark_qgemm_w8a16.cu and allspark_utils.cuh allows the code to compile successfully under CUDA 11.8. Would this approach be considered safe and acceptable for production use? I'm particularly interested in understanding potential compatibility risks or functional limitations this modification might introduce in CUDA 11.8 environments.
