[Misc][Kernel]: Add GPTQAllSpark Quantization #12931
Conversation
Force-pushed from 7438d1a to c18d24b.
Looks pretty straightforward, thanks for the nice work! My main questions are about supported hardware beyond Ampere, and whether we could move this in as a mixed-precision kernel backend under vllm/model_executor/layers/quantization/kernels/mixed_precision/ rather than as a new quantization method.
Could you also run a full model eval with this method to check e2e accuracy? Here is an example with a GPTQ model on gsm8k:
pip install "lm_eval[api]==0.4.4"
lm_eval --model vllm --model_args pretrained=Qwen/Qwen2.5-1.5B-Instruct-GPTQ-Int8,quantization=gptq_marlin --trust_remote_code --tasks gsm8k --num_fewshot 5 --batch_size auto
...
Processed prompts: 100%|████████████████████████████████████████████████████████████████████| 1319/1319 [00:20<00:00, 64.37it/s, est. speed input: 63907.10 toks/s, output: 7886.44 toks/s]
Running generate_until requests: 100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 1319/1319 [00:21<00:00, 62.64it/s]
2025-02-14:19:10:51,817 INFO [evaluation_tracker.py:269] Output path not provided, skipping saving results aggregated
vllm (pretrained=Qwen/Qwen2.5-1.5B-Instruct-GPTQ-Int8,quantization=gptq_marlin,trust_remote_code=True), gen_kwargs: (None), limit: None, num_fewshot: 5, batch_size: auto
|Tasks|Version| Filter |n-shot| Metric | |Value | |Stderr|
|-----|------:|----------------|-----:|-----------|---|-----:|---|-----:|
|gsm8k| 3|flexible-extract| 5|exact_match|↑ |0.5936|± |0.0135|
| | |strict-match | 5|exact_match|↑ |0.5512|± |0.0137|
CMakeLists.txt (outdated diff)
@@ -297,6 +297,22 @@ if(VLLM_GPU_LANG STREQUAL "CUDA")
      " in CUDA target architectures")
  endif()

  # AllSpark kernels
  cuda_archs_loose_intersection(ALLSPARK_ARCHS "8.0;8.6;8.7;8.9;9.0" "${CUDA_ARCHS}")
Do we need to support all of these arches? For instance are SM89 and SM90 realistic targets?
    capability_tuple.to_int())

# For Ampere GPU
if device_capability >= 80 and device_capability < 90:
This still considers SM89 which is Ada Lovelace, is this intentional?
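For illustration, if the intent really is Ampere only, the check could exclude SM89 along these lines. This is a minimal sketch: the helper name is hypothetical, and the capability encoding (major * 10 + minor) follows the quoted diff.

def is_allspark_ampere_target(device_capability: int) -> bool:
    """Hypothetical helper: True only for Ampere parts (SM80/SM86/SM87),
    excluding Ada Lovelace (SM89) and Hopper (SM90)."""
    return device_capability in (80, 86, 87)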
Since you are targeting the GPTQ format, you could likely leverage the existing MPLinearKernel
abstraction to plug in GPTQAllSpark as a new possible kernel (see the original RFC for this design)
It could slot into the priority list of available kernels here:
vllm/vllm/model_executor/layers/quantization/kernels/mixed_precision/__init__.py
Lines 16 to 30 in c9e2d64
# in priority/performance order (when available)
_POSSIBLE_KERNELS: List[Type[MPLinearKernel]] = [
    MacheteLinearKernel,
    MarlinLinearKernel,
    ExllamaLinearKernel,
]


def choose_mp_linear_kernel(
        config: MPLinearLayerConfig,
        compute_capability: Optional[int] = None) -> Type[MPLinearKernel]:
    """
    Choose an MPLinearKernel that can implement the given config for the given
    compute capability. Attempts to choose the best kernel in terms of
    performance.
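As a sketch of how the new kernel might slot into that list; the class name AllSparkLinearKernel and its position in the priority order are assumptions, not the PR's final code:

# Hypothetical fragment of kernels/mixed_precision/__init__.py with the
# AllSpark kernel added. Imports of the kernel classes are as in the
# existing file; AllSparkLinearKernel is an assumed name.
_POSSIBLE_KERNELS: List[Type[MPLinearKernel]] = [
    MacheteLinearKernel,
    AllSparkLinearKernel,  # assumed to be preferred over Marlin where it applies
    MarlinLinearKernel,
    ExllamaLinearKernel,
]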
Which would then be used in gptq_marlin.py here to choose the kernel method:
vllm/vllm/model_executor/layers/quantization/gptq_marlin.py
Lines 223 to 234 in c9e2d64
mp_linear_kernel_config = MPLinearLayerConfig(
    full_weight_shape=(input_size, output_size),
    partition_weight_shape=\
        (input_size_per_partition, output_size_per_partition),
    weight_type=self.quant_config.quant_type,
    act_type=params_dtype,
    group_size=self.quant_config.group_size,
    zero_points=False,
    has_g_idx=self.quant_config.desc_act
)

kernel_type = choose_mp_linear_kernel(mp_linear_kernel_config)
A side benefit of this is it will also be enabled for 8bit models in the compressed-tensors format:
vllm/vllm/model_executor/layers/quantization/compressed_tensors/schemes/compressed_tensors_wNa16.py
Lines 71 to 82 in c9e2d64
mp_linear_kernel_config = MPLinearLayerConfig(
    full_weight_shape=(input_size, output_size),
    partition_weight_shape=\
        (input_size_per_partition, output_size_per_partition),
    weight_type=self.quant_type,
    act_type=params_dtype,
    group_size=self.group_size,
    zero_points=False,
    has_g_idx=self.has_g_idx
)

kernel_type = choose_mp_linear_kernel(mp_linear_kernel_config)
I realize this could be a bit too much work up front, but if you would be interested in moving to the new interface going forward I think it would help for the longevity of this kernel!
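For reference, a rough skeleton of what an AllSpark kernel behind the MPLinearKernel abstraction could look like. This is only a sketch: the base-class method names and import path are assumed from the existing mixed-precision kernels quoted above, the supported-config checks follow the limitations discussed in this thread (group_size=-1, no act-order), and the method bodies are stubs rather than the PR's actual implementation.

from typing import Optional, Tuple

import torch

# Import path assumed to mirror the existing mixed_precision kernels.
from vllm.model_executor.layers.quantization.kernels.mixed_precision.MPLinearKernel import (
    MPLinearKernel, MPLinearLayerConfig)


class AllSparkLinearKernel(MPLinearKernel):
    """Hypothetical sketch, not the PR's actual code."""

    @classmethod
    def get_min_capability(cls) -> int:
        return 80  # Ampere and newer

    @classmethod
    def can_implement(cls,
                      c: MPLinearLayerConfig) -> Tuple[bool, Optional[str]]:
        # Restrictions discussed in this thread: per-channel quantization
        # only, and no activation reordering (g_idx / desc_act).
        if c.group_size != -1:
            return False, "AllSpark only supports group_size == -1"
        if c.has_g_idx:
            return False, "AllSpark does not support act-order (g_idx)"
        return True, None

    def process_weights_after_loading(self, layer: torch.nn.Module) -> None:
        # Repack the GPTQ weights/scales into the layout expected by the
        # AllSpark A16W8 GEMM (details omitted).
        ...

    def apply_weights(self,
                      layer: torch.nn.Module,
                      x: torch.Tensor,
                      bias: Optional[torch.Tensor] = None) -> torch.Tensor:
        # Dispatch to the AllSpark A16W8 GEMM custom op (omitted here).
        ...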
Thanks for your detailed reply! Here are some additional points of explanation.
Qwen/Qwen2.5-1.5B-Instruct-GPTQ-Int8 is a sub-channel (grouped) quantization model, and the newly added AllSpark quantization kernel only supports group_size=-1, so the test above uses the Qwen2-7B-Instruct-quantized.w8a16 per-channel quantization model to check e2e accuracy instead.
Thank you for the response @wyajieha !
Thank you very much @mgoin! I will modify the code according to all the above comments and submit a new commit later.
Force-pushed from c8eae49 to 8fbff9f.
Signed-off-by: wyj371990 <wyj371990@alibaba-inc.com>
Force-pushed from 8fbff9f to 4007cd8.
Nicely integrated! I've enabled the full CI to run. I'm just a bit curious why you need the weight name for the kernel, if you could explain.
Signed-off-by: wyj371990 <wyj371990@alibaba-inc.com>
Signed-off-by: mgoin <mgoin64@gmail.com>
Hey @wyajieha, it seems this commit broke the CUDA 11.8 build due to lack of support for type conversions: https://buildkite.com/vllm/release/builds/3378/canvas?jid=01955067-903b-4dc6-84fb-878c7e3fb5ea#01955067-903b-4dc6-84fb-878c7e3fb5ea/114-2875. Is there a way to support this on 11.8, or should we just not build the kernels for CUDA < 12.0?
Apologies for the delay in noticing this. I will fix the issue as soon as possible.
Hi @mgoin, the compilation errors related to type conversions on CUDA 11.8 appear to stem from the activation of __CUDA_NO_HALF_OPERATORS__ and similar flags in cmake/utils.cmake (see https://github.com/vllm-project/vllm/blob/main/cmake/utils.cmake#L104). I observe that both the CUDA 11.8 and CUDA 12+ versions of cuda_fp16.hpp / cuda_bf16.hpp contain definitions for type conversions and arithmetic operations involving the half and nv_bfloat16 types. Could you clarify the rationale behind vLLM's decision to disable these conversion operators specifically for CUDA versions prior to 12.0? In experimental modifications, undefining these flags in allspark_qgemm_w8a16.cu and allspark_utils.cuh compiles successfully under CUDA 11.8. Would this approach be considered safe and acceptable for production use? I'm particularly interested in understanding any compatibility risks or functional limitations this modification might introduce in CUDA 11.8 environments.
Signed-off-by: Louis Ulmer <ulmerlouis@gmail.com>
This PR mainly adds Ampere-specific optimizations for A16W8 quantization, supporting GPTQ-quantized models in the scenario where group_size=-1 and desc_act is False; in this scenario its performance is better than Marlin.
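As a usage sketch (not taken from the PR itself), the new path can then be selected from the offline Python API roughly as follows; the local model path and prompt are placeholders, and the gptq_allspark method name follows the benchmark command further below.

from vllm import LLM, SamplingParams

# Sketch: run a per-channel (group_size=-1) GPTQ W8A16 model with the
# AllSpark kernels. The model path is a placeholder.
llm = LLM(model="Qwen2-7B-Instruct-quantized.w8a16",
          quantization="gptq_allspark",
          dtype="float16")

outputs = llm.generate(["Explain W8A16 quantization in one sentence."],
                       SamplingParams(temperature=0.0, max_tokens=64))
print(outputs[0].outputs[0].text)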
An operator-level performance comparison (Marlin vs. AllSpark) can be run with the following command:
python3 benchmarks/kernels/benchmark_marlin.py --limit-num-bits 8 --limit-act-order 0 --limit-k-full 1 --limit-group-size -1
The following figure shows the performance comparison of Marlin vs. AllSpark for common GEMM shapes in the model under different M settings on an A100 GPU. The blue line shows the speedup of Marlin A16W8 GEMM over Torch FP16 GEMM, and the orange line shows the speedup of AllSpark A16W8 GEMM over Torch FP16 GEMM. When N and K are small and M is large, AllSpark performs significantly better than Marlin; in other scenarios the performance is basically the same.
Use the following command to run a throughput test on the Qwen2-7B-Instruct-quantized.w8a16 model on a single A100 card:
CUDA_VISIBLE_DEVICES=1 python3 benchmarks/benchmark_throughput.py --backend=vllm --model Qwen2-7B-Instruct-quantized.w8a16/ --quantization gptq_allspark (or gptq_marlin) --input-len 2048 --output-len 256 --num-prompts=1000 --trust-remote-code --dtype=float16 --kv-cache-dtype=auto --device=cuda
The end-to-end performance results for the full model are as follows: