[Feat]: Add support for Dynamic Quant 4 bit CPU kleidiai kernels #17112
base: main
Conversation
👋 Hi! Thank you for contributing to the vLLM project. 💬 Join our developer Slack at https://slack.vllm.ai to discuss your PR in #pr-reviews, coordinate on features in #feat- channels, or join special interest groups in #sig- channels. Just a reminder: PRs do not trigger a full CI run by default; only a reduced set of checks runs automatically. Once the PR is approved and ready to go, your PR reviewer(s) can run CI to test the changes comprehensively before merging. 🚀
This is really cool! Looking forward to review when ready.
Description:
1. Add optimized kernel support for Arm 4-bit matmul kernels

Signed-off-by: Nikhil Gupta <nikhil.gupta2@arm.com>
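For readers unfamiliar with the scheme, here is a minimal, unoptimized reference sketch (not the PR's implementation, which dispatches to KleidiAI-optimized kernels exposed through PyTorch on Arm CPUs) of what a dynamic-quant 4-bit matmul computes: activations are quantized to int8 per token on the fly, while weights are 4-bit with per-group scales. All names and shapes below are illustrative assumptions.

```python
import torch

def dyn_quant_int4_linear_ref(x, w_int4, scales, group_size):
    """Reference of what a dynamic-quant 4-bit matmul computes (illustrative only).

    x       : [batch, in_features] float activations
    w_int4  : [out_features, in_features] int8 tensor holding values in [-8, 7]
    scales  : [out_features, in_features // group_size] per-group weight scales
    """
    # Dynamic (per-token, symmetric) activation quantization to int8.
    a_scale = x.abs().amax(dim=-1, keepdim=True).clamp(min=1e-9) / 127.0
    x_q = torch.clamp(torch.round(x / a_scale), -128, 127)

    # Dequantize weights group by group; an optimized kernel would instead keep
    # the matmul in the integer domain and fold the scales into the epilogue.
    out_f, in_f = w_int4.shape
    w_deq = (w_int4.float().reshape(out_f, in_f // group_size, group_size)
             * scales.unsqueeze(-1)).reshape(out_f, in_f)
    return (x_q * a_scale) @ w_deq.t()
```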
Force-pushed from 4807d37 to b258f93
Thanks @mgoin, the change is ready for initial review from my end. Could you help me with the CI failures?
Also, would it be possible to allow some of my team members to push branches to the vLLM project, add specific reviewers, etc., for better contributions?
Thank you for your PR! How are you thinking of testing the integration?
We don't recommend pushing branches directly to vLLM, as there are many developers; rather, please continue to make branches on your fork and submit PRs against upstream main. I recommend joining the Developer Slack to discuss complex work! Also, would you be willing to keep copyright changes/additions to just files that are completely new, or remove them? We don't use this header, just the Apache-2.0 one, and we want consistency across the code base.
Hello, we are checking on this and will get back with a resolution soon.
Hello @mgoin,
Referenced diff context:
scales_and_zp_size = input_size_per_partition // effective_group_size
...
weight = ModelWeightParameter(data=torch.empty(
These are just int8 values in the int4 range, I'm assuming?
Yes, you are correct
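For concreteness, a tiny illustrative sketch (shapes and group size are made up, not taken from the PR) of unpacked 4-bit weights stored as int8 values restricted to the signed int4 range, with one scale per group as counted by the `scales_and_zp_size` expression shown above:

```python
import torch

out_features, in_features, group_size = 128, 256, 32

# Each 4-bit weight held as an int8 value in [-8, 7]; packing into the
# kernel's layout would happen later, at weight-processing time.
w_int4 = torch.randint(-8, 8, (out_features, in_features), dtype=torch.int8)

# One scale per group of `group_size` inputs, i.e.
# scales_and_zp_size = input_size_per_partition // effective_group_size.
scales = torch.rand(out_features, in_features // group_size, dtype=torch.float32)
```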
Can you share a model checkpoint that has been quantized this way? We don't have an ARM runner for vLLM at the moment, but we also don't have testing for W4A8 production in llmcompressor either, so I just worry about this format being tested over time.
Can you remove these debug changes?
I can remove the prints.
I remember an issue about a particular layer not being found was reported earlier. We faced the same issue, skipped the tensor, and it worked fine. I think a resolution along the same lines was provided by the vLLM team as well; I will find and link the issue for reference.
Given that vLLM now has contributions from over 1000 individuals, maintaining a list of copyright holders seems overburdensome. It will inevitably always be incomplete and/or out of date, and I'd rather we avoid it. These notices do not actually reflect copyright ownership, and if really needed, all of that information can be retrieved through git history. I'd rather we not expand the use of these notices in the code base. There are some exceptions, such as when code is copied from another place and we want to (or are required by the license to) attribute the source, but in general, I don't think it's necessary.
I can share the llm-compressor config that can quantize the model with W4A16. Please check this for more context.
W4A16 is weight-only quantization. Do you have a model with W4A8 quantization from compressed-tensors to validate the integration, or are you just editing the config from W4A16? We have a sample scheme for W4A8: https://github.com/neuralmagic/compressed-tensors/blob/1068c848e420443d4d5d73fe031b78bf7a832926/src/compressed_tensors/quantization/quant_scheme.py#L158
The model quantization at the llmcompressor level is W4A16. When the Linear layers go to the KleidiAI kernel in PyTorch, the activations are quantized down to 8 bits. As per this comment neuralmagic/compressed-tensors#269 (comment) from @brian-dellabetta, we want to make it clear to the user that activations are quantized to 8 bits at some level (llmcompressor or PyTorch). This is the reason we are calling our scheme W4A8, even though it is actually W4A16 at the llmcompressor level.
@nikhil-arm Even if compressed-tensors is not generating a scale for the activations stored in the checkpoint, this information needs to be captured in the config. We expect compressed-tensors models with dynamic activations to still have a config entry for their activations, as expected by the integration layer in vLLM. Do you have a sample model you can share that you ran to test the integration?
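As a point of reference, here is a sketch of how a W4A8 scheme with dynamic activations might be expressed with compressed-tensors' Python API. The field values (group size, strategies, targets) are illustrative assumptions; the `quant_scheme.py` preset linked above is the authoritative definition.

```python
from compressed_tensors.quantization import QuantizationArgs, QuantizationScheme

# Illustrative only: 4-bit group-quantized weights plus dynamic 8-bit per-token
# input activations, so the dynamic activation path is visible in the checkpoint
# config rather than only implied by the runtime kernel.
w4a8_dynamic = QuantizationScheme(
    targets=["Linear"],
    weights=QuantizationArgs(
        num_bits=4, type="int", symmetric=True,
        strategy="group", group_size=32,
    ),
    input_activations=QuantizationArgs(
        num_bits=8, type="int", symmetric=True,
        strategy="token", dynamic=True,
    ),
)
```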
Hello, I cannot share a model externally, but I can share a small script to quantize supported models.
When you run inference on the converted model, please upcast the model to float32 at vLLM inference time.
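A rough sketch of the kind of quantization script described here, assuming llmcompressor's `oneshot` flow with the `W4A16` preset scheme; the model ID, ignore list, and save path are placeholders rather than values from the PR:

```python
from transformers import AutoModelForCausalLM, AutoTokenizer
from llmcompressor import oneshot
from llmcompressor.modifiers.quantization import QuantizationModifier

MODEL_ID = "meta-llama/Llama-3.2-1B-Instruct"  # placeholder model
SAVE_DIR = MODEL_ID.split("/")[-1] + "-W4A16"

model = AutoModelForCausalLM.from_pretrained(MODEL_ID, torch_dtype="auto")
tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)

# Weight-only 4-bit quantization; activations stay in floating point in the
# checkpoint and are dynamically quantized to 8 bits inside the CPU kernels.
recipe = QuantizationModifier(targets="Linear", scheme="W4A16", ignore=["lm_head"])

oneshot(model=model, recipe=recipe)
model.save_pretrained(SAVE_DIR, save_compressed=True)
tokenizer.save_pretrained(SAVE_DIR)
```

The float32 upcast mentioned above would then look something like this at inference time:

```python
from vllm import LLM

llm = LLM(model=SAVE_DIR, dtype="float32")  # upcast at vLLM inference time
print(llm.generate("Hello, my name is")[0].outputs[0].text)
```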