Some fixes for AWQ #269
Conversation
Moving this to ready for review!
Force-pushed from 909fd32 to afc8a7a
Force-pushed from fcea528 to 125ff36
I want to add a preset scheme in quant_scheme.py, similar to the W4A16_ASYM that has been added here, and I need some clarification about the naming of the preset schemes. Does W4A16 mean that only the weights are quantized and the activations are used as-is, or does it mean that the activations are just fed as 16-bit values to the linear layer?
Hi @manaalmj, thanks for reaching out. You can find all the presets starting here. W4A16 means weights are quantized to 4 bits while activations are kept at FP16 (so no activation quantization appears in the args), as compared to W8A8, where both weights and activations are quantized to 8 bits.
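For illustration, a rough sketch of how that distinction shows up in the preset definitions in quant_scheme.py (field names are approximate; check the actual file for the exact arguments):

```python
from compressed_tensors.quantization import (
    QuantizationArgs,
    QuantizationStrategy,
    QuantizationType,
)

# W4A16 (sketch): weights quantized to int4 group-wise, no activation args,
# so activations stay FP16 at the compressed-tensors level
W4A16 = dict(
    weights=QuantizationArgs(
        num_bits=4,
        type=QuantizationType.INT,
        strategy=QuantizationStrategy.GROUP,
        group_size=128,
        symmetric=True,
    ),
)

# W8A8 (sketch): both weights and input activations quantized to int8
W8A8 = dict(
    weights=QuantizationArgs(num_bits=8, type=QuantizationType.INT, symmetric=True),
    input_activations=QuantizationArgs(num_bits=8, type=QuantizationType.INT, symmetric=True),
)
```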
Force-pushed from 125ff36 to f476970
Hello @rahul-tuli, thanks for your inputs. Should the preset naming scheme reflect to the user that activations will be quantized to 8 bits, or should it reflect that activations are NOT quantized at the llm-compressor/compressed-tensors level, unlike weights? We are planning to contribute optimizations and quantized kernels to vLLM and would like to contribute our scheme presets as well.
Hi @nikhil-arm, our preset schemes, found under quantization_schemes, indicate whether or not activations are quantized. This is indicated by the value after the `A` in the scheme name.
Hello @dsikka, thanks for your reply. Sorry, I think I was not clear with my question. When you say the scheme indicates whether activations are quantized, do you mean quantized at the compressed_tensors / llm_compressor level, or quantized at any level (PyTorch / compressed_tensors / llm_compressor)? Arm 4-bit and 8-bit quantized matmul operations do not quantize activations at the compressed_tensors level, but internally quantize activations at the PyTorch kernel level. What would be the ideal naming scheme for that: w4a16 or w4a8?
Hi @nikhil-arm, the naming should be agnostic to where in the stack the quantization occurs. It's a model definition, not specific to the implementation. So if activations are passed to the next module as 8 bits per value, and your weights are 4 bits per value, it would be W4A8.
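In preset terms, that would look roughly like the sketch below (illustrative only, field names approximate):

```python
from compressed_tensors.quantization import (
    QuantizationArgs,
    QuantizationStrategy,
    QuantizationType,
)

# W4A8 (sketch): 4-bit group-wise weights plus 8-bit input activations
W4A8 = dict(
    weights=QuantizationArgs(
        num_bits=4,
        type=QuantizationType.INT,
        strategy=QuantizationStrategy.GROUP,
        group_size=128,
        symmetric=True,
    ),
    input_activations=QuantizationArgs(
        num_bits=8,
        type=QuantizationType.INT,
        symmetric=True,
        dynamic=True,  # assumption: per-token dynamic activation quantization
    ),
)
```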
Thanks for your reply. Let me give a small example of how it happens. The operation inputs fp16 and outputs fp16. When the fp16 activation goes into the 4-bit matmul, we quantize it internally, within the module, to 8 bits, perform the matrix multiplication (8-bit activation x 4-bit weight), convert the output back to fp16, and the next module receives it as fp16; the cycle then continues. The matmul happens between 8-bit activations and 4-bit weights, but this is not exposed to the user. If you look at it at the nn.Module level, it appears to take an fp16 activation and an int4 weight and produce an fp16 output. If I understood you correctly, you are okay with calling it w4a16?
Hi @nikhil-arm, in that case what you're doing is called QDQ (quantization + de-quantization): quantize fp16 to 8-bit, operate on the 8-bit values, de-quantize back to fp16. You'd probably want to make sure that is clear to users, as it is not exactly equivalent to a scheme where the next layer consumes the quantized activations directly.
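To make the QDQ flow concrete, here is a rough PyTorch-style sketch of what was described above (simulated with fake-quant math; a real kernel would use packed int4 weights and integer matmuls, and `qdq_int8` / `w4a8_qdq_matmul` are hypothetical helper names):

```python
import torch


def qdq_int8(x: torch.Tensor) -> torch.Tensor:
    """Symmetric per-tensor int8 quantize + dequantize (QDQ) of an activation."""
    scale = x.abs().amax() / 127.0
    x_q = torch.clamp(torch.round(x / scale), -128, 127)
    return x_q * scale  # back to floating-point values


def w4a8_qdq_matmul(x_fp16: torch.Tensor, w_deq: torch.Tensor) -> torch.Tensor:
    """
    fp16 comes in, activations are quantized to 8 bits inside the module,
    multiplied against the (already quantized/dequantized) 4-bit weights,
    and the result is handed to the next module as fp16 again.
    """
    x_qdq = qdq_int8(x_fp16)      # internal 8-bit activation quantization
    y = x_qdq @ w_deq.t()         # 8-bit-activation x 4-bit-weight matmul (simulated)
    return y.to(torch.float16)    # next module still sees fp16
```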
This is very efficient for us, latency- and accuracy-wise. But thank you very much for your suggestions. Do you have an example of matmuls that output an fp8/int8 data type that is consumed by the next layer as fp8/int8?
ah, roger that!
For this reason, you are right, usually the end
Thanks @brian-dellabetta. @manaalmj, I think based on this valuable feedback we can go ahead with w8a8/w4a8 for our integration.
SUMMARY: Addition of [`AWQModifier`](https://arxiv.org/pdf/2306.00978), based on the [AutoAWQ implementation](https://github.com/casper-hansen/AutoAWQ/blob/main/awq/quantize/quantizer.py#L28). Should be reviewed/merged in conjunction with neuralmagic/compressed-tensors#269. Replaces #181 and #824.

TEST PLAN: Some unit tests are included, but as this was mostly a port from AutoAWQ, we validated the code by ensuring we could reproduce the evaluation metrics in Table 4 of [the paper](https://arxiv.org/pdf/2306.00978). We achieve the following wikitext PPL scores:

Llama-2 7B, group size 128:
1. Paper: 5.60
2. AutoAWQ: 5.615
3. This implementation: 5.612
4. We match what the paper reports for RTN alone: 5.73
5. We get reasonable results for channel-wise quantization: 6.788. AutoAWQ errors out for this case (setting "q_group_size": -1 in the quant_config), and the result is not reported in the paper.

Llama-2 13B, group size 128:
1. We match the results of AutoAWQ and the results shown in the paper: 4.97
2. We match what the paper reports for RTN alone: 4.984

NOTE: We are excluding the clipping logic in this implementation. If we want to add it, we should add it as a separate modifier; the two are mutually exclusive, and the AWQ data model doesn't align well with clipping. That might be the reason for the slight deviation between the results reported in the paper and those of our implementation.

---------

Signed-off-by: Brian Dellabetta <bdellabe@redhat.com>
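For reviewers, applying the modifier would look roughly like the sketch below. This is a hypothetical usage example: the argument names, the calibration dataset, and the `oneshot` parameters shown are illustrative and may not match the merged API exactly.

```python
from llmcompressor import oneshot
from llmcompressor.modifiers.awq import AWQModifier

# Hypothetical recipe: AWQ with 4-bit, group-size-128 asymmetric weights,
# matching the W4A16_ASYM preset discussed in compressed-tensors#269.
recipe = [
    AWQModifier(targets=["Linear"], scheme="W4A16_ASYM", ignore=["lm_head"]),
]

oneshot(
    model="meta-llama/Llama-2-7b-hf",   # model used for the PPL comparison above
    dataset="open_platypus",            # placeholder calibration dataset
    recipe=recipe,
    max_seq_length=2048,
    num_calibration_samples=512,
)
```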
This aligns the logic for asymmetric quantization with the implementation found in AutoAWQ's `pseudo_quantize_tensor` function. This is core logic, so we should all make sure we're in agreement with the changes before merging. To be reviewed/merged in conjunction with vllm-project/llm-compressor#1177.
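For context, the asymmetric path being aligned to is roughly the following. This is a simplified sketch of AutoAWQ's asymmetric quantize/dequantize branch, not a verbatim copy of either codebase:

```python
import torch


def pseudo_quantize_asym(w: torch.Tensor, num_bits: int = 4, group_size: int = 128) -> torch.Tensor:
    """Asymmetric quantize + dequantize of weights, grouped along the input dimension."""
    assert w.numel() % group_size == 0, "weight size must be divisible by group_size"
    orig_shape = w.shape
    w = w.reshape(-1, group_size)

    # per-group min/max define the asymmetric range
    max_val = w.amax(dim=1, keepdim=True)
    min_val = w.amin(dim=1, keepdim=True)
    max_int = 2**num_bits - 1

    scale = (max_val - min_val).clamp(min=1e-5) / max_int
    zero = (-torch.round(min_val / scale)).clamp(0, max_int)

    # quantize to [0, max_int], then dequantize back to floating point
    w_q = torch.clamp(torch.round(w / scale) + zero, 0, max_int)
    return ((w_q - zero) * scale).reshape(orig_shape)
```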