
Some fixes for AWQ #269


Merged
merged 5 commits into main from awq-fixes on Apr 10, 2025

Conversation

rahul-tuli
Member

@rahul-tuli rahul-tuli commented Mar 7, 2025

This aligns the logic for asymmetric quantization with the implementation found in AutoAWQ's pseudo_quantize_tensor function. This is core logic, so we should all make sure we're in agreement with the changes before merging.

To be reviewed/merged in conjunction with vllm-project/llm-compressor#1177
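For reference, here is a minimal sketch of the asymmetric quantize-dequantize round trip as AutoAWQ's pseudo_quantize_tensor performs it (per-group reshaping omitted, and the function name here is mine); it illustrates the logic being aligned, not the exact code in this PR:

```python
import torch

def pseudo_quantize_asym(w: torch.Tensor, num_bits: int = 4) -> torch.Tensor:
    # Asymmetric range: separate per-row min/max instead of a symmetric +/- max.
    max_val = w.amax(dim=-1, keepdim=True)
    min_val = w.amin(dim=-1, keepdim=True)
    max_int = 2**num_bits - 1  # 15 for 4-bit
    scales = (max_val - min_val).clamp(min=1e-5) / max_int
    zeros = (-torch.round(min_val / scales)).clamp(0, max_int)
    q = torch.clamp(torch.round(w / scales) + zeros, 0, max_int)
    return (q - zeros) * scales  # dequantized ("fake-quant") weights
```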

Contributor

@brian-dellabetta brian-dellabetta left a comment


Moving this to ready for review!

@manaalmj

manaalmj commented Apr 4, 2025

I want to add a preset scheme in quant_scheme.py, similar to the W4A16_ASYM added here, and I need some clarification about the naming of the preset schemes. Does W4A16 mean that only the weights are quantized and the activations are used as-is, or does it mean that the activations are just fed as 16-bit to the linear layer?

@brian-dellabetta
Contributor

I want to add a preset scheme in quant_scheme.py, similar to the W4A16_ASYM added here, and I need some clarification about the naming of the preset schemes. Does W4A16 mean that only the weights are quantized and the activations are used as-is, or does it mean that the activations are just fed as 16-bit to the linear layer?

Hi @manaalmj, thanks for reaching out. You can find all the presets starting here. W4A16 means the weights are quantized to 4 bits while the activations are kept at FP16 (so no activation quantization appears in the args), as compared to W8A8, where both are quantized to 8 bits.
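For illustration, the naming convention maps onto the presets roughly like this, shown as plain dicts rather than the actual QuantizationArgs objects; the field values are assumptions, so see quant_scheme.py for the real definitions:

```python
# Illustrative only -- consult quant_scheme.py for the exact preset arguments.
W4A16 = {
    "weights": {"num_bits": 4, "type": "int", "symmetric": True, "group_size": 128},
    "input_activations": None,  # activations stay FP16: no activation quantization args
}
W8A8 = {
    "weights": {"num_bits": 8, "type": "int", "symmetric": True},
    "input_activations": {"num_bits": 8, "type": "int", "dynamic": True},
}
```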

rahul-tuli and others added 5 commits April 9, 2025 17:30
@dsikka dsikka merged commit 3a88875 into main Apr 10, 2025
1 check passed
@dsikka dsikka deleted the awq-fixes branch April 10, 2025 15:44
@nikhil-arm

I want to add a preset scheme in quant_scheme.py, similar to the W4A16_ASYM added here, and I need some clarification about the naming of the preset schemes. Does W4A16 mean that only the weights are quantized and the activations are used as-is, or does it mean that the activations are just fed as 16-bit to the linear layer?

Hi @manaalmj, thanks for reaching out. You can find all the presets starting here. W4A16 means the weights are quantized to 4 bits while the activations are kept at FP16 (so no activation quantization appears in the args), as compared to W8A8, where both are quantized to 8 bits.

Hello @rahul-tuli, thanks for your input.
We have an internal change to integrate 4-bit matmul / linear kernels (pytorch/pytorch#143289).
The API takes activations as floating point (fp32/bf16/fp16) but internally quantizes the activations to 8 bits.

Should the preset naming scheme reflect to the user that activations will be quantized to 8 bits, or should it reflect that activations are NOT quantized at the llm-compressor/compressed-tensors level, unlike the weights?

We are planning to contribute optimizations and quantized kernels to vLLM and would like to contribute our scheme presets as well.
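For clarity, here is a hypothetical sketch of the kernel behaviour described above: floating-point activations in and out, with the 8-bit activation quantization happening inside. The function name, parameters, and shapes are illustrative assumptions, not the actual op from pytorch/pytorch#143289:

```python
import torch

def int4_linear_fp_io(x: torch.Tensor, w_int4: torch.Tensor,
                      w_scales: torch.Tensor) -> torch.Tensor:
    # Dynamic per-row (per-token) quantization of the fp activations to int8.
    a_scales = (x.abs().amax(dim=-1, keepdim=True) / 127.0).clamp(min=1e-8)
    x_int8 = torch.clamp(torch.round(x / a_scales), -128, 127)

    # Integer matmul (int8 activations x int4 weights); emulated here in
    # floating point, a real kernel keeps the arithmetic in integers.
    acc = x_int8 @ w_int4.to(x.dtype).T

    # Rescale back to the activation dtype using both scale sets.
    return (acc * a_scales * w_scales.T).to(x.dtype)
```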

@dsikka
Collaborator

dsikka commented Apr 11, 2025

pytorch/pytorch#143289

Hi @nikhil-arm

Our preset schemes found under the quantization_schemes indicate whether or not activations are quantized. This is indicated by the value after the A: in W8A8, for example, the activations are quantized to 8 bits.

@nikhil-arm

nikhil-arm commented Apr 11, 2025

pytorch/pytorch#143289

Hi @nikhil-arm

Our preset schemes found under the quantization_schemes indicate whether or not activations are quantized. This is indicated by the value after the A: in W8A8, for example, the activations are quantized to 8 bits.

Hello @dsikka, thanks for your reply.

Sorry, I think I was not clear with my question.

When you say
Our preset schemes found under the [quantization_schemes](https://github.com/neuralmagic/compressed-tensors/blob/main/src/compressed_tensors/quantization/quant_scheme.py#L33) indicate whether or not activations are quantized.

Do you mean the scheme indicates that activations are quantized at the compressed_tensor/llm_compressor level, or does it indicate that activations are quantized at any level (pytorch/compressed_tensor/llm_compressor)?

The Arm 4-bit and 8-bit quantized matmul operations do not quantize activations at the compressed_tensor level but internally quantize activations at the pytorch kernel level.

What would be the ideal naming scheme for it: w4a16 or w4a8?

@brian-dellabetta
Contributor

Do you mean the scheme indicates that activations are quantized at the compressed_tensor/llm_compressor level, or does it indicate that activations are quantized at any level (pytorch/compressed_tensor/llm_compressor)?

The Arm 4-bit and 8-bit quantized matmul operations do not quantize activations at the compressed_tensor level but internally quantize activations at the pytorch kernel level.

What would be the ideal naming scheme for it: w4a16 or w4a8?

Hi @nikhil-arm, the naming should be agnostic to where in the stack it occurs. It's a model definition, not specific to an implementation. So if the activations are passed to the next module as 8 bits per value and your weights are 4 bits per value, it would be w4a8.

w4a16 means the input vectors are fp16 and are matmuled by int4-quantized weights to get fp16 output vectors.
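As a minimal sketch of what that means at the module level (the function name is hypothetical, and real w4a16 kernels fuse the dequantize into the matmul rather than materializing fp16 weights):

```python
import torch

def w4a16_linear(x: torch.Tensor, w_q: torch.Tensor,
                 scales: torch.Tensor, zeros: torch.Tensor) -> torch.Tensor:
    # Only the weights are quantized (stored as 4-bit integer values);
    # the activations are never quantized -- fp16 in, fp16 out.
    w = (w_q.to(x.dtype) - zeros) * scales  # dequantize the int4 weights
    return x @ w.T
```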

@nikhil-arm

nikhil-arm commented Apr 11, 2025

Do you mean the scheme indicates that activations are quantized at the compressed_tensor/llm_compressor level, or does it indicate that activations are quantized at any level (pytorch/compressed_tensor/llm_compressor)?
The Arm 4-bit and 8-bit quantized matmul operations do not quantize activations at the compressed_tensor level but internally quantize activations at the pytorch kernel level.
What would be the ideal naming scheme for it: w4a16 or w4a8?

Hi @nikhil-arm, the naming should be agnostic to where in the stack it occurs. It's a model definition, not specific to an implementation. So if the activations are passed to the next module as 8 bits per value and your weights are 4 bits per value, it would be w4a8.

w4a16 means the input vectors are fp16 and are matmuled by int4-quantized weights to get fp16 output vectors.

Thanks for your reply. Let me give you a small example of how it happens.

The operation inputs fp16 and outputs fp16. When the fp16 activation goes into the 4-bit matmul, we quantize it internally within the module to 8 bits, perform the matrix multiplication (8-bit activation x 4-bit weight), convert the output back to fp16, and the next module receives it as fp16; the cycle continues.

The matmul happens between 8-bit activations and 4-bit weights, but this is not exposed to the user. If you look at it at the nn.Module level, it appears to take in fp16 activations and an int4 weight and give out fp16 output.

If I understood you correctly, are you okay with calling it w4a16?

@brian-dellabetta
Contributor

Do you mean the scheme indicates that activations are quantized at the compressed_tensor/llm_compressor level, or does it indicate that activations are quantized at any level (pytorch/compressed_tensor/llm_compressor)?
The Arm 4-bit and 8-bit quantized matmul operations do not quantize activations at the compressed_tensor level but internally quantize activations at the pytorch kernel level.
What would be the ideal naming scheme for it: w4a16 or w4a8?

Hi @nikhil-arm, the naming should be agnostic to where in the stack it occurs. It's a model definition, not specific to an implementation. So if the activations are passed to the next module as 8 bits per value and your weights are 4 bits per value, it would be w4a8.
w4a16 means the input vectors are fp16 and are matmuled by int4-quantized weights to get fp16 output vectors.

Thanks for your reply. Let me give you a small example of how it happens.

The operation inputs fp16 and outputs fp16. When the fp16 activation goes into the 4-bit matmul, we quantize it internally within the module to 8 bits, perform the matrix multiplication (8-bit activation x 4-bit weight), convert the output back to fp16, and the next module receives it as fp16; the cycle continues.

The matmul happens between 8-bit activations and 4-bit weights, but this is not exposed to the user. If you look at it at the nn.Module level, it appears to take in fp16 activations and an int4 weight and give out fp16 output.

If I understood you correctly, are you okay with calling it w4a16?

Hi @nikhil-arm, in that case what you're doing is called QDQ (quantization + de-quantization): quantize fp16 to fp8, operate on fp8, de-quantize back to fp16. You'd probably want to make sure that was clear to users, as it is not exactly equivalent to w4a16 and is not performant. If you're following that up with another QDQ in the next layer, why not keep the activations as fp8? That would be the conventional w4a8.

@nikhil-arm

nikhil-arm commented Apr 11, 2025

Do you mean the scheme indicates that activations are quantized at the compressed_tensor/llm_compressor level, or does it indicate that activations are quantized at any level (pytorch/compressed_tensor/llm_compressor)?
The Arm 4-bit and 8-bit quantized matmul operations do not quantize activations at the compressed_tensor level but internally quantize activations at the pytorch kernel level.
What would be the ideal naming scheme for it: w4a16 or w4a8?

Hi @nikhil-arm, the naming should be agnostic to where in the stack it occurs. It's a model definition, not specific to an implementation. So if the activations are passed to the next module as 8 bits per value and your weights are 4 bits per value, it would be w4a8.
w4a16 means the input vectors are fp16 and are matmuled by int4-quantized weights to get fp16 output vectors.

Thanks for your reply. Let me give you a small example of how it happens.
The operation inputs fp16 and outputs fp16. When the fp16 activation goes into the 4-bit matmul, we quantize it internally within the module to 8 bits, perform the matrix multiplication (8-bit activation x 4-bit weight), convert the output back to fp16, and the next module receives it as fp16; the cycle continues.
The matmul happens between 8-bit activations and 4-bit weights, but this is not exposed to the user. If you look at it at the nn.Module level, it appears to take in fp16 activations and an int4 weight and give out fp16 output.
If I understood you correctly, are you okay with calling it w4a16?

Hi @nikhil-arm, in that case what you're doing is called QDQ (quantization + de-quantization): quantize fp16 to fp8, operate on fp8, de-quantize back to fp16. You'd probably want to make sure that was clear to users, as it is not exactly equivalent to w4a16 and is not performant. If you're following that up with another QDQ in the next layer, why not keep the activations as fp8? That would be the conventional w4a8.

why not keep the activations as fp8?
A couple of reasons:

  1. That's how our kernels are designed.
  2. We are not dequantizing at the end; the Arm 8-bit matmul instruction outputs int32 data directly, and we scale it to fp32/fp16 according to the scales of the weights and activations (see the sketch just below). So it depends on the hardware instructions. Basically, if you run a 4-bit or 8-bit matmul you accumulate in 32 bits to avoid overflows, so the hardware instruction always outputs 32-bit data.
  3. To take fp8/int8 activations into the next layer, we would need to rescale them to the statically calculated scales of the next layer, but we would like to stick to dynamic activation quantization for better accuracy on LLMs.

This is very efficient for us, both latency- and accuracy-wise.
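To make the data flow in point 2 concrete, here is an illustrative sketch (not Arm kernel code; the function name and shapes are assumptions) of an integer matmul with a 32-bit accumulator followed by a single rescale using the dynamic activation scales and the weight scales:

```python
import torch

def int_matmul_int32_accum(x_int8: torch.Tensor, a_scales: torch.Tensor,
                           w_int4: torch.Tensor, w_scales: torch.Tensor,
                           out_dtype=torch.float16) -> torch.Tensor:
    # The hardware instruction accumulates the int8 x int4 products in int32 ...
    acc = x_int8.to(torch.int32) @ w_int4.to(torch.int32).T
    # ... and the int32 result is scaled straight to fp16/fp32, so there is no
    # separate dequantize step and no fp8/int8 tensor handed to the next layer.
    return (acc.to(torch.float32) * a_scales * w_scales.T).to(out_dtype)
```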

But thank you very much for your suggestions. Do you have an example of matmuls that output an fp8/int8 data type that is consumed by the next layer as fp8/int8?
Even if matmuls work with this kind of chained fp8/int8 approach, wouldn't you anyway dequantize the fp8/int8 output for operations like softmax, attention, RoPE, layer norms, etc.?

@brian-dellabetta
Contributor

1. That's how our kernels are designed.

ah, roger that!

But thank you very much for your suggestions. Do you have an example of matmuls that output an fp8/int8 data type that is consumed by the next layer as fp8/int8? Even if matmuls work with this kind of chained fp8/int8 approach, wouldn't you anyway dequantize the fp8/int8 output for operations like softmax, attention, RoPE, layer norms, etc.?

For this reason, wNa8 only works for specific GPU architectures with fp8 kernels; see the vLLM docs here for more info. If the hardware does support it, though, it will do those softmax/attention/layer_norm operations in fp8.

You are right that the final lm_head and the initial embedding layers are usually kept in fp16; I believe the activations will get upconverted/dequantized back to fp16 to account for that. But if we are quantizing all the middle layers, it doesn't have to do that at every layer. I am not sure how much of a computational cost this amounts to, though.

@nikhil-arm

Thanks @brian-dellabetta. @manaalmj, I think that based on this valuable feedback we can go ahead with w8a8/w4a8 for our integration.

dsikka pushed a commit to vllm-project/llm-compressor that referenced this pull request Apr 21, 2025
SUMMARY:
Addition of [`AWQModifier`](https://arxiv.org/pdf/2306.00978), based on the [AutoAWQ implementation](https://github.com/casper-hansen/AutoAWQ/blob/main/awq/quantize/quantizer.py#L28).

Should be reviewed/merged in conjunction with
neuralmagic/compressed-tensors#269

Replaces #181 and #824 

TEST PLAN:
Some unit tests are included, but as this was mostly a port from AutoAWQ, we validated the code by ensuring we could reproduce the evaluation metrics in Table 4 of [the paper](https://arxiv.org/pdf/2306.00978). We achieve the following wikitext PPL scores:

Llama-2 7B Group 128:
1. Paper: 5.60
2. AutoAWQ: 5.615
3. This implementation: 5.612
4. We match what the paper reports for just RTN -- 5.73
5. We get reasonable results for channel-wise quantization -- 6.788. AutoAWQ errors out for this (setting "q_group_size": -1 in the quant_config), and the results are not reported in the paper.

Llama-2 13B Group 128:
1. We match the results of AutoAWQ and the results shown in the paper: 4.97
2. We match what the paper reports for just RTN -- 4.984

NOTE: We are excluding the clipping logic in this implementation. If we want to add it, we should add it as another modifier; the two are mutually exclusive, and the data model for AWQ doesn't align well with clipping. That might be the reason for the slight deviation between the results reported in the paper and those from our implementation.

---------

Signed-off-by: Brian Dellabetta <bdellabe@redhat.com>