
Some fixes for AWQ #269


Merged
merged 5 commits into main from awq-fixes on Apr 10, 2025

Conversation

rahul-tuli
Member

@rahul-tuli rahul-tuli commented Mar 7, 2025

This aligns the logic for asymmetric quantization with the implementation found in AutoAWQ's pseudo_quantize_tensor function. This is core logic, so we should all make sure we're in agreement with the changes before merging.

To be reviewed/merged in conjunction with vllm-project/llm-compressor#1177
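For reference, here is a minimal sketch of the asymmetric quantize-dequantize round trip as AutoAWQ's pseudo_quantize_tensor performs it (per-group reshaping omitted, and the function name here is mine); it illustrates the logic being aligned, not the exact code in this PR:

```python
import torch

def pseudo_quantize_asym(w: torch.Tensor, num_bits: int = 4) -> torch.Tensor:
    # Asymmetric range: separate per-row min/max instead of a symmetric +/- max.
    max_val = w.amax(dim=-1, keepdim=True)
    min_val = w.amin(dim=-1, keepdim=True)
    max_int = 2**num_bits - 1  # 15 for 4-bit
    scales = (max_val - min_val).clamp(min=1e-5) / max_int
    zeros = (-torch.round(min_val / scales)).clamp(0, max_int)
    q = torch.clamp(torch.round(w / scales) + zeros, 0, max_int)
    return (q - zeros) * scales  # dequantized ("fake-quant") weights
```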

Contributor

@brian-dellabetta brian-dellabetta left a comment


Moving this to ready for review!

@manaalmj

manaalmj commented Apr 4, 2025

I want to add a preset scheme in quant_scheme.py, similar to the W4A16_ASYM added here, and I need some clarification about the naming of the preset schemes. Does W4A16 mean that only the weights are quantized and the activations are used as-is, or does it mean that the activations are just fed as 16-bit to the linear layer?

@brian-dellabetta
Contributor

I want to add a preset scheme in quant_scheme.py, similar to the W4A16_ASYM added here, and I need some clarification about the naming of the preset schemes. Does W4A16 mean that only the weights are quantized and the activations are used as-is, or does it mean that the activations are just fed as 16-bit to the linear layer?

Hi @manaalmj, thanks for reaching out. You can find all the presets starting here. W4A16 means the weights are quantized to 4 bits while the activations are kept at FP16 (so no activation quantization appears in the args), as compared to W8A8, where both are quantized to 8 bits.
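For illustration, the naming convention maps onto the presets roughly like this, shown as plain dicts rather than the actual QuantizationArgs objects; the field values are assumptions, so see quant_scheme.py for the real definitions:

```python
# Illustrative only -- consult quant_scheme.py for the exact preset arguments.
W4A16 = {
    "weights": {"num_bits": 4, "type": "int", "symmetric": True, "group_size": 128},
    "input_activations": None,  # activations stay FP16: no activation quantization args
}
W8A8 = {
    "weights": {"num_bits": 8, "type": "int", "symmetric": True},
    "input_activations": {"num_bits": 8, "type": "int", "dynamic": True},
}
```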

rahul-tuli and others added 5 commits April 9, 2025 17:30
@dsikka dsikka merged commit 3a88875 into main Apr 10, 2025
1 check passed
@dsikka dsikka deleted the awq-fixes branch April 10, 2025 15:44
@nikhil-arm

I want to add a preset scheme in quant_scheme.py, similar to the W4A16_ASYM added here, and I need some clarification about the naming of the preset schemes. Does W4A16 mean that only the weights are quantized and the activations are used as-is, or does it mean that the activations are just fed as 16-bit to the linear layer?

Hi @manaalmj, thanks for reaching out. You can find all the presets starting here. W4A16 means the weights are quantized to 4 bits while the activations are kept at FP16 (so no activation quantization appears in the args), as compared to W8A8, where both are quantized to 8 bits.

Hello @rahul-tuli, thanks for your input.
We have an internal change to integrate 4-bit matmul / linear kernels (pytorch/pytorch#143289).
The API takes activations as floating point (fp32/bf16/fp16) but internally quantizes the activations to 8 bits.

Should the preset naming scheme reflect to the user that activations will be quantized to 8 bits, or should it reflect that activations are NOT quantized at the llm-compressor/compressed-tensors level, unlike the weights?

We are planning to contribute optimizations and quantized kernels to vLLM and would like to contribute our scheme presets as well.
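For clarity, here is a hypothetical sketch of the kernel behaviour described above: floating-point activations in and out, with the 8-bit activation quantization happening inside. The function name, parameters, and shapes are illustrative assumptions, not the actual op from pytorch/pytorch#143289:

```python
import torch

def int4_linear_fp_io(x: torch.Tensor, w_int4: torch.Tensor,
                      w_scales: torch.Tensor) -> torch.Tensor:
    # Dynamic per-row (per-token) quantization of the fp activations to int8.
    a_scales = (x.abs().amax(dim=-1, keepdim=True) / 127.0).clamp(min=1e-8)
    x_int8 = torch.clamp(torch.round(x / a_scales), -128, 127)

    # Integer matmul (int8 activations x int4 weights); emulated here in
    # floating point, a real kernel keeps the arithmetic in integers.
    acc = x_int8 @ w_int4.to(x.dtype).T

    # Rescale back to the activation dtype using both scale sets.
    return (acc * a_scales * w_scales.T).to(x.dtype)
```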

@dsikka
Collaborator

dsikka commented Apr 11, 2025

pytorch/pytorch#143289

Hi @nikhil-arm

Our preset schemes found under the quantization_schemes indicate whether or not activations are quantized. This is indicated by the value after the A: in W8A8, for example, the activations are quantized to 8 bits.

@nikhil-arm

nikhil-arm commented Apr 11, 2025

pytorch/pytorch#143289

Hi @nikhil-arm

Our preset schemes found under the quantization_schemes indicate whether or not activations are quantized. This is indicated by the value after the A: in W8A8, for example, the activations are quantized to 8 bits.

Hello @dsikka, thanks for your reply.

Sorry, I think I was not clear with my question.

When you say
Our preset schemes found under the [quantization_schemes](https://github.com/neuralmagic/compressed-tensors/blob/main/src/compressed_tensors/quantization/quant_scheme.py#L33) indicate whether or not activations are quantized.

Do you mean the scheme indicates that activations are quantized at the compressed_tensor/llm_compressor level, or does it indicate that activations are quantized at any level (pytorch/compressed_tensor/llm_compressor)?

The Arm 4-bit and 8-bit quantized matmul operations do not quantize activations at the compressed_tensor level but internally quantize activations at the pytorch kernel level.

What would be the ideal naming scheme for it: w4a16 or w4a8?

@brian-dellabetta
Contributor

Do you mean the scheme indicates that activations are quantized at the compressed_tensor/llm_compressor level, or does it indicate that activations are quantized at any level (pytorch/compressed_tensor/llm_compressor)?

The Arm 4-bit and 8-bit quantized matmul operations do not quantize activations at the compressed_tensor level but internally quantize activations at the pytorch kernel level.

What would be the ideal naming scheme for it: w4a16 or w4a8?

Hi @nikhil-arm, the naming should be agnostic to where in the stack it occurs. It's a model definition, not specific to an implementation. So if the activations are passed to the next module as 8 bits per value and your weights are 4 bits per value, it would be w4a8.

w4a16 means the input vectors are fp16 and are matmuled by int4-quantized weights to get fp16 output vectors.
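As a minimal sketch of what that means at the module level (the function name is hypothetical, and real w4a16 kernels fuse the dequantize into the matmul rather than materializing fp16 weights):

```python
import torch

def w4a16_linear(x: torch.Tensor, w_q: torch.Tensor,
                 scales: torch.Tensor, zeros: torch.Tensor) -> torch.Tensor:
    # Only the weights are quantized (stored as 4-bit integer values);
    # the activations are never quantized -- fp16 in, fp16 out.
    w = (w_q.to(x.dtype) - zeros) * scales  # dequantize the int4 weights
    return x @ w.T
```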

@nikhil-arm

nikhil-arm commented Apr 11, 2025

Do you mean the scheme indicates that activations are quantized at the compressed_tensor/llm_compressor level, or does it indicate that activations are quantized at any level (pytorch/compressed_tensor/llm_compressor)?
The Arm 4-bit and 8-bit quantized matmul operations do not quantize activations at the compressed_tensor level but internally quantize activations at the pytorch kernel level.
What would be the ideal naming scheme for it: w4a16 or w4a8?

Hi @nikhil-arm, the naming should be agnostic to where in the stack it occurs. It's a model definition, not specific to an implementation. So if the activations are passed to the next module as 8 bits per value and your weights are 4 bits per value, it would be w4a8.

w4a16 means the input vectors are fp16 and are matmuled by int4-quantized weights to get fp16 output vectors.

Thanks for your reply. Let me give you a small example of how it happens.

The operation inputs fp16 and outputs fp16. When the fp16 activation goes into the 4-bit matmul, we quantize it internally within the module to 8 bits, perform the matrix multiplication (8-bit activation x 4-bit weight), convert the output back to fp16, and the next module receives it as fp16; the cycle continues.

The matmul happens between 8-bit activations and 4-bit weights, but this is not exposed to the user. If you look at it at the nn.Module level, it appears to take in fp16 activations and an int4 weight and give out fp16 output.

If I understood you correctly, are you okay with calling it w4a16?

@brian-dellabetta
Contributor

Do you mean the scheme indicates that activations are quantized at the compressed_tensor/llm_compressor level, or does it indicate that activations are quantized at any level (pytorch/compressed_tensor/llm_compressor)?
The Arm 4-bit and 8-bit quantized matmul operations do not quantize activations at the compressed_tensor level but internally quantize activations at the pytorch kernel level.
What would be the ideal naming scheme for it: w4a16 or w4a8?

Hi @nikhil-arm, the naming should be agnostic to where in the stack it occurs. It's a model definition, not specific to an implementation. So if the activations are passed to the next module as 8 bits per value and your weights are 4 bits per value, it would be w4a8.
w4a16 means the input vectors are fp16 and are matmuled by int4-quantized weights to get fp16 output vectors.

Thanks for your reply. Let me give you a small example of how it happens.

The operation inputs fp16 and outputs fp16. When the fp16 activation goes into the 4-bit matmul, we quantize it internally within the module to 8 bits, perform the matrix multiplication (8-bit activation x 4-bit weight), convert the output back to fp16, and the next module receives it as fp16; the cycle continues.

The matmul happens between 8-bit activations and 4-bit weights, but this is not exposed to the user. If you look at it at the nn.Module level, it appears to take in fp16 activations and an int4 weight and give out fp16 output.

If I understood you correctly, are you okay with calling it w4a16?

Hi @nikhil-arm, in that case what you're doing is called QDQ (quantization + de-quantization): quantize fp16 to fp8, operate on fp8, de-quantize back to fp16. You'd probably want to make sure that was clear to users, as it is not exactly equivalent to w4a16 and is not performant. If you're following that up with another QDQ in the next layer, why not keep the activations as fp8? That would be the conventional w4a8.

@nikhil-arm

nikhil-arm commented Apr 11, 2025

Do you mean the scheme indicates that activations are quantized at the compressed_tensor/llm_compressor level, or does it indicate that activations are quantized at any level (pytorch/compressed_tensor/llm_compressor)?
The Arm 4-bit and 8-bit quantized matmul operations do not quantize activations at the compressed_tensor level but internally quantize activations at the pytorch kernel level.
What would be the ideal naming scheme for it: w4a16 or w4a8?

Hi @nikhil-arm, the naming should be agnostic to where in the stack it occurs. It's a model definition, not specific to an implementation. So if the activations are passed to the next module as 8 bits per value and your weights are 4 bits per value, it would be w4a8.
w4a16 means the input vectors are fp16 and are matmuled by int4-quantized weights to get fp16 output vectors.

Thanks for your reply. Let me give you a small example of how it happens.
The operation inputs fp16 and outputs fp16. When the fp16 activation goes into the 4-bit matmul, we quantize it internally within the module to 8 bits, perform the matrix multiplication (8-bit activation x 4-bit weight), convert the output back to fp16, and the next module receives it as fp16; the cycle continues.
The matmul happens between 8-bit activations and 4-bit weights, but this is not exposed to the user. If you look at it at the nn.Module level, it appears to take in fp16 activations and an int4 weight and give out fp16 output.
If I understood you correctly, are you okay with calling it w4a16?

Hi @nikhil-arm, in that case what you're doing is called QDQ (quantization + de-quantization): quantize fp16 to fp8, operate on fp8, de-quantize back to fp16. You'd probably want to make sure that was clear to users, as it is not exactly equivalent to w4a16 and is not performant. If you're following that up with another QDQ in the next layer, why not keep the activations as fp8? That would be the conventional w4a8.

why not keep the activations as fp8?
A couple of reasons:

  1. That's how our kernels are designed.
  2. We are not dequantizing at the end; the Arm 8-bit matmul instruction outputs int32 data directly, and we scale it to fp32/fp16 according to the scales of the weights and activations (see the sketch just below). So it depends on the hardware instructions. Basically, if you run a 4-bit or 8-bit matmul you accumulate in 32 bits to avoid overflows, so the hardware instruction always outputs 32-bit data.
  3. To take fp8/int8 activations into the next layer, we would need to rescale them to the statically calculated scales of the next layer, but we would like to stick to dynamic activation quantization for better accuracy on LLMs.

This is very efficient for us, both latency- and accuracy-wise.
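To make the data flow in point 2 concrete, here is an illustrative sketch (not Arm kernel code; the function name and shapes are assumptions) of an integer matmul with a 32-bit accumulator followed by a single rescale using the dynamic activation scales and the weight scales:

```python
import torch

def int_matmul_int32_accum(x_int8: torch.Tensor, a_scales: torch.Tensor,
                           w_int4: torch.Tensor, w_scales: torch.Tensor,
                           out_dtype=torch.float16) -> torch.Tensor:
    # The hardware instruction accumulates the int8 x int4 products in int32 ...
    acc = x_int8.to(torch.int32) @ w_int4.to(torch.int32).T
    # ... and the int32 result is scaled straight to fp16/fp32, so there is no
    # separate dequantize step and no fp8/int8 tensor handed to the next layer.
    return (acc.to(torch.float32) * a_scales * w_scales.T).to(out_dtype)
```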

But thank you very much for your suggestions. Do you have an example of matmuls that output an fp8/int8 data type that is consumed by the next layer as fp8/int8?
Even if matmuls work with this kind of chained fp8/int8 approach, wouldn't you anyway dequantize the fp8/int8 output for operations like softmax, attention, RoPE, layer norms, etc.?

@brian-dellabetta
Contributor

1. That's how our kernels are designed.

ah, roger that!

But thank you very much for your suggestions. Do you have an example of matmuls that output an fp8/int8 data type that is consumed by the next layer as fp8/int8? Even if matmuls work with this kind of chained fp8/int8 approach, wouldn't you anyway dequantize the fp8/int8 output for operations like softmax, attention, RoPE, layer norms, etc.?

For this reason, wNa8 only works for specific GPU architectures with fp8 kernels; see the vLLM docs here for more info. If the hardware does support it, though, it will do those softmax/attention/layer_norm operations in fp8.

You are right that the final lm_head and the initial embedding layers are usually kept in fp16; I believe the activations will get upconverted/dequantized back to fp16 to account for that. But if we are quantizing all the middle layers, it doesn't have to do that at every layer. I am not sure how much of a computational cost this amounts to, though.

@nikhil-arm

Thanks @brian-dellabetta. @manaalmj, I think that based on this valuable feedback we can go ahead with w8a8/w4a8 for our integration.

dsikka pushed a commit to vllm-project/llm-compressor that referenced this pull request Apr 21, 2025
SUMMARY:
Addition of [`AWQModifier`](https://arxiv.org/pdf/2306.00978), based on the [AutoAWQ implementation](https://github.com/casper-hansen/AutoAWQ/blob/main/awq/quantize/quantizer.py#L28).

Should be reviewed/merged in conjunction with
neuralmagic/compressed-tensors#269

Replaces #181 and #824 

TEST PLAN:
Some unit tests are included, but as this was mostly a port from AutoAWQ, we validated the code by ensuring we could reproduce the evaluation metrics in Table 4 of [the paper](https://arxiv.org/pdf/2306.00978). We achieve the following wikitext PPL scores:

Llama-2 7B Group 128:
1. Paper: 5.60
2. AutoAWQ: 5.615
3. This implementation: 5.612
4. We match what the paper reports for just RTN -- 5.73
5. We get reasonable results for channel-wise quantization -- 6.788. AutoAWQ errors out for this (setting "q_group_size": -1 in the quant_config), and the results are not reported in the paper.

Llama-2 13B Group 128:
1. We match the results of AutoAWQ and the results shown in the paper: 4.97
2. We match what the paper reports for just RTN -- 4.984

NOTE: We are excluding the clipping logic in this implementation. If we want to add it, we should add it as another modifier; the two are mutually exclusive, and the data model for AWQ doesn't align well with clipping. That might be the reason for the slight deviation between the results reported in the paper and those from our implementation.

---------

Signed-off-by: Brian Dellabetta <bdellabe@redhat.com>