Add AWQ support #530

Closed
@jerryzh168

AWQ seems popular: it has 3000 appearances in Hugging Face models (https://huggingface.co/models?sort=trending&search=AWQ), similar to GPTQ. Maybe we can add it to torchao as well.

Overview

At a high level, AWQ scales the weights by some power of the average per-channel activation magnitude (Sx^(alpha)), as described in the paper, where Sx is the average per-channel magnitude of the activation.
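To make the scaling concrete, here is the identity it relies on, written for a linear layer y = Wx. The symbols W, x, s and Q (the weight quantizer) are notation chosen for this note; s has one entry per input channel, and the approximation holds only up to quantization error.

```latex
s = S_x^{\alpha}, \qquad
y = W x
  = \bigl(W \operatorname{diag}(s)\bigr)\bigl(\operatorname{diag}(s)^{-1} x\bigr)
  \approx Q\bigl(W \operatorname{diag}(s)\bigr)\bigl(\operatorname{diag}(s)^{-1} x\bigr)
```

With alpha = 0 the scale is all ones and this reduces to plain weight quantization.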

Implementation in original awq repo

The main steps are finding the scale and applying it to the weights.

Note: In the original awq implementation, the logic for finding the scale is a bit complicated, but that's mainly to deal with the separate qkv modules. We could start by implementing awq for simple linears only, and worry about the more complicated model structures later.

For applying the scales, the original impl requires manually specifying the prev_module. We could do the same, or we could symbolically trace the model (preserving all call_modules) to figure out the relationship between different modules programmatically.

How to implement it in torchao

First, I think we can focus on implementing AWQ for the linear module only. We can get the activation stats using observers and search for the alpha parameter based on the output of the quantized linear module. We can reuse the existing quant_primitives for affine quantization in torchao.
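A minimal sketch of that alpha search, under a few assumptions: a plain-PyTorch fake quantizer stands in for torchao's quant_primitives, alpha is searched on a uniform grid over [0, 1), and the error metric is the mean squared error of the linear output. The names `fake_quantize_per_channel` and `search_alpha` are hypothetical, not torchao APIs.

```python
import torch


def fake_quantize_per_channel(w: torch.Tensor, n_bits: int = 4) -> torch.Tensor:
    """Asymmetric per-output-channel quantize then dequantize (stand-in quantizer)."""
    qmax = 2 ** n_bits - 1
    w_min = w.amin(dim=1, keepdim=True)
    w_max = w.amax(dim=1, keepdim=True)
    scale = (w_max - w_min).clamp(min=1e-8) / qmax
    zero_point = torch.round(-w_min / scale)
    q = torch.clamp(torch.round(w / scale) + zero_point, 0, qmax)
    return (q - zero_point) * scale


def search_alpha(weight: torch.Tensor, calib_x: torch.Tensor, n_grid: int = 20, n_bits: int = 4):
    """Grid-search alpha, minimizing the output error of the quantized linear."""
    # Sx: average magnitude of the activation per input channel.
    s_x = calib_x.abs().mean(dim=0).clamp(min=1e-4)
    reference = calib_x @ weight.t()
    best_alpha, best_scale, best_err = 0.0, torch.ones_like(s_x), float("inf")
    for i in range(n_grid):
        alpha = i / n_grid
        s = s_x.pow(alpha)
        # Scale the weight's input channels up, quantize, and scale the input down.
        w_q = fake_quantize_per_channel(weight * s, n_bits)
        out = (calib_x / s) @ w_q.t()
        err = (out - reference).pow(2).mean().item()
        if err < best_err:
            best_alpha, best_scale, best_err = alpha, s, err
    return best_alpha, best_scale, best_err
```

Because alpha = 0 is part of the grid, the returned error is never worse than unscaled quantization on the calibration data.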

Step 1. Collecting Observer Stats

To collect activation stats, we could follow the existing ObservedLinear:

import torch
import torch.nn.functional as F
from torch import Tensor

class ObservedLinear(torch.nn.Linear):
    def __init__(self, in_features: int, out_features: int, act_obs: torch.nn.Module, weight_obs: torch.nn.Module, bias: bool = True, device=None, dtype=None):
        super().__init__(in_features, out_features, bias, device, dtype)
        self.act_obs = act_obs
        self.weight_obs = weight_obs

    def forward(self, input: Tensor):
        # Observers record statistics and pass their input through.
        observed_input = self.act_obs(input)
        observed_weight = self.weight_obs(self.weight)
        return F.linear(observed_input, observed_weight, self.bias)

    @classmethod
    def from_float(cls, float_linear, act_obs, weight_obs):
        observed_linear = cls(float_linear.in_features, float_linear.out_features, act_obs, weight_obs, False, device=float_linear.weight.device, dtype=float_linear.weight.dtype)
        # Share the float module's parameters instead of copying them.
        observed_linear.weight = float_linear.weight
        observed_linear.bias = float_linear.bias
        return observed_linear

We can implement a similar ObservedLinear with an observer (or just a logger) to log the activation(s).

We can create a function insert_awq_observers_ similar to:

def insert_observers_(model, act_obs, weight_obs):
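One possible shape for this, as a sketch only: the observer below is a simple logger that accumulates the mean absolute activation per input channel, and `insert_awq_observers_` swaps every `torch.nn.Linear` for an observed version in place. `ActivationMagnitudeLogger`, `AWQObservedLinear`, and the exact signature of `insert_awq_observers_` are assumptions made for illustration, not existing torchao code.

```python
import torch
import torch.nn.functional as F


class ActivationMagnitudeLogger(torch.nn.Module):
    """Hypothetical observer: running mean of |x| per input channel."""

    def __init__(self, in_features: int):
        super().__init__()
        self.register_buffer("abs_sum", torch.zeros(in_features))
        self.register_buffer("count", torch.zeros((), dtype=torch.long))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        flat = x.detach().reshape(-1, x.shape[-1])
        self.abs_sum += flat.abs().sum(dim=0).to(self.abs_sum.dtype)
        self.count += flat.shape[0]
        return x  # pass-through: observing must not change the output

    def average_magnitude(self) -> torch.Tensor:
        return self.abs_sum / self.count.clamp(min=1)


class AWQObservedLinear(torch.nn.Linear):
    def __init__(self, in_features, out_features, act_obs, bias=True, device=None, dtype=None):
        super().__init__(in_features, out_features, bias, device, dtype)
        self.act_obs = act_obs

    def forward(self, input: torch.Tensor) -> torch.Tensor:
        return F.linear(self.act_obs(input), self.weight, self.bias)

    @classmethod
    def from_float(cls, float_linear: torch.nn.Linear):
        act_obs = ActivationMagnitudeLogger(float_linear.in_features).to(float_linear.weight.device)
        observed = cls(float_linear.in_features, float_linear.out_features, act_obs, False,
                       device=float_linear.weight.device, dtype=float_linear.weight.dtype)
        observed.weight = float_linear.weight
        observed.bias = float_linear.bias
        return observed


def insert_awq_observers_(model: torch.nn.Module) -> torch.nn.Module:
    """Recursively replace child nn.Linear modules with AWQObservedLinear, in place."""
    for name, child in list(model.named_children()):
        if isinstance(child, torch.nn.Linear) and not isinstance(child, AWQObservedLinear):
            setattr(model, name, AWQObservedLinear.from_float(child))
        else:
            insert_awq_observers_(child)
    return model
```

After running calibration batches through the model, each observed linear exposes its Sx estimate through `act_obs.average_magnitude()`.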

Step 2. Integrate with AffineQuantizedTensor

Calculating per channel scale can happen when we apply quantization to the weights, similar to:

# weight quantization
weight_scale, weight_zero_point = observed_linear.weight_obs.calculate_qparams()

def weight_quant_func(weight):
    block_size = (1, weight.shape[1])
    return to_affine_quantized_static(weight, weight_scale, weight_zero_point, block_size, target_dtype)

linear = torch.nn.Linear(
    observed_linear.in_features,
    observed_linear.out_features,
    False,
    device=observed_linear.weight.device,
    dtype=observed_linear.weight.dtype,
)
linear.weight = observed_linear.weight
linear.bias = observed_linear.bias
linear.weight = torch.nn.Parameter(weight_quant_func(linear.weight), requires_grad=False)

# activation quantization
act_scale, act_zero_point = observed_linear.act_obs.calculate_qparams()
input_quant_func = lambda x: to_affine_quantized_static(x, act_scale, act_zero_point, x.shape, target_dtype)
linear.weight = torch.nn.Parameter(to_linear_activation_quantized(linear.weight, input_quant_func), requires_grad=False)

As discussed with @vayuda in CUDA_MODE, I think we could implement a new LayoutType and AQTLayout that scales the weight with equalization_scale before quantization and applies the equalization_scale tensor to the input activation tensor in the linear operator. (Note: I think we should call this equalization_scale because it's not AWQ only; SmoothQuant can reuse it.)
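To illustrate the intended behavior without touching AffineQuantizedTensor, here is a rough stand-in written as a plain module. It is a sketch under assumptions: `EqualizedInt8Linear` is a hypothetical name, it uses simple symmetric per-channel int8 quantization rather than the real layout machinery, and it assumes the convention that the weight is multiplied by equalization_scale while the input is divided by it.

```python
import torch
import torch.nn.functional as F


class EqualizedInt8Linear(torch.nn.Module):
    """Hypothetical stand-in for the proposed layout: int8 weight plus equalization_scale."""

    def __init__(self, linear: torch.nn.Linear, equalization_scale: torch.Tensor):
        super().__init__()
        # Scale each input channel of the weight before quantizing it.
        w = linear.weight.detach() * equalization_scale
        qscale = w.abs().amax(dim=1, keepdim=True).clamp(min=1e-8) / 127.0
        int_weight = torch.clamp(torch.round(w / qscale), -127, 127).to(torch.int8)
        self.register_buffer("int_weight", int_weight)
        self.register_buffer("qscale", qscale)
        self.register_buffer("equalization_scale", equalization_scale.clone())
        self.bias = linear.bias

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Dequantize the weight and undo the equalization on the activation side.
        w = self.int_weight.to(x.dtype) * self.qscale.to(x.dtype)
        return F.linear(x / self.equalization_scale, w, self.bias)
```

In the real design the division of the input would live inside the layout's linear dispatch rather than in a module's forward.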

In terms of API, we can implement a helper function like

def int4_weight_only(group_size=128, inner_k_tiles=8):
to support any configuration.

Note: We may be able to fuse equalization_scale into the kernel as well, but our current A16W4 kernel is implemented in tinygemm, so we'd need to modify the tinygemm kernels. If we rely on torch.compile instead, this would be easy to do.

Additional Optimizations

Turn Input-Weight Equalization to Cross Layer Equalization

As we can see from the original implementation, when applying the scale to linear weights, the scale is applied to both the current linear weight and the weight of the previous module. This is only applicable if the previous operation satisfies:

f(sx) = sf(x)

see Section 4.1 of https://arxiv.org/pdf/1906.04721 for more details.
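As a small illustration of this condition, the sketch below folds the scale into a previous linear layer that is separated from the current one by a ReLU. ReLU satisfies f(sx) = sf(x) for a positive per-channel s, so the float output is unchanged. The function name `fold_scale_into_previous_linear` is hypothetical, and s is assumed to be strictly positive with one entry per input channel of the current linear.

```python
import torch


@torch.no_grad()
def fold_scale_into_previous_linear(prev: torch.nn.Linear, cur: torch.nn.Linear, s: torch.Tensor) -> None:
    """Move the division by s into `prev` and the multiplication by s into `cur`, in place."""
    # Dividing the output channels of prev by s produces x / s at the input of cur.
    prev.weight.div_(s.view(-1, 1))
    if prev.bias is not None:
        prev.bias.div_(s)
    # Multiplying the input channels of cur by s compensates for it.
    cur.weight.mul_(s.view(1, -1))
```

Once the scale is folded this way, no extra multiply is needed on the activation at runtime.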

But this could be true for many use cases. To safely apply this optimization, we could do the following:

import torch
import torch.fx

model = torch.fx.symbolic_trace(model)
named_modules = dict(model.named_modules(remove_duplicate=False))
for n in model.graph.nodes:
    if n.op == "call_module":
        module = named_modules[n.target]
        if not isinstance(module, torch.nn.Linear):
            continue
        # Check the previous module against an allowlist of modules that the scale can be
        # applied to, and change the layout of the weight tensor from the AWQ layout to
        # the normal TensorCoreTiled layout.

see https://pytorch.org/docs/stable/fx.html for docs related to torch.fx
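A runnable sketch of the traversal on a toy model, to show how the previous module could be looked up from the graph. The allowlist contents and the helper name `find_foldable_pairs` are assumptions for illustration; the layout change itself is left out.

```python
import torch
import torch.fx

# Hypothetical allowlist of module types whose parameters can absorb the scale.
ALLOWED_PREVIOUS = (torch.nn.Linear, torch.nn.LayerNorm)


def find_foldable_pairs(model: torch.nn.Module):
    """Return (previous_module_name, linear_name) pairs eligible for folding."""
    traced = torch.fx.symbolic_trace(model)
    named_modules = dict(traced.named_modules(remove_duplicate=False))
    pairs = []
    for n in traced.graph.nodes:
        if n.op != "call_module" or not isinstance(named_modules[n.target], torch.nn.Linear):
            continue
        prev = n.args[0] if n.args else None
        if (
            isinstance(prev, torch.fx.Node)
            and prev.op == "call_module"
            and isinstance(named_modules[prev.target], ALLOWED_PREVIOUS)
            # Only fold when the previous output feeds this linear alone.
            and len(prev.users) == 1
        ):
            pairs.append((prev.target, n.target))
    return pairs
```

The single-user check is a conservative choice: if the previous module's output is consumed elsewhere, rescaling it would change those other consumers too.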

Logistics (Code Location, Test and Benchmarks)

Please create an awq folder under https://github.com/pytorch/ao/tree/main/torchao/prototype
The flow and layout implementation can be in separate files, e.g. flow.py, layout.py (there might be some missing extension points of AffineQuantizedTensor, but we'll work on these at the same time)

For testing, please create a test_awq.py in https://github.com/pytorch/ao/tree/main/test/prototype
We can test the basic insert_awq_observers_ flow as well as the layout creation.

For the e2e flow demo, please add an awq.py in https://github.com/pytorch/ao/tree/main/tutorials/calibration_flow
following the static quant example. Please show the benchmarking results as well (since we are using an optimized kernel), following https://github.com/pytorch/ao/tree/main/torchao/quantization#quantization-flow-example

The last step is to test this with llama2/llama3 following the instructions in https://github.com/pytorch/ao/tree/main/torchao/_models/llama and measure the metrics in https://github.com/pytorch/ao/tree/main/torchao/quantization#benchmarks if you have GPU machines.
