Can you explain exactly how per-token quantization is performed on o_proj and down_proj?

https://github.com/AniZpZ/AutoSmoothQuant/blob/main/autosmoothquant/layers/nn/linear.py#L310

```python
int8_weight, weight_scale = quantize_per_tensor_absmax(module.weight)
if act_quant == "per-token":
    alpha = weight_scale
```
When act_quant is "per-token", weight_scale still comes from quantize_per_tensor_absmax, which is a bit confusing to me.
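For reference, an absmax per-tensor quantizer typically computes a single scale from the largest absolute value in the whole weight tensor. A minimal sketch (not necessarily the repository's exact implementation):

```python
import torch

@torch.no_grad()
def quantize_per_tensor_absmax(w: torch.Tensor):
    # One scale for the entire tensor: map max(|w|) to the int8 limit 127.
    scale = w.abs().max() / 127
    int8_w = (w / scale).round().clamp(-128, 127).to(torch.int8)
    return int8_w, scale
```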
The text was updated successfully, but these errors were encountered:
"per-token" is not applied to weights, it applies to the activations of o_proj and down_proj.
Weights always perform "per-tensor" for now.
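In other words, the weight keeps a single per-tensor scale (the weight_scale/alpha in the snippet above), while each token's activation row gets its own scale at runtime. A rough sketch of per-token activation quantization, assuming an absmax scheme (names are illustrative, not the repository's exact API):

```python
import torch

@torch.no_grad()
def quantize_activation_per_token_absmax(x: torch.Tensor):
    # One scale per token: each row (token) of the activation matrix is
    # scaled independently by its own max(|x|) / 127.
    scale = (x.abs().max(dim=-1, keepdim=True).values / 127).clamp(min=1e-8)
    int8_x = (x / scale).round().clamp(-128, 127).to(torch.int8)
    return int8_x, scale

# The int8 matmul output is then dequantized with both scales:
# y ~= (int8_x @ int8_w.T) * scale (one per row) * weight_scale (scalar)
```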
"per-token" is not applied to weights, it applies to the activations of o_proj and down_proj. Weights always perform "per-tensor" for now.
How can I perform a partial quantization like this?
- partial quant 1: only down_proj uses fp16
- partial quant 2: both o_proj and down_proj use fp16
vllm-project/vllm#1508 (comment)
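The linked vLLM comment discusses the supported approach. Purely as an illustration of the idea (not an official AutoSmoothQuant API), one way to express this kind of partial quantization is to skip the listed projections when replacing linears; quantize_linear below is a hypothetical stand-in for whatever routine converts a single nn.Linear to int8:

```python
import torch.nn as nn

def partially_quantize(model: nn.Module, keep_fp16, quantize_linear):
    # Hypothetical helper: quantize every nn.Linear except those whose
    # qualified name contains an entry of `keep_fp16` (those stay fp16).
    targets = [(name, m) for name, m in model.named_modules()
               if isinstance(m, nn.Linear)]
    for name, module in targets:
        if any(key in name for key in keep_fp16):
            continue  # e.g. "down_proj" stays an fp16 nn.Linear
        parent_name, _, child_name = name.rpartition(".")
        parent = model.get_submodule(parent_name) if parent_name else model
        setattr(parent, child_name, quantize_linear(module))

# partial quant 1: only down_proj stays fp16
#   partially_quantize(model, {"down_proj"}, quantize_linear)
# partial quant 2: both o_proj and down_proj stay fp16
#   partially_quantize(model, {"o_proj", "down_proj"}, quantize_linear)
```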