Question about per-token quant #9

Open
Hongbosherlock opened this issue Feb 2, 2024 · 2 comments

@Hongbosherlock

Could you explain exactly how per-token quantization is performed on o_proj and down_proj?

https://github.com/AniZpZ/AutoSmoothQuant/blob/main/autosmoothquant/layers/nn/linear.py#L310

    int8_weight, weight_scale = quantize_per_tensor_absmax(module.weight)
    if act_quant == "per-token":
        alpha = weight_scale

When using per-token, the weight_scale still comes from quantize_per_tensor_absmax, which is a bit confusing to me.
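For reference, per-tensor absmax quantization uses a single scale for the entire weight tensor. A minimal sketch of what quantize_per_tensor_absmax computes (a simplification for illustration, not the repo's exact code):

```python
import torch

def quantize_per_tensor_absmax(w: torch.Tensor):
    # One scale for the whole tensor: map the largest |value| onto the int8 range.
    scale = w.abs().max() / 127
    int8_w = (w / scale).round().clamp(-128, 127).to(torch.int8)
    return int8_w, scale
```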

@AniZpZ (Owner) commented Feb 4, 2024

"per-token" is not applied to weights, it applies to the activations of o_proj and down_proj.

Weights always perform "per-tensor" for now.
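In other words, "per-token" means one scale per row (token) of the activation matrix, computed at runtime, while the weight keeps a single per-tensor scale; that is presumably why the snippet above folds only weight_scale into alpha. A minimal sketch of per-token absmax activation quantization (the function name is illustrative, not the repo's exact code):

```python
import torch

def quantize_activation_per_token_absmax(x: torch.Tensor):
    # x: (batch * seq_len, hidden_dim). One scale per token (per row), so an
    # outlier in one token does not widen the quantization range of the others.
    scale = x.abs().amax(dim=-1, keepdim=True) / 127
    int8_x = (x / scale).round().clamp(-128, 127).to(torch.int8)
    return int8_x, scale  # scale has shape (batch * seq_len, 1)
```

Since these per-token scales only exist at runtime, only the weight's per-tensor scale can be baked in statically; the activation scales are applied during dequantization.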

@Hongbosherlock (Author)

How can I perform partial quantization like this?

partial quant 1: only down_proj uses fp16
partial quant 2: both o_proj and down_proj use fp16

vllm-project/vllm#1508 (comment)
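A hypothetical sketch of one way to do such module-level partial quantization: walk the model and swap only the nn.Linear layers whose names are not on a keep-in-fp16 list. should_quantize, apply_partial_quant, and the quantize_linear callable are illustrative names assumed here, not AutoSmoothQuant's actual API:

```python
import torch.nn as nn

# Which projections stay in fp16:
FP16_KEEP = {"down_proj"}               # partial quant 1
# FP16_KEEP = {"o_proj", "down_proj"}   # partial quant 2

def should_quantize(module_name: str) -> bool:
    # module_name looks like "model.layers.0.mlp.down_proj"
    return module_name.split(".")[-1] not in FP16_KEEP

def apply_partial_quant(model: nn.Module, quantize_linear):
    # quantize_linear: a callable mapping an nn.Linear to its int8 replacement.
    for parent_name, parent in list(model.named_modules()):
        for child_name, child in list(parent.named_children()):
            full_name = f"{parent_name}.{child_name}" if parent_name else child_name
            if isinstance(child, nn.Linear) and should_quantize(full_name):
                setattr(parent, child_name, quantize_linear(child))
    return model
```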
