
Explore the representation of finer quantization granularities than per-axis #1569

Open
sdasgup3 opened this issue Jun 2, 2023 · 8 comments

@sdasgup3
Member

sdasgup3 commented Jun 2, 2023

This issue is based on the discussion and the interest shown by the community in exploring the representation of finer quantization granularities for quantization parameters in StableHLO.

Please refer to the discussion link for additional references.
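For readers less familiar with the terminology, here is a minimal NumPy sketch (hypothetical shapes and block size, not StableHLO syntax) contrasting the per-axis granularity StableHLO supports today with the finer sub-channel/blockwise granularity this issue asks about:

```python
import numpy as np

rng = np.random.default_rng(0)
w = rng.normal(size=(8, 64)).astype(np.float32)  # weights: [out_channels, in_features]

# Per-tensor: a single scale for the whole tensor.
scale_per_tensor = np.max(np.abs(w)) / 127.0                                 # scalar

# Per-axis: one scale per output channel (already expressible in StableHLO).
scale_per_axis = np.max(np.abs(w), axis=1) / 127.0                           # shape (8,)

# Sub-channel / blockwise: split each channel into blocks of 16 inputs and
# give every block its own scale -- the finer granularity discussed here.
block = 16
scale_blockwise = np.max(np.abs(w.reshape(8, -1, block)), axis=-1) / 127.0   # shape (8, 4)

print(scale_per_tensor, scale_per_axis.shape, scale_blockwise.shape)
```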

@powderluv

Also AWQ, which apparently is better than GPTQ: https://arxiv.org/abs/2306.00978

@sdasgup3
Member Author

sdasgup3 commented Jun 9, 2023

@powderluv Thanks for providing the link! Indeed an interesting read.

This is what I understood from the paper:

There is a small fraction of salient weights that are much more important for an LLM's performance than the rest. To find the salient weight channels, the authors look at the activation distribution rather than the weight distribution: weight channels corresponding to larger activation magnitudes are more salient, since they process more important features. The paper relies on the input activation magnitude to pick out the salient weight channels and their corresponding scales; for the non-salient weight channels, the weight magnitude is used to derive the scales.
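To make that concrete, here is a rough NumPy sketch of the selection idea as I read it; the calibration data, the top-1% threshold, and the scale formulas are hypothetical stand-ins, not the paper's actual algorithm:

```python
import numpy as np

rng = np.random.default_rng(0)
w = rng.normal(size=(64, 128)).astype(np.float32)   # weights: [in_features, out_features]
x = rng.normal(size=(256, 64)).astype(np.float32)   # calibration activations: [tokens, in_features]

# Salience is judged from the activation distribution, not the weights:
# input channels that see larger activation magnitudes matter more.
act_mag = np.mean(np.abs(x), axis=0)                 # per input channel, shape (64,)
salient = act_mag >= np.quantile(act_mag, 0.99)      # top ~1% of channels

# Per-channel scales: activation-derived for salient channels,
# weight-magnitude-derived for the rest (both formulas are placeholders).
weight_mag = np.max(np.abs(w), axis=1)               # shape (64,)
scales = np.where(salient, act_mag / act_mag.mean(), weight_mag / 127.0)

print(int(salient.sum()), "salient channels out of", w.shape[0])
```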

One thing I was wondering about: from the point of view of expressing the quantization parameters in StableHLO, this seems similar to the per-axis scaling scheme, which StableHLO already supports. IMO, the novelty here lies in how the scales are computed for each channel.

Please let me know if I am missing something.

@sunshinemyson

Do we have any news on this topic?

@sdasgup3
Member Author

Hello @sunshinemyson,
There is ongoing planning around supporting sub-channel and blockwise schemes for specifying the quantization parameters; I will update this issue sometime by the end of this month. In the meantime, do you mind elaborating on your use case? I am wondering whether it is in line with #1535 (comment).
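For anyone following along, here is a small NumPy sketch (block size and shapes are made up) of why the blockwise/sub-channel scheme is attractive: per-block scales track local weight statistics, so the quantize/dequantize round-trip error drops compared to a single per-channel scale.

```python
import numpy as np

rng = np.random.default_rng(0)
w = rng.normal(size=(8, 256)).astype(np.float32)  # one layer's weights: [out, in]

def quant_dequant(x, scale):
    # Symmetric int8 fake-quantization with the given scale(s).
    return np.clip(np.round(x / scale), -127, 127) * scale

# Per-axis: one scale per output channel.
axis_scale = np.max(np.abs(w), axis=1, keepdims=True) / 127.0
per_axis = quant_dequant(w, axis_scale)

# Blockwise: a separate scale for every block of 32 inputs within a channel.
block = 32
blocks = w.reshape(8, -1, block)
block_scale = np.max(np.abs(blocks), axis=-1, keepdims=True) / 127.0
blockwise = quant_dequant(blocks, block_scale).reshape(w.shape)

print("per-axis  MSE:", float(np.mean((w - per_axis) ** 2)))
print("blockwise MSE:", float(np.mean((w - blockwise) ** 2)))
```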

@sunshinemyson

Hi @sdasgup3,

Thanks for your reply. We are actually working on GPTQ too; in our experience with the Llama model, GPTQ can reach promising results.

Looking forward to updates.

Thanks

@sunshinemyson

Hi @sdasgup3,

Would you mind sharing your progress on this topic? Recently, I have found AWQ/GPTQ to be very popular while building my own LLM application locally. You can find more quantized models at https://huggingface.co/TheBloke.

Thanks

@sdasgup3
Member Author

Hi @sunshinemyson
I should have updated this earlier. Sorry for the delayed response.

Recently, I have found AWQ/GPTQ to be very popular while building my own LLM application locally. You can find more quantized models at https://huggingface.co/TheBloke.

Thanks for sharing the use cases!

We are gathering requirements from stakeholders to use as a basis for the spec changes for multi-dimensional per-axis and sub-channel support. I very much hope to come up with a plan by early next month. Please stay tuned.

@sdasgup3
Member Author

Just wanted to note here that we still plan to take this on in Q2'24. I will keep posting updates.

sdasgup3 moved this from Ready to Backlog in StableHLO v1.0 Release on Apr 11, 2024