[Feature] A calibration-free RTN-based quantization for accurate and accelerated INT4/INT8 inference #18768
Conversation
Signed-off-by: Alex Kogan <alex.kogan@oracle.com>
👋 Hi! Thank you for contributing to the vLLM project. 💬 Join our developer Slack at https://slack.vllm.ai to discuss your PR in #pr-reviews, coordinate on features in #feat- channels, or join special interest groups in #sig- channels. Just a reminder: PRs will not trigger a full CI run by default; only a limited subset of checks runs automatically. Once the PR is approved and ready to go, your PR reviewer(s) can run CI to test the changes comprehensively before merging.
Signed-off-by: Alex Kogan <alex.kogan@oracle.com>
@mgoin @robertgshaw2-redhat @tlrmchlsmth Can you please take a look at this PR and let me know if you have any comments? Thanks!
from tests.quantization.utils import is_quant_method_supported

MODELS = ["microsoft/Phi-3-mini-4k-instruct"]
Is there a reason why you are using Phi here, like to get around sharded weight loading? IIRC Phi models have their mergeable layers like q/k/v already merged in the checkpoint as qkv_proj. I notice you override the weight loading with your RTNParameter class, so I'm curious whether it works with an un-merged checkpoint like Llama.
Yes, absolutely, it works with any dense model, including un-merged Llama checkpoints. The Phi model is an arbitrary choice of a small dense model; happy to change it to something else.
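For reference, here is a minimal sketch of how the test could also cover an un-merged checkpoint, assuming the quantization method is registered under the name "rtn" and following the style of the existing tests in tests/quantization; the Llama model name below is just an illustrative placeholder:

```python
# Hypothetical sketch only; mirrors the style of existing vLLM quantization tests.
import pytest

from tests.quantization.utils import is_quant_method_supported
from vllm import LLM, SamplingParams

MODELS = [
    "microsoft/Phi-3-mini-4k-instruct",  # q/k/v already merged as qkv_proj
    "meta-llama/Llama-3.2-1B-Instruct",  # un-merged checkpoint (illustrative choice)
]


@pytest.mark.skipif(not is_quant_method_supported("rtn"),
                    reason="RTN is not supported on this GPU type.")
@pytest.mark.parametrize("model", MODELS)
def test_rtn(model: str) -> None:
    # Quantize on-the-fly while loading and run a short generation as a smoke test.
    llm = LLM(model=model, quantization="rtn", max_model_len=2048)
    outputs = llm.generate(["Hello, my name is"], SamplingParams(max_tokens=8))
    assert outputs[0].outputs[0].text
```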
Co-authored-by: Michael Goin <mgoin64@gmail.com> Signed-off-by: Alex Kogan <alex.kogan@oracle.com>
"Yet, RTN is often believed to lag behind more advanced quantization techniques in two crucial areas – generation throughput and accuracy." How does it look like now with your latest improvement? |
You can take a look at the paper mentioned above for more detail.
…accelerated INT4/INT8 inference (vllm-project#18768) Signed-off-by: Alex Kogan <alex.kogan@oracle.com> Co-authored-by: Michael Goin <mgoin64@gmail.com> Signed-off-by: Jinzhen Lin <linjinzhen@hotmail.com>
This PR adds basic support for RTN quantization, as a first step toward calibration-free RTN-based quantization for accurate and accelerated INT4/INT8 inference (see this paper for details).
RTN is a simple quantization method that does not require any calibration data or a corresponding calibration process.
As such, it can be applied on-the-fly (i.e., while loading the original model) quickly and cheaply, even on a system that does not have enough memory to host the original (unquantized) model. Yet, RTN is often believed to lag behind more advanced quantization techniques in two crucial areas – generation throughput and accuracy.
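For readers unfamiliar with RTN, the following minimal sketch illustrates the basic idea of symmetric per-group round-to-nearest quantization; it is illustrative only and not necessarily the exact scheme or kernel-level implementation used in this PR:

```python
# Illustrative round-to-nearest (RTN) quantization with per-group scales.
# Assumes in_features is divisible by group_size; not the PR's actual kernels.
import torch


def rtn_quantize(weight: torch.Tensor, num_bits: int = 4, group_size: int = 128):
    """Symmetric per-group RTN quantization of a 2-D weight matrix."""
    out_features, in_features = weight.shape
    w = weight.reshape(out_features, in_features // group_size, group_size)
    max_int = 2 ** (num_bits - 1) - 1
    # One scale per group, chosen so the max-magnitude value maps to max_int.
    scales = w.abs().amax(dim=-1, keepdim=True).clamp(min=1e-8) / max_int
    q = torch.clamp(torch.round(w / scales), -max_int - 1, max_int)
    return q.reshape(out_features, in_features).to(torch.int8), scales.squeeze(-1)


def rtn_dequantize(q: torch.Tensor, scales: torch.Tensor, group_size: int = 128):
    """Reconstruct an approximate FP weight from quantized values and scales."""
    out_features, in_features = q.shape
    w = q.reshape(out_features, in_features // group_size, group_size).float()
    return (w * scales.unsqueeze(-1)).reshape(out_features, in_features)
```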
As this paper shows, both issues can be alleviated through the use of efficient CUDA kernels based on Marlin (for throughput) and selective quantization (for accuracy). The latter is a simple mechanism that allows a user to select layers and/or specific linear modules that should be quantized to a higher precision. For instance, leaving just a part of one layer of the Llama-3.1 70B model in 8-bit precision, while quantizing the rest of that layer and all other 79 layers to 4 bits, leads to a substantially improved recovery rate, on par with or better than other techniques:

Note that this adds less than 0.05 bits per weight on average, resulting in only an insignificant memory increase.
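To make the selective-quantization idea above concrete, here is a purely hypothetical sketch of how per-module bit widths could be expressed; the names and the configuration interface are illustrative and may differ from what this PR actually implements:

```python
# Hypothetical selective-quantization configuration; illustrative only.
# Most modules use the default 4-bit precision, while selected linear modules
# (identified by name) are kept in 8-bit precision.
SELECTIVE_QUANT_CONFIG = {
    "default_bits": 4,
    "overrides": {
        # Keep part of one layer in 8-bit precision (module name is a placeholder).
        "model.layers.0.mlp.down_proj": 8,
    },
}


def bits_for_module(name: str, config: dict = SELECTIVE_QUANT_CONFIG) -> int:
    """Return the bit width to use when quantizing the given linear module."""
    return config["overrides"].get(name, config["default_bits"])
```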
As noted above, this PR provides a basic Python-based implementation of RTN that supports quantizing models on-the-fly.
Once approved, we intend to enhance it with: