[Transform] [Quantization] Add transforms to compressed tensors #22486
Conversation
Code Review
This pull request introduces support for transforms within the compressed tensors quantization framework. The changes primarily involve updating CompressedTensorsConfig to manage transform configurations and modifying the linear method to apply these transforms during the forward pass. My review has identified two critical issues that need to be addressed. First, the logic for creating transform factories incorrectly overwrites the factories dictionary within a loop, which will lead to incorrect behavior when multiple transform schemes are present. Second, there is a signature mismatch in the call to the newly added is_match function, which will cause a TypeError at runtime.
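As an illustration of the first issue, here is a minimal sketch of the dictionary-overwrite anti-pattern and its fix; the names (`transform_schemes`, `scheme.name`, `scheme.make_factory`) are hypothetical stand-ins rather than the PR's actual code:

```python
# Minimal sketch of the anti-pattern described above (hypothetical names).
def build_factories_buggy(transform_schemes):
    for scheme in transform_schemes:
        # BUG: the dict is re-created on every iteration, so only the last
        # scheme's factory survives the loop.
        factories = {scheme.name: scheme.make_factory()}
    return factories


def build_factories_fixed(transform_schemes):
    factories = {}
    for scheme in transform_schemes:
        # Accumulate entries instead of overwriting the whole dict.
        factories[scheme.name] = scheme.make_factory()
    return factories
```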
Resolved review threads (outdated) on vllm/model_executor/layers/quantization/compressed_tensors/compressed_tensors.py
Force-pushed from d91886d to 43016eb
This pull request has merge conflicts that must be resolved before it can be merged.
Force-pushed from 496527e to 5954ccc
yewentao256 left a comment:
Let's run CI and see if it is correct
This pull request has merge conflicts that must be resolved before it can be merged.
Resolved review thread on vllm/model_executor/layers/quantization/compressed_tensors/compressed_tensors.py
From a referenced change that lists this PR as a prerequisite:

> **Purpose:** Support R4 transforms before R3. R3 requires hooking into the attention module, whereas R4 does not.
>
> **Prerequisites:** vllm-project/vllm#22486
>
> **Testing:** Performed sanity checks with HF and vLLM
# Online Hadamard Rotations

## Purpose
* Support transforms (online Hadamard rotations) within the compressed-tensors quantization framework

## Changes
### Added Transforms Weight Loading
* Added `SharedWeightParameter`
* A `HadamardTransform` module is attached to linear layers to load transform weights; it utilizes `SharedWeightParameter` in order to load weight partitions as separate tensors (see the sketch following this list)
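To make the weight-loading idea concrete, here is a rough, hypothetical sketch of a transform module attached next to a linear layer. It uses a plain `torch.nn.Parameter` where the PR uses `SharedWeightParameter`, and the class and attribute names are illustrative only:

```python
import torch


class HadamardTransformSketch(torch.nn.Module):
    """Illustrative stand-in for a transform module that owns a square
    rotation weight loaded alongside the linear layer it is attached to."""

    def __init__(self, size: int):
        super().__init__()
        # The PR loads this weight via SharedWeightParameter so that
        # tensor-parallel partitions can be loaded as separate tensors; a plain
        # Parameter (initialized to identity as a placeholder) keeps the
        # sketch self-contained.
        self.weight = torch.nn.Parameter(torch.eye(size), requires_grad=False)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Rotate the last dimension of the activations.
        return x @ self.weight


# Hypothetical usage: attach the transform next to an existing linear layer.
linear = torch.nn.Linear(4096, 4096, bias=False)
linear.input_transform = HadamardTransformSketch(linear.in_features)
```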
### Added Transforms Apply
* Added `CompressedTensorsLinearTransformMethod`, which wraps `CompressedTensorsLinearMethod` and `UnquantizedLinearMethod` and adds input and output transforms to either side of the original `apply` method (see the sketch following this list)
* Because [Core] Support weight_loader_v2 for `UnquantizedLinearMethod` #23036 has not landed, we must use a hack to switch back to `weight_loader_v1` if the method being wrapped is the `UnquantizedLinearMethod`
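The wrapping can be pictured as follows; this is a simplified, hypothetical rendering of the idea, not the actual `CompressedTensorsLinearTransformMethod`:

```python
import torch


class LinearTransformMethodSketch:
    """Simplified wrapper: run optional transforms on either side of the
    wrapped linear method's apply()."""

    def __init__(self, wrapped, input_transform=None, output_transform=None):
        self.wrapped = wrapped                    # e.g. the original linear method
        self.input_transform = input_transform    # callable applied to activations
        self.output_transform = output_transform  # callable applied to outputs

    def apply(self, layer: torch.nn.Module, x: torch.Tensor, bias=None):
        if self.input_transform is not None:
            x = self.input_transform(x)
        out = self.wrapped.apply(layer, x, bias)
        if self.output_transform is not None:
            out = self.output_transform(out)
        return out
```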
### Misc Changes
* `_shard_id_as_int` now lives on `BasevLLMParameter` so that its implementation can be shared with `SharedWeightParameter`
* Added `calculate_prompt_perplexity` for checking model coherence (a generic perplexity sketch follows this list)
* Added `weight_loader_v1` to `QKVCrossParallelLinear` to support the hack described above (and [Core] Support weight_loader_v2 for `UnquantizedLinearMethod` #23036)
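As background for the coherence check, prompt perplexity is just the exponentiated mean negative log-likelihood of the prompt tokens. The helper below is a generic sketch under that definition, not the PR's `calculate_prompt_perplexity` utility:

```python
import math

import torch


def prompt_perplexity(logits: torch.Tensor, token_ids: torch.Tensor) -> float:
    """Perplexity of `token_ids` (shape [seq_len], dtype long) under next-token
    `logits` (shape [seq_len, vocab_size]); logits[i] predicts token_ids[i + 1]."""
    log_probs = torch.log_softmax(logits[:-1], dim=-1)
    target_log_probs = log_probs.gather(-1, token_ids[1:].unsqueeze(-1)).squeeze(-1)
    return math.exp(-target_log_probs.mean().item())
```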
## Testing
* Added `test_compressed_tensors_transforms_perplexity` tests for SpinQuantR1R2R4 and QuIP