Improve model loading for compressed tensor models #36152
Conversation
cc @SunMarc @MekkCyber for quantization!
The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.
brian-dellabetta left a comment
👍
Thanks for the detailed description of the PR! Left some minor comments.
# We expect some keys to be missing for
# compressed models
# This is fine as the weights are reconstructed by ModelCompressor
# in _process_model_after_weight_loading
expected_missing_keys = self.compressor.get_missing_module_keys(model)
return [
    key for key in missing_keys if not any(re.match(f".*{pattern}", key) for pattern in expected_missing_keys)
]
Could you explain why we can't do this step with update_missing_keys and it needs to be done after loading the weights? Also, I see that you do something similar with the unexpected_keys, but that is done prior to loading the weights.
The key reason we update missing keys (e.g., weights) after loading is that compressed-tensors' decompression depends on the correct device placement from transformers. If we filter out the weight tensors too early (before loading), they will still be on the meta device during decompression, which breaks the pre-condition for reconstruction. This can lead to issues when trying to restore the weights later.
On the other hand, unexpected keys (compression metadata) are not actual model parameters but are only used for reconstruction, so they do not depend on device placement. This is why we can safely filter them before loading the weights.
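To make the ordering concrete, here is a minimal sketch of the two hooks (the hook names follow this PR; the default no-op bodies and exact signatures are assumptions for illustration, not the actual transformers implementation):

```python
from typing import List

# Sketch only: assumed default hooks on the HfQuantizer base class.
# The compressed-tensors quantizer overrides both, as described above.
class HfQuantizerHooksSketch:
    def update_unexpected_keys(self, model, unexpected_keys: List[str]) -> List[str]:
        # Runs BEFORE weights are loaded: compression metadata (bitmask,
        # compressed, row_offsets) never becomes a parameter, so filtering
        # it early is safe.
        return unexpected_keys

    def update_missing_keys_after_loading(self, model, missing_keys: List[str]) -> List[str]:
        # Runs AFTER weights are loaded: weight keys must stay in
        # missing_keys until loading finishes so the tensors are on real
        # devices (not "meta") when ModelCompressor reconstructs them.
        return missing_keys
```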
Also make sure to rebase the PR, this should solve the issue with the tests in the CI.
* Introduce two new hooks in the HfQuantizer lifecycle to allow updates to missing and unexpected keys
* Update missing and unexpected keys for stacked compressors
* Add tests
* Fix: run_compressed cases
* Fix: uncompressed cases
* Move RunCompressedTest to the same file
* Update tests to unittest
Force-pushed from abdc743 to 7eb856c
SunMarc left a comment
Thanks for iterating!
## Purpose ##
* Remove warning silencing code that was previously needed for loading quantized models but is now handled by huggingface/transformers#36152

Signed-off-by: Kyle Sayers <kylesayrs@gmail.com>
This PR improves the `from_pretrained` model-loading pipeline in Hugging Face `transformers` by suppressing unnecessary warnings for models saved using compressed tensors (e.g., sparse and quantized models).
Currently, models that store compression metadata (e.g., `bitmask`, `compressed`, `row_offsets`) instead of direct weight tensors trigger misleading warnings. Due to our compression method, certain extra and missing keys are expected.
This PR fixes these issues by introducing key filtering mechanisms through Hugging Face's `hf_quantizer` extension points.
Problem: Unnecessary Warnings for Compressed Models
🚨 Current Behavior (Before Fix)
When loading compressed models, users currently encounter misleading "missing key" and "unexpected key" warnings.
✅ Expected Behavior (After Fix)
Loading a compressed model emits no warnings for keys that are expected to be missing or extra because of compression.
Solution: Key Filtering for Compressed Models
🔹 Step 1: Suppress Warnings for Expected Missing Keys
🔧 Fix: `update_missing_keys_after_loading()` removes weight keys (e.g., `.*weight`) from `missing_keys`.
💡 Impact: No more misleading "missing key" warnings for compressed models.
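As a small standalone illustration of this filtering (the key names and patterns here are made up for the example; the real patterns come from `get_missing_module_keys`):

```python
import re

# Hypothetical keys: one weight that the compressor will reconstruct, and one
# genuinely missing key that should still be reported.
missing_keys = ["model.layers.0.mlp.down_proj.weight", "lm_head.bias"]
expected_missing_patterns = ["weight"]  # illustrative stand-in for get_missing_module_keys(model)

filtered = [
    key
    for key in missing_keys
    if not any(re.match(f".*{pattern}", key) for pattern in expected_missing_patterns)
]
print(filtered)  # ['lm_head.bias'] -- only this one would still trigger a warning
```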
🔹 Step 2: Suppress Warnings for Expected Unexpected Keys
🔧 Fix: `update_unexpected_keys()` removes compression parameters (e.g., `bitmask`, `compressed`, `row_offsets`) from `unexpected_keys`.
💡 Impact: No more misleading "unexpected key" warnings from compression metadata.
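A minimal sketch of the unexpected-key filtering, assuming a hard-coded list of metadata names (`COMPRESSION_PARAM_NAMES` is a placeholder; in the actual implementation the names are provided by the compressor):

```python
import re
from typing import List

# Placeholder for the metadata names the compressor actually reports.
COMPRESSION_PARAM_NAMES = ["bitmask", "compressed", "row_offsets"]


def update_unexpected_keys(unexpected_keys: List[str]) -> List[str]:
    # Drop checkpoint entries that are compression metadata rather than real
    # parameters; they are only consumed during decompression.
    return [
        key
        for key in unexpected_keys
        if not any(re.match(f".*{name}", key) for name in COMPRESSION_PARAM_NAMES)
    ]
```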
🔹 Step 3: Seamless Integration with Model Loading
🔧 Fix: All filtering is implemented through the existing `hf_quantizer` extension points; no changes to core `transformers` code are required.
Lifecycle Overview: Before and After the Fix
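A rough sketch of where the hooks sit in the loading flow (ordering as described in this PR; the helper functions and exact signatures are simplified placeholders, not the real transformers internals):

```python
# Simplified, assumed lifecycle for loading a compressed checkpoint.
def load_compressed_model_sketch(model, state_dict, hf_quantizer):
    missing_keys, unexpected_keys = diff_keys(model, state_dict)  # placeholder helper

    # 1. Before loading: drop compression metadata (bitmask, compressed,
    #    row_offsets) from unexpected_keys.
    unexpected_keys = hf_quantizer.update_unexpected_keys(model, unexpected_keys)

    # 2. Load the weights so tensors land on their real devices (not "meta").
    load_weights_into_model(model, state_dict)  # placeholder helper

    # 3. After loading: drop weight keys that ModelCompressor will reconstruct.
    missing_keys = hf_quantizer.update_missing_keys_after_loading(model, missing_keys)

    # 4. Postprocessing: decompression rebuilds the actual weight tensors.
    hf_quantizer._process_model_after_weight_loading(model)

    report_loading_warnings(missing_keys, unexpected_keys)  # placeholder helper
    return model
```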
Testing
This PR has been tested across multiple model configurations (including run_compressed and uncompressed cases) to ensure correctness.
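For example, a check along these lines (illustrative only; the checkpoint id is a placeholder and this is not one of the tests added by the PR):

```python
import unittest

from transformers import AutoModelForCausalLM


class CompressedModelLoadingTest(unittest.TestCase):
    # Placeholder model id; the PR's own tests use real compressed-tensors checkpoints.
    MODEL_ID = "org/some-compressed-tensors-checkpoint"

    def test_weights_are_materialized_after_loading(self):
        # After decompression the reconstructed weights must live on a real
        # device, not on "meta" (the pre-condition discussed in the review).
        model = AutoModelForCausalLM.from_pretrained(self.MODEL_ID)
        for name, param in model.named_parameters():
            self.assertNotEqual(param.device.type, "meta", f"{name} was never materialized")


if __name__ == "__main__":
    unittest.main()
```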
Why This Should Be Merged
Dependencies
This PR depends on neuralmagic/compressed-tensors#250, which introduces the necessary filtering mechanisms at the compression framework level.
Conclusion
With this fix:
✅ Expected missing keys (e.g., `.*weight`) do not trigger warnings.
✅ Known compression parameters do not raise unexpected key warnings.
✅ Weights are reconstructed properly in postprocessing.
✅ The Hugging Face `from_pretrained` API remains modular and extensible.
This significantly improves the developer experience when working with compressed models. Looking forward to feedback and merging this in! 🚀
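For completeness, the end-user flow after this change looks like the following (the checkpoint id is a placeholder for any model saved with compressed-tensors):

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

# Placeholder checkpoint id; substitute a real compressed-tensors model.
model_id = "org/some-compressed-tensors-checkpoint"

tokenizer = AutoTokenizer.from_pretrained(model_id)
# With this PR, loading no longer prints misleading missing/unexpected key
# warnings: expected keys are filtered through the hf_quantizer hooks and the
# weights are reconstructed during postprocessing.
model = AutoModelForCausalLM.from_pretrained(model_id)

inputs = tokenizer("Hello, world!", return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=20)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```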