
Conversation

@rahul-tuli rahul-tuli commented Feb 12, 2025

This PR improves the from_pretrained model-loading pipeline in Hugging Face transformers by suppressing unnecessary warnings for models saved using compressed tensors (e.g., sparse and quantized models).

Currently, models that store compression metadata (e.g., bitmask, compressed, row_offsets) instead of direct weight tensors trigger misleading warnings. Due to our compression method, certain extra and missing keys are expected:

  • Extra keys: Represent compression metadata stored on disk but not explicitly expected by the model graph.
  • Missing keys: Expected in the model graph but omitted from storage since they can be reconstructed from compression metadata.

This PR fixes these issues by introducing key filtering mechanisms through Hugging Face’s hf_quantizer extension points.
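Concretely, the two hooks have roughly the shape sketched below; the method names are the ones used in this PR, but the exact signatures (including the prefix argument) are assumptions here rather than a copy of the diff:

from typing import List


class CompressedTensorsHfQuantizer:  # stand-in for the real HfQuantizer subclass
    def update_unexpected_keys(self, model, unexpected_keys: List[str], prefix: str) -> List[str]:
        # Runs before weights are loaded: drop compression-metadata keys
        # (bitmask, compressed, row_offsets, ...) from the "unexpected" list.
        ...

    def update_missing_keys_after_loading(self, model, missing_keys: List[str], prefix: str) -> List[str]:
        # Runs after weights are loaded: drop keys whose tensors the compressor
        # reconstructs in _process_model_after_weight_loading.
        ...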


Problem: Unnecessary Warnings for Compressed Models

🚨 Current Behavior (Before Fix)

When loading compressed models, users encounter:

  • "Missing key" warnings for weight tensors that will be reconstructed.
  • "Unexpected key" warnings for compression metadata in the checkpoint.

Expected Behavior (After Fix)

  • ✅ No unnecessary missing key warnings for compressed weights.
  • ✅ No unnecessary unexpected key warnings for compression metadata.
  • Seamless integration with Hugging Face’s model-loading process.

Solution: Key Filtering for Compressed Models

🔹 Step 1: Suppress Warnings for Expected Missing Keys

🔧 Fix:

  • update_missing_keys_after_loading() removes weight keys (e.g., .*weight) from missing_keys.
  • Since these weights are reconstructed later, the warning is unnecessary.

💡 Impact: No more misleading "missing key" warnings for compressed models.
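To make the filtering concrete, here is a minimal standalone sketch of the idea; the actual pattern list comes from compressed-tensors (see the dependency PR below), and the function name here is illustrative, not the real implementation:

import re

def filter_expected_missing_keys(missing_keys, expected_missing_patterns):
    # Drop keys that the compressor reconstructs after loading, so they no
    # longer show up in the "missing keys" warning.
    return [
        key
        for key in missing_keys
        if not any(re.match(f".*{pattern}", key) for pattern in expected_missing_patterns)
    ]

# Example: a reconstructable weight is filtered out, a genuinely missing bias is kept.
print(filter_expected_missing_keys(
    ["model.layers.0.mlp.down_proj.weight", "lm_head.bias"],
    ["weight"],
))  # -> ['lm_head.bias']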


🔹 Step 2: Suppress Warnings for Expected Unexpected Keys

🔧 Fix:

  • update_unexpected_keys() removes compression parameters (e.g., bitmask, compressed, row_offsets) from unexpected_keys.
  • These are metadata, not actual model weights, and should not raise warnings.

💡 Impact: No more misleading "unexpected key" warnings from compression metadata.
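A similar minimal sketch for the unexpected-key side; the parameter names below are just the examples from this description, and the real list is provided by the compressor:

import re

# Example compression-metadata suffixes named in this PR description.
COMPRESSION_PARAM_NAMES = ["bitmask", "compressed", "row_offsets"]

def filter_compression_metadata_keys(unexpected_keys):
    # Drop checkpoint entries that are compression metadata rather than real
    # model parameters, so they no longer trigger "unexpected key" warnings.
    return [
        key
        for key in unexpected_keys
        if not any(re.search(rf"{name}$", key) for name in COMPRESSION_PARAM_NAMES)
    ]

# Example: metadata keys are dropped, a genuinely unknown key is kept.
print(filter_compression_metadata_keys(
    ["model.layers.0.mlp.down_proj.bitmask", "model.layers.0.mlp.down_proj.row_offsets", "some.unknown.key"],
))  # -> ['some.unknown.key']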


🔹 Step 3: Seamless Integration with Model Loading

  • ✅ Uses existing Hugging Face extension points (hf_quantizer)—no changes to core transformers code.
  • Standard models remain unaffected—only compressed models benefit.
  • Ensures genuine issues (e.g., truly missing parameters) still raise warnings.
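From the user's perspective, the list above means nothing changes in how models are loaded; only the noise goes away. A hedged example (the checkpoint id is a placeholder, not a real repo):

from transformers import AutoModelForCausalLM

# Placeholder id for any checkpoint saved with compressed-tensors (sparse and/or quantized).
model = AutoModelForCausalLM.from_pretrained("my-org/sparse-quantized-model")

# Before this PR: spurious "missing key" / "unexpected key" warnings at this point.
# After this PR: warnings appear only for genuinely missing or unexpected parameters.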

Lifecycle Overview: Before and After the Fix

Step | Before (Current Behavior) ❌ | After (With Fix) ✅
--- | --- | ---
6 | Load checkpoint (compressed weights missing!) | Load checkpoint (compressed weights missing!)
7 | Identify missing keys ⚠ (Triggers Warning!) | Identify missing keys (Before Filtering) ✅
8 | Identify unexpected keys ⚠ (Triggers Warning!) | Filter missing keys (Removes .*weight) ✅
9 | Raise unnecessary warnings ❌ | Identify unexpected keys (Before Filtering) ✅
10 | Assign weights (weights missing!) | Filter unexpected keys (Removes compression params) ✅
14 | Set model to evaluation mode (with unnecessary warnings) | Apply quantization postprocessing (Reconstructs weights) ✅

Testing

This PR has been tested across multiple model configurations to ensure correctness:

  • Quantized-only models – No unexpected warnings.
  • Sparse-only models – No missing or unexpected key warnings.
  • Stacked cases (both sparse and quantized) – Loads correctly without unnecessary warnings.
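A rough sketch of what such a check can look like (not the actual tests in this PR; the checkpoint id is a placeholder, and I'm assuming the warnings are emitted via the transformers.modeling_utils logger):

import logging
import unittest

from transformers import AutoModelForCausalLM


class CompressedLoadingWarningTest(unittest.TestCase):
    def test_no_key_warnings(self):
        logger = logging.getLogger("transformers.modeling_utils")
        # assertNoLogs (Python 3.10+) fails if any WARNING is logged while loading.
        with self.assertNoLogs(logger, level="WARNING"):
            AutoModelForCausalLM.from_pretrained("my-org/sparse-quantized-model")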

Why This Should Be Merged

  • 🚀 Fixes a real issue affecting compressed models.
  • No impact on standard model-loading workflows.
  • 🔧 Uses Hugging Face’s extension points—no core code changes.
  • 🔄 Maintains extensibility while improving compressed model support.

Dependencies

This PR depends on neuralmagic/compressed-tensors#250, which introduces the necessary filtering mechanisms at the compression framework level.


Conclusion

With this fix:

  • ✅ Expected missing keys (e.g., .*weight) do not trigger warnings.
  • ✅ Known compression parameters do not raise unexpected key warnings.
  • ✅ Weights are reconstructed properly in postprocessing.
  • ✅ The Hugging Face from_pretrained API remains modular and extensible.

This significantly improves the developer experience when working with compressed models. Looking forward to feedback and merging this in! 🚀

@Rocketknight1

cc @SunMarc @MekkCyber for quantization!

@MekkCyber MekkCyber self-requested a review February 13, 2025 19:54

@brian-dellabetta brian-dellabetta left a comment

👍

@SunMarc SunMarc left a comment

Thanks for the detailed description of the PR! Left some minor comments.

Comment on lines +67 to +75
# We expect some keys to be missing for compressed models.
# This is fine as the weights are reconstructed by ModelCompressor
# in _process_model_after_weight_loading.

expected_missing_keys = self.compressor.get_missing_module_keys(model)
return [
    key for key in missing_keys if not any(re.match(f".*{pattern}", key) for pattern in expected_missing_keys)
]
A reviewer (Member) asked:

Could you explain why we can't do this step with update_missing_keys and it needs to be done after loading the weights? Also, I see that you do something similar with the unexpected_keys, but this is done prior to loading the weights.

@rahul-tuli (author) replied:

The key reason we update missing keys (e.g., weights) after loading is that compressed-tensors' decompression depends on the correct device placement from transformers. If we filter out the weight tensors too early (before loading), they will still be on the meta device during decompression, which breaks the pre-condition for reconstruction. This can lead to issues when trying to restore the weights later.

On the other hand, unexpected keys (compression metadata) are not actual model parameters but are only used for reconstruction, so they do not depend on device placement. This is why we can safely filter them before loading the weights.
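To illustrate the constraint (purely a sketch, not transformers internals): a parameter that has not been loaded yet still lives on the meta device, so there is no storage for decompression to write into.

import torch

# Before loading/device placement: no real storage behind the tensor.
w = torch.empty(4, 4, device="meta")
assert w.is_meta            # decompression has nowhere to write at this point

# After loading, the parameter is materialized on a real device,
# so the compressor's postprocessing can reconstruct its values.
w = torch.empty(4, 4, device="cpu")
assert not w.is_meta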

@SunMarc commented Feb 17, 2025

Also, make sure to rebase the PR; this should solve the issue with the tests in the CI.

* Introduce two new hooks in HfQuantizer lifecycle to allow updates to missing and unexpected keys
* Update missing and unexpected keys for stacked compressors
* Add tests
* Fix: run_compressed cases
* Fix: uncompressed cases
  Move RunCompressedTest to the same file
  Update tests to unittest
@SunMarc SunMarc left a comment

Thanks for iterating!

@SunMarc SunMarc merged commit 884a8ea into huggingface:main Feb 24, 2025
21 checks passed
kylesayrs added a commit to vllm-project/llm-compressor that referenced this pull request Mar 12, 2025

## Purpose ##
* Remove warning silencing code that was previously needed for loading quantized models but is now handled by huggingface/transformers#36152

---------

Signed-off-by: Kyle Sayers <kylesayrs@gmail.com>