Fix the issue Arthur was talking about. #42092
base: main
Conversation
ArthurZucker left a comment:
Cool, make sure you test before vs. after, and show that before the fix it would indeed overwrite the weights of a linear layer, for example, if one of its keys was not loaded. So make sure you repro it and show that without the fix it fails.
The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.
Pull Request Overview
This PR enhances the model initialization logic to preserve non-missing parameters and buffers when loading models with partially missing weights. The key improvement is that when some (but not all) parameters of a module are missing from a checkpoint, the existing parameters are now preserved rather than being re-initialized.
- Adds module-level initialization logic to selectively initialize only missing parameters/buffers while preserving loaded ones (see the sketch below)
- Optimizes performance by converting the `missing_keys` list to a set for O(1) lookup
- Refactors the `not_initialized_parameters` calculation to execute earlier and allow an early return
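For illustration, here is a minimal sketch of that selective-initialization idea. It is not the actual code in `src/transformers/modeling_utils.py`; the helper name `init_only_missing` and the zero-init scheme are made up for the example.

```python
import torch
import torch.nn as nn

def init_only_missing(module: nn.Module, prefix: str, missing_keys: set) -> None:
    """Re-initialize only the parameters whose fully qualified name is in missing_keys."""
    for name, param in module.named_parameters(recurse=False):
        full_name = f"{prefix}.{name}" if prefix else name
        if full_name in missing_keys:   # set membership gives the O(1) lookup mentioned above
            nn.init.zeros_(param)       # placeholder init scheme, not the real _init_weights
        # parameters not listed as missing keep their loaded values untouched

layer = nn.Linear(4, 2)
loaded_weight = layer.weight.detach().clone()
init_only_missing(layer, "classifier", {"classifier.bias"})
assert torch.equal(layer.weight, loaded_weight)  # the loaded weight is preserved
assert torch.count_nonzero(layer.bias) == 0      # only the missing bias was re-initialized
```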
Reviewed Changes
Copilot reviewed 2 out of 2 changed files in this pull request and generated 4 comments.
| File | Description |
|---|---|
| `src/transformers/modeling_utils.py` | Implements module-level initialization that preserves non-missing weights, adds a performance optimization using sets, and refactors the initialization flow |
| `tests/test_modeling_common.py` | Adds two test cases to verify the new behavior: one for preserving weights when biases are missing, and one for tied decoder weights |
tests/test_modeling_common.py (Outdated)
```python
@require_torch
def test_missing_linear_bias_does_not_override_weight():
```
Copilot AI, Nov 7, 2025:
These test functions are defined at module level (outside the ModelTesterMixin class). According to the project conventions, tests should typically be added to existing test classes rather than as standalone functions. Consider moving these tests into the ModelTesterMixin class or documenting why they need to be standalone tests.
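For reference only, and not code from this PR: a hypothetical sketch of the convention the comment describes, with the test body elided.

```python
from transformers.testing_utils import require_torch


class ModelTesterMixin:  # in practice, the existing mixin in tests/test_modeling_common.py
    @require_torch
    def test_missing_linear_bias_does_not_override_weight(self):
        # same body as the standalone version, just moved under the shared test class
        ...
```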
```python
module_paths = sorted(
    set(missing_params_by_module.keys()) | set(missing_buffers_by_module.keys()),
    key=lambda name: name.count("."),
    reverse=True,
)
```
Copilot AI, Nov 7, 2025:
The sorting logic (by dot count in reverse order) is important for proper parent-child module initialization but lacks a comment explaining why modules must be processed from deepest to shallowest. Add a comment to clarify that this ensures child modules are initialized before parents.
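To make the requested comment concrete, here is an illustrative snippet showing why the reverse dot-count sort processes children before parents. The module paths and missing-key contents are invented for the example.

```python
# Illustrative only: the paths and missing keys below are made up.
missing_params_by_module = {
    "encoder.layer.0.attention": ["query.bias"],
    "encoder": ["pooler.weight"],
}
missing_buffers_by_module = {"encoder.layer.0": ["position_ids"]}

module_paths = sorted(
    set(missing_params_by_module) | set(missing_buffers_by_module),
    key=lambda name: name.count("."),  # dot count == depth of the module in the tree
    reverse=True,                      # deepest first, so children are handled before their parents
)
print(module_paths)
# ['encoder.layer.0.attention', 'encoder.layer.0', 'encoder']
```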
I tracked down the regression you pointed out. On upstream main, the linear head weight really does get re-randomized when only the bias is missing:
```bash
# Run the repro against upstream main checked out in a separate worktree.
git worktree add /tmp/transformers-upstream upstream/main
PYTHONPATH=/tmp/transformers-upstream/src python - <<'PY'
import torch
from transformers import BertConfig, BertForSequenceClassification

config = BertConfig(hidden_size=8, intermediate_size=16, num_attention_heads=2,
                    num_hidden_layers=1, num_labels=2)
model = BertForSequenceClassification(config)

# Write a recognizable sentinel into the classifier weight so any re-init is visible.
with torch.no_grad():
    sentinel = torch.arange(model.classifier.weight.numel(),
                            dtype=torch.float32).view_as(model.classifier.weight)
    model.classifier.weight.copy_(sentinel)
    model.classifier.bias.zero_()

# Reproduce the flag state the loader leaves behind when only the bias is missing.
model.classifier.weight._is_hf_initialized = True
if hasattr(model.classifier.bias, "_is_hf_initialized"):
    delattr(model.classifier.bias, "_is_hf_initialized")
model.classifier._is_hf_initialized = False

model._initialize_missing_keys(['classifier.bias'], is_quantized=False)
print('old diff', torch.max(torch.abs(model.classifier.weight - sentinel)).item())
PY
```
Output:

```
old diff 15.002209663391113
```
With the patch in this branch:
```bash
# Same repro, run against the checkout of this branch.
python - <<'PY'
import torch
from transformers import BertConfig, BertForSequenceClassification

config = BertConfig(hidden_size=8, intermediate_size=16, num_attention_heads=2,
                    num_hidden_layers=1, num_labels=2)
model = BertForSequenceClassification(config)

with torch.no_grad():
    sentinel = torch.arange(model.classifier.weight.numel(),
                            dtype=torch.float32).view_as(model.classifier.weight)
    model.classifier.weight.copy_(sentinel)
    model.classifier.bias.zero_()

model.classifier.weight._is_hf_initialized = True
if hasattr(model.classifier.bias, "_is_hf_initialized"):
    delattr(model.classifier.bias, "_is_hf_initialized")
model.classifier._is_hf_initialized = False

model._initialize_missing_keys(['classifier.bias'], is_quantized=False)
print('new diff', torch.max(torch.abs(model.classifier.weight - sentinel)).item())
PY
```
Output:

```
new diff 0.0
```
To keep this covered going forward I also added a regression test (`test_missing_linear_bias_preserves_weight_custom_model`) that reproduces the exact flag state the loader sets; it fails on main and passes here.
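The test added in the branch is not reproduced here; as a rough sketch of the shape such a regression test could take (reusing the repro above, with an invented test name) and assuming the same `_initialize_missing_keys` entry point:

```python
import torch
from transformers import BertConfig, BertForSequenceClassification
from transformers.testing_utils import require_torch


@require_torch
def test_missing_bias_preserves_loaded_weight_sketch():
    config = BertConfig(hidden_size=8, intermediate_size=16, num_attention_heads=2,
                        num_hidden_layers=1, num_labels=2)
    model = BertForSequenceClassification(config)
    with torch.no_grad():
        sentinel = torch.arange(model.classifier.weight.numel(),
                                dtype=torch.float32).view_as(model.classifier.weight)
        model.classifier.weight.copy_(sentinel)
    # Reproduce the loader's flag state: weight marked as loaded, bias reported missing.
    model.classifier.weight._is_hf_initialized = True
    model.classifier._is_hf_initialized = False
    model._initialize_missing_keys(["classifier.bias"], is_quantized=False)
    torch.testing.assert_close(model.classifier.weight, sentinel)
```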
Not sure what the discussions were, but I just want to pinpoint that this is known on TLDR.