
Conversation

@Vaibhavs10 (Member)

No description provided.

@ArthurZucker (Collaborator) left a comment

Cool, make sure you compare before vs. after, and show that before the fix it would indeed overwrite the weights of a linear layer, for example, if one of its keys was not loaded. In other words, make sure you repro the issue and show that it fails without the fix.

@ArthurZucker requested a review from Copilot on November 7, 2025 at 15:48
@HuggingFaceDocBuilderDev

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.

Copilot AI left a comment

Pull Request Overview

This PR enhances the model initialization logic to preserve non-missing parameters and buffers when loading models with partially missing weights. The key improvement is that when some (but not all) parameters of a module are missing from a checkpoint, the existing parameters are now preserved rather than being re-initialized.

  • Adds module-level initialization logic to selectively initialize only missing parameters/buffers while preserving loaded ones (sketched below)
  • Optimizes performance by converting the missing_keys list to a set for O(1) lookups
  • Refactors the not_initialized_parameters calculation to execute earlier and allow an early return
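
A minimal sketch of what such selective, module-level initialization can look like (the helper name and structure are illustrative assumptions, not the actual modeling_utils implementation):

  import torch
  import torch.nn as nn

  def initialize_missing_only(model: nn.Module, missing_keys: list[str]) -> None:
      missing = set(missing_keys)  # set membership is O(1), unlike scanning a list
      for module_name, module in model.named_modules():
          local_tensors = dict(module.named_parameters(recurse=False))
          local_tensors.update(dict(module.named_buffers(recurse=False)))
          prefix = f"{module_name}." if module_name else ""
          missing_here = {name for name in local_tensors if prefix + name in missing}
          if not missing_here:
              continue  # every tensor of this module was loaded; leave it untouched
          # Snapshot the tensors that WERE loaded, re-run the module's own init
          # (which can only act on the whole module), then restore the snapshots
          # so that only the missing tensors end up re-initialized.
          preserved = {
              name: tensor.detach().clone()
              for name, tensor in local_tensors.items()
              if name not in missing_here
          }
          if hasattr(module, "reset_parameters"):
              module.reset_parameters()
          with torch.no_grad():
              for name, saved in preserved.items():
                  local_tensors[name].copy_(saved)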

Reviewed Changes

Copilot reviewed 2 out of 2 changed files in this pull request and generated 4 comments.

  • src/transformers/modeling_utils.py: Implements the module-level initialization that preserves non-missing weights, adds a performance optimization using sets, and refactors the initialization flow
  • tests/test_modeling_common.py: Adds two test cases to verify the new behavior: one for preserving weights when biases are missing, and one for tied decoder weights

Comment on lines 3902 to 3903
@require_torch
def test_missing_linear_bias_does_not_override_weight():

Copilot AI Nov 7, 2025

These test functions are defined at module level (outside the ModelTesterMixin class). According to the project conventions, tests should typically be added to existing test classes rather than as standalone functions. Consider moving these tests into the ModelTesterMixin class or documenting why they need to be standalone tests.

Comment on lines +5151 to +5154
module_paths = sorted(
    set(missing_params_by_module.keys()) | set(missing_buffers_by_module.keys()),
    key=lambda name: name.count("."),
    reverse=True,

Copilot AI Nov 7, 2025

The sorting logic (by dot count in reverse order) is important for proper parent-child module initialization but lacks a comment explaining why modules must be processed from deepest to shallowest. Add a comment to clarify that this ensures child modules are initialized before parents.
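
To make the ordering concrete, a small illustrative example (the module paths are invented for illustration):

  # Sorting by dot count in reverse visits deeper (child) modules first,
  # so a parent's initialization pass cannot clobber a child that has
  # already been selectively initialized.
  module_paths = ["bert.encoder.layer.0.output", "bert.encoder", "classifier"]
  ordered = sorted(module_paths, key=lambda name: name.count("."), reverse=True)
  print(ordered)  # ['bert.encoder.layer.0.output', 'bert.encoder', 'classifier']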

@Vaibhavs10 (Member, Author) left a comment

I tracked down the regression you pointed out. On upstream main, the linear head weight really does get re-randomized when only the bias is missing:

  git worktree add /tmp/transformers-upstream upstream/main
  PYTHONPATH=/tmp/transformers-upstream/src python - <<'PY'
  import torch
  from transformers import BertConfig, BertForSequenceClassification

  config = BertConfig(hidden_size=8, intermediate_size=16, num_attention_heads=2,
                      num_hidden_layers=1, num_labels=2)
  model = BertForSequenceClassification(config)
  with torch.no_grad():
      sentinel = torch.arange(model.classifier.weight.numel(),
                              dtype=torch.float32).view_as(model.classifier.weight)
      model.classifier.weight.copy_(sentinel)
      model.classifier.bias.zero_()

  model.classifier.weight._is_hf_initialized = True
  if hasattr(model.classifier.bias, "_is_hf_initialized"):
      delattr(model.classifier.bias, "_is_hf_initialized")
  model.classifier._is_hf_initialized = False

  model._initialize_missing_keys(['classifier.bias'], is_quantized=False)

  print('old diff', torch.max(torch.abs(model.classifier.weight - sentinel)).item())
  PY

  Output:

  old diff 15.002209663391113

  With the patch in this branch:

  python - <<'PY'
  import torch
  from transformers import BertConfig, BertForSequenceClassification

  config = BertConfig(hidden_size=8, intermediate_size=16, num_attention_heads=2,
                      num_hidden_layers=1, num_labels=2)
  model = BertForSequenceClassification(config)
  with torch.no_grad():
      sentinel = torch.arange(model.classifier.weight.numel(),
                              dtype=torch.float32).view_as(model.classifier.weight)
      model.classifier.weight.copy_(sentinel)
      model.classifier.bias.zero_()

  model.classifier.weight._is_hf_initialized = True
  if hasattr(model.classifier.bias, "_is_hf_initialized"):
      delattr(model.classifier.bias, "_is_hf_initialized")
  model.classifier._is_hf_initialized = False

  model._initialize_missing_keys(['classifier.bias'], is_quantized=False)

  print('new diff', torch.max(torch.abs(model.classifier.weight - sentinel)).item())
  PY

  Output:

  new diff 0.0

To keep this covered going forward, I also added a regression test (test_missing_linear_bias_preserves_weight_custom_model) that reproduces the exact flag state the loader sets; it fails on main and passes here.

@Cyrilvallez (Member)

Not sure what the discussions were, but I just want to point out that this is known on main and is not an issue: param loading happens after the call to _initialize_missing_keys, so even if we do overwrite some params (because we can only init full modules), they are then correctly loaded afterwards.

TL;DR: _initialize_missing_keys is never called on its own; it always runs BEFORE the actual loading.
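
Roughly, the ordering being described looks like this (a simplified sketch with a hypothetical load_pretrained wrapper, not the real from_pretrained code path):

  def load_pretrained(model, state_dict):
      expected = dict(model.named_parameters())
      expected.update(dict(model.named_buffers()))
      missing_keys = [name for name in expected if name not in state_dict]

      # 1) Initialize whatever the checkpoint does not provide. This may touch
      #    whole modules, so some already-correct tensors can be overwritten here...
      model._initialize_missing_keys(missing_keys, is_quantized=False)

      # 2) ...but the checkpoint tensors are copied in afterwards, so anything
      #    present in the state dict ends up with its loaded value anyway.
      model.load_state_dict(state_dict, strict=False)
      return model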
