
Conversation

@Vaibhavs10 (Member)

No description provided.

@ArthurZucker (Collaborator) left a comment

Cool, make sure you compare before vs. after, and show that before the fix it would indeed overwrite the weights of a linear layer, for example, if one of its keys was not loaded. In other words, make sure you repro the issue and show that it fails without the fix.

@ArthurZucker requested a review from Copilot on November 7, 2025 at 15:48
@HuggingFaceDocBuilderDev

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.

Copilot AI left a comment

Pull Request Overview

This PR enhances the model initialization logic to preserve non-missing parameters and buffers when loading models with partially missing weights. The key improvement is that when some (but not all) parameters of a module are missing from a checkpoint, the existing parameters are now preserved rather than being re-initialized.

  • Adds module-level initialization logic to selectively initialize only missing parameters/buffers while preserving loaded ones (sketched below)
  • Optimizes performance by converting the missing_keys list to a set for O(1) lookups
  • Refactors the not_initialized_parameters calculation to execute earlier and allow an early return
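
A minimal sketch of what such selective, module-level initialization can look like (the helper name and structure are illustrative assumptions, not the actual modeling_utils implementation):

  import torch
  import torch.nn as nn

  def initialize_missing_only(model: nn.Module, missing_keys: list[str]) -> None:
      missing = set(missing_keys)  # set membership is O(1), unlike scanning a list
      for module_name, module in model.named_modules():
          local_tensors = dict(module.named_parameters(recurse=False))
          local_tensors.update(dict(module.named_buffers(recurse=False)))
          prefix = f"{module_name}." if module_name else ""
          missing_here = {name for name in local_tensors if prefix + name in missing}
          if not missing_here:
              continue  # every tensor of this module was loaded; leave it untouched
          # Snapshot the tensors that WERE loaded, re-run the module's own init
          # (which can only act on the whole module), then restore the snapshots
          # so that only the missing tensors end up re-initialized.
          preserved = {
              name: tensor.detach().clone()
              for name, tensor in local_tensors.items()
              if name not in missing_here
          }
          if hasattr(module, "reset_parameters"):
              module.reset_parameters()
          with torch.no_grad():
              for name, saved in preserved.items():
                  local_tensors[name].copy_(saved)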

Reviewed Changes

Copilot reviewed 2 out of 2 changed files in this pull request and generated 4 comments.

  • src/transformers/modeling_utils.py: Implements the module-level initialization that preserves non-missing weights, adds a performance optimization using sets, and refactors the initialization flow
  • tests/test_modeling_common.py: Adds two test cases to verify the new behavior: one for preserving weights when biases are missing, and one for tied decoder weights

Comment on lines 3902 to 3903
@require_torch
def test_missing_linear_bias_does_not_override_weight():

Copilot AI Nov 7, 2025

These test functions are defined at module level (outside the ModelTesterMixin class). According to the project conventions, tests should typically be added to existing test classes rather than as standalone functions. Consider moving these tests into the ModelTesterMixin class or documenting why they need to be standalone tests.

Comment on lines +5151 to +5154
module_paths = sorted(
    set(missing_params_by_module.keys()) | set(missing_buffers_by_module.keys()),
    key=lambda name: name.count("."),
    reverse=True,

Copilot AI Nov 7, 2025

The sorting logic (by dot count in reverse order) is important for proper parent-child module initialization but lacks a comment explaining why modules must be processed from deepest to shallowest. Add a comment to clarify that this ensures child modules are initialized before parents.
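
To make the ordering concrete, a small illustrative example (the module paths are invented for illustration):

  # Sorting by dot count in reverse visits deeper (child) modules first,
  # so a parent's initialization pass cannot clobber a child that has
  # already been selectively initialized.
  module_paths = ["bert.encoder.layer.0.output", "bert.encoder", "classifier"]
  ordered = sorted(module_paths, key=lambda name: name.count("."), reverse=True)
  print(ordered)  # ['bert.encoder.layer.0.output', 'bert.encoder', 'classifier']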

@Vaibhavs10 (Member, Author) left a comment

I tracked down the regression you pointed out. On upstream main, the linear head weight really does get re-randomized when only the bias is missing:

  git worktree add /tmp/transformers-upstream upstream/main
  PYTHONPATH=/tmp/transformers-upstream/src python - <<'PY'
  import torch
  from transformers import BertConfig, BertForSequenceClassification

  config = BertConfig(hidden_size=8, intermediate_size=16, num_attention_heads=2,
                      num_hidden_layers=1, num_labels=2)
  model = BertForSequenceClassification(config)
  with torch.no_grad():
      sentinel = torch.arange(model.classifier.weight.numel(),
                              dtype=torch.float32).view_as(model.classifier.weight)
      model.classifier.weight.copy_(sentinel)
      model.classifier.bias.zero_()

  model.classifier.weight._is_hf_initialized = True
  if hasattr(model.classifier.bias, "_is_hf_initialized"):
      delattr(model.classifier.bias, "_is_hf_initialized")
  model.classifier._is_hf_initialized = False

  model._initialize_missing_keys(['classifier.bias'], is_quantized=False)

  print('old diff', torch.max(torch.abs(model.classifier.weight - sentinel)).item())
  PY

  Output:

  old diff 15.002209663391113

  With the patch in this branch:

  python - <<'PY'
  import torch
  from transformers import BertConfig, BertForSequenceClassification

  config = BertConfig(hidden_size=8, intermediate_size=16, num_attention_heads=2,
                      num_hidden_layers=1, num_labels=2)
  model = BertForSequenceClassification(config)
  with torch.no_grad():
      sentinel = torch.arange(model.classifier.weight.numel(),
                              dtype=torch.float32).view_as(model.classifier.weight)
      model.classifier.weight.copy_(sentinel)
      model.classifier.bias.zero_()

  model.classifier.weight._is_hf_initialized = True
  if hasattr(model.classifier.bias, "_is_hf_initialized"):
      delattr(model.classifier.bias, "_is_hf_initialized")
  model.classifier._is_hf_initialized = False

  model._initialize_missing_keys(['classifier.bias'], is_quantized=False)

  print('new diff', torch.max(torch.abs(model.classifier.weight - sentinel)).item())
  PY

  Output:

  new diff 0.0

To keep this covered going forward, I also added a regression test (test_missing_linear_bias_preserves_weight_custom_model) that reproduces the exact flag state the loader sets; it fails on main and passes here.

@Cyrilvallez (Member)

Not sure what the discussions were, but I just want to point out that this is known on main and is not an issue: param loading happens after the call to _initialize_missing_keys, so even if we do overwrite some params (because we can only init full modules), they are then correctly loaded afterwards.

TL;DR: _initialize_missing_keys is never called on its own; it always runs BEFORE the actual loading.
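
Roughly, the ordering being described looks like this (a simplified sketch with a hypothetical load_pretrained wrapper, not the real from_pretrained code path):

  def load_pretrained(model, state_dict):
      expected = dict(model.named_parameters())
      expected.update(dict(model.named_buffers()))
      missing_keys = [name for name in expected if name not in state_dict]

      # 1) Initialize whatever the checkpoint does not provide. This may touch
      #    whole modules, so some already-correct tensors can be overwritten here...
      model._initialize_missing_keys(missing_keys, is_quantized=False)

      # 2) ...but the checkpoint tensors are copied in afterwards, so anything
      #    present in the state dict ends up with its loaded value anyway.
      model.load_state_dict(state_dict, strict=False)
      return model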
