FSDP_CPU_RAM_EFFICIENT_LOADING broken #43749

@kmod

Description

System Info

  • transformers version: 5.0.0
  • Platform: Linux-6.14.0-37-generic-x86_64-with-glibc2.39
  • Python version: 3.12.3
  • Huggingface_hub version: 1.3.4
  • Safetensors version: 0.7.0
  • Accelerate version: 1.12.0
  • Accelerate config: not found
  • DeepSpeed version: not installed
  • PyTorch version (accelerator?): 2.9.1+cu129 (CUDA)
  • Using distributed or parallel set-up in script?: manual launcher
  • Using GPU in script?: yes
  • GPU type: NVIDIA RTX PRO 6000 Blackwell Max-Q Workstation Edition

Who can help?

@ArthurZucker @Cyrilvallez

Information

  • The official example scripts
  • My own modified scripts

Tasks

  • An officially supported task in the examples folder (such as GLUE/SQuAD, ...)
  • My own task or dataset (give details below)

Reproduction

  1. Load a large model using FSDP2 with FSDP_CPU_RAM_EFFICIENT_LOADING=True. I don't think this is model-specific, but I ran into it with Qwen/Qwen3-30B-A3B-Instruct-2507 (a minimal sketch of the repro follows this list)
  2. Watch the CPU RAM usage of all of the ranks
  3. See that every rank temporarily uses an amount of CPU RAM equal to the size of the model being loaded
  4. See that the system stalls for a long time right after the weights are loaded
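Something like the sketch below is enough to reproduce. It is not my exact script (I use a manual launcher): the env-var names are the ones accelerate normally sets, and the FSDP sharding itself is omitted since the problem happens inside from_pretrained, before the model is ever wrapped.

# repro.py -- minimal sketch, not my exact script; launch with e.g.:
#   ACCELERATE_USE_FSDP=true FSDP_CPU_RAM_EFFICIENT_LOADING=true \
#       torchrun --nproc_per_node=2 repro.py
import os

import torch
import torch.distributed as dist
from transformers import AutoModelForCausalLM

local_rank = int(os.environ.get("LOCAL_RANK", 0))
torch.cuda.set_device(local_rank)
dist.init_process_group("nccl")

# With ram-efficient loading working, only local rank 0 should materialize
# the checkpoint on CPU; instead, every rank's RSS grows by the model size
# and then sits in random init for a long time.
model = AutoModelForCausalLM.from_pretrained(
    "Qwen/Qwen3-30B-A3B-Instruct-2507",
    dtype=torch.bfloat16,
)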

I think there are two separate but related issues happening here:

  • The model loading code doesn't check for FSDP_CPU_RAM_EFFICIENT_LOADING until _move_missing_keys_from_meta_to_device, which runs after the model has already been loaded. At that point it discards the loaded weights and inserts new empty tensors
  • These empty tensors are seen as uninitialized, so they all undergo random initialization. Since they live on the CPU, this can take a very long time

The diff below makes both issues go away for me. Setting _is_hf_initialized on the new tensors seems like it might be the intended way of handling the second issue. I'm less sure about the first; I just used a kludge that skips all of the loading entirely on non-zero ranks. That works, but it produces a lot of loud warnings about missing keys. I also tried changing get_device() to return "meta", but that crashed quickly and I didn't investigate further.

--- modeling_utils.py.orig      2026-02-04 06:59:59.107266669 +0000
+++ modeling_utils.py   2026-02-02 05:50:43.453427661 +0000
@@ -497,10 +497,11 @@
 def _load_parameter_into_model(model: "PreTrainedModel", param_name: str, tensor: torch.Tensor):
     """Cast a single parameter or buffer `param_name` into the `model`, with value `tensor`."""
     parent, param_type = get_module_from_name(model, param_name)
     if param_type in parent._parameters and not isinstance(tensor, nn.Parameter):
         tensor = nn.Parameter(tensor, requires_grad=tensor.is_floating_point())
+        tensor._is_hf_initialized = True
     # We need to use setattr here, as we set non-persistent buffers as well with this function (`load_state_dict`
     # does not allow to do it)
     setattr(parent, param_type, tensor)
--- core_model_loading.py.orig  2026-02-04 07:03:09.751218636 +0000
+++ core_model_loading.py       2026-02-04 07:02:54.937066957 +0000
@@ -1134,10 +1134,15 @@
 
     pattern_to_converter = {k: converter for converter in converters for k in converter.source_patterns}
 
     state_dict = sorted(state_dict.items(), key=lambda kv: dot_natural_key(kv[0]))
 
+    from .integrations import is_fsdp_enabled
+    from .modeling_utils import is_local_dist_rank_0
+    if is_fsdp_enabled() and not is_local_dist_rank_0() and hf_quantizer is None:
+        state_dict = []
+
     for original_key, tensor in state_dict:
         # 1. Rename the key according to all renaming pattern and optional weight converter patterns
         renamed_key, source_pattern = rename_source_key(
             original_key, renamings, converters, prefix, meta_model_state_dict
         )

Expected behavior

  • Only rank 0 allocates a large amount of CPU RAM
  • The system moves quickly from weight loading to the rest of the initialization
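
A quick way to check the first point is to print each rank's resident set size right after from_pretrained returns (sketch using psutil; RANK is the variable torchrun sets):

import os

import psutil

# Illustrative check: with ram-efficient loading working, only local
# rank 0 should report a model-sized RSS here.
rss_gb = psutil.Process(os.getpid()).memory_info().rss / 1e9
print(f"rank {os.environ.get('RANK', '?')}: RSS after load = {rss_gb:.1f} GB")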
