System Info
- `transformers` version: 5.0.0
- Platform: Linux-6.14.0-37-generic-x86_64-with-glibc2.39
- Python version: 3.12.3
- Huggingface_hub version: 1.3.4
- Safetensors version: 0.7.0
- Accelerate version: 1.12.0
- Accelerate config: not found
- DeepSpeed version: not installed
- PyTorch version (accelerator?): 2.9.1+cu129 (CUDA)
- Using distributed or parallel set-up in script?: manual launcher
- Using GPU in script?: yes
- GPU type: NVIDIA RTX PRO 6000 Blackwell Max-Q Workstation Edition
Who can help?
Information
- The official example scripts
- My own modified scripts
Tasks
- An officially supported task in the `examples` folder (such as GLUE/SQuAD, ...)
- My own task or dataset (give details below)
Reproduction
- Load a large model using FSDP2 with FSDP_CPU_RAM_EFFICIENT_LOADING=True. I don't think this is model-specific, but I ran into it with Qwen/Qwen3-30B-A3B-Instruct-2507 (a rough launch sketch follows this list)
- Watch the CPU RAM usage of all of the ranks
- See that every rank temporarily uses an amount of CPU RAM roughly equal to the size of the model being loaded
- See that the system gets stuck for a long time right after the weights are loaded
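Not the exact script I used (my launcher is manual), but a minimal sketch of the setup. The launcher and the extra FSDP env vars are assumptions; the model name is the one from above:

```python
# repro.py — launch with something like (assumed launcher/env vars):
#   ACCELERATE_USE_FSDP=true FSDP_CPU_RAM_EFFICIENT_LOADING=True \
#   torchrun --nproc_per_node=4 repro.py
import torch.distributed as dist
from transformers import AutoModelForCausalLM

dist.init_process_group()  # backend picked from the environment set by the launcher

# While this call runs, every rank's resident memory climbs to roughly the
# full model size, and all ranks stall for a long time right after the
# checkpoint shards are read.
model = AutoModelForCausalLM.from_pretrained("Qwen/Qwen3-30B-A3B-Instruct-2507")

print(f"rank {dist.get_rank()}: model loaded")
```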
I think there are two separate but related issues happening here:
- The model loading code doesn't check FSDP_CPU_RAM_EFFICIENT_LOADING until _move_missing_keys_from_meta_to_device, which runs after the model weights have already been loaded. At that point it discards the loaded weights and inserts new empty tensors
- These empty tensors are treated as uninitialized, so they all undergo random initialization. Since they live on the CPU, this can take a very long time (a small standalone illustration follows this list)
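For a sense of scale, here is a standalone illustration (not transformers code, and the tensor size is an arbitrary stand-in) of how slow random init on CPU gets:

```python
# Random init of a few GB of fp32 on CPU already takes noticeable time,
# and it scales roughly linearly with the parameter count.
import time
import torch

n = 500_000_000  # ~2 GB of fp32; a 30B-parameter model is ~60x larger
t = torch.empty(n, dtype=torch.float32, device="cpu")

start = time.perf_counter()
torch.nn.init.normal_(t, mean=0.0, std=0.02)
print(f"random init of {n / 1e6:.0f}M params took {time.perf_counter() - start:.2f}s")
```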
The diff below makes both of these issues go away for me. Setting _is_hf_initialized on the new tensors seems like it might be the intended way of handling the second. I'm not sure about the first; I just used a kludge that skips all of the loading entirely on the non-zero ranks. This works but causes a lot of loud warnings about missing keys. I also tried changing get_device() to return "meta", but that crashed quickly and I didn't investigate further.
```diff
--- modeling_utils.py.orig	2026-02-04 06:59:59.107266669 +0000
+++ modeling_utils.py	2026-02-02 05:50:43.453427661 +0000
@@ -497,10 +497,11 @@
 def _load_parameter_into_model(model: "PreTrainedModel", param_name: str, tensor: torch.Tensor):
     """Cast a single parameter or buffer `param_name` into the `model`, with value `tensor`."""
     parent, param_type = get_module_from_name(model, param_name)
     if param_type in parent._parameters and not isinstance(tensor, nn.Parameter):
         tensor = nn.Parameter(tensor, requires_grad=tensor.is_floating_point())
+    tensor._is_hf_initialized = True
     # We need to use setattr here, as we set non-persistent buffers as well with this function (`load_state_dict`
     # does not allow to do it)
     setattr(parent, param_type, tensor)
--- core_model_loading.py.orig	2026-02-04 07:03:09.751218636 +0000
+++ core_model_loading.py	2026-02-04 07:02:54.937066957 +0000
@@ -1134,10 +1134,15 @@
     pattern_to_converter = {k: converter for converter in converters for k in converter.source_patterns}
     state_dict = sorted(state_dict.items(), key=lambda kv: dot_natural_key(kv[0]))
+    from .integrations import is_fsdp_enabled
+    from .modeling_utils import is_local_dist_rank_0
+    if is_fsdp_enabled() and not is_local_dist_rank_0() and hf_quantizer is None:
+        state_dict = []
+
     for original_key, tensor in state_dict:
         # 1. Rename the key according to all renaming pattern and optional weight converter patterns
         renamed_key, source_pattern = rename_source_key(
             original_key, renamings, converters, prefix, meta_model_state_dict
         )
```
Expected behavior
- Only rank 0 allocates a large amount of CPU RAM (a small check is sketched below)
- The system moves quickly from weights loading to the rest of the initialization
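For reference, this is roughly how I'd verify the first point (a sketch assuming psutil is available; drop it right after from_pretrained() returns in the repro script):

```python
# Print each rank's resident set size after loading; with the fix only
# rank 0 should report a model-sized number.
import os
import psutil

rank = int(os.environ.get("RANK", "0"))
rss_gb = psutil.Process().memory_info().rss / 1e9
print(f"rank {rank}: RSS after loading = {rss_gb:.1f} GB")
```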