System Info
- `transformers` version: 5.0.0
- Platform: Linux-6.14.0-37-generic-x86_64-with-glibc2.39
- Python version: 3.12.3
- Huggingface_hub version: 1.3.4
- Safetensors version: 0.7.0
- Accelerate version: 1.12.0
- Accelerate config: not found
- DeepSpeed version: not installed
- PyTorch version (accelerator?): 2.9.1+cu129 (CUDA)
- Using distributed or parallel set-up in script?: manual launcher
- Using GPU in script?: yes
- GPU type: NVIDIA RTX PRO 6000 Blackwell Max-Q Workstation Edition
Who can help?
Information
- The official example scripts
- My own modified scripts
Tasks
- An officially supported task in the `examples` folder (such as GLUE/SQuAD, ...)
- My own task or dataset (give details below)
Reproduction
- Load a large model using FSDP2 with FSDP_CPU_RAM_EFFICIENT_LOADING=True. I don't think this is model-specific, but I ran into it with Qwen/Qwen3-30B-A3B-Instruct-2507 (a rough launch sketch follows this list)
- Watch the CPU RAM usage of all of the ranks
- See that every rank temporarily uses an amount of CPU RAM roughly equal to the size of the model being loaded
- See that the system gets stuck for a long time right after the weights are loaded
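Not the exact script I used (my launcher is manual), but a minimal sketch of the setup. The launcher and the extra FSDP env vars are assumptions; the model name is the one from above:

```python
# repro.py — launch with something like (assumed launcher/env vars):
#   ACCELERATE_USE_FSDP=true FSDP_CPU_RAM_EFFICIENT_LOADING=True \
#   torchrun --nproc_per_node=4 repro.py
import torch.distributed as dist
from transformers import AutoModelForCausalLM

dist.init_process_group()  # backend picked from the environment set by the launcher

# While this call runs, every rank's resident memory climbs to roughly the
# full model size, and all ranks stall for a long time right after the
# checkpoint shards are read.
model = AutoModelForCausalLM.from_pretrained("Qwen/Qwen3-30B-A3B-Instruct-2507")

print(f"rank {dist.get_rank()}: model loaded")
```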
I think there are two separate but related issues happening here:
- The model loading code doesn't check FSDP_CPU_RAM_EFFICIENT_LOADING until _move_missing_keys_from_meta_to_device, which runs after the model weights have already been loaded. At that point it discards the loaded weights and inserts new empty tensors
- These empty tensors are treated as uninitialized, so they all undergo random initialization. Since they live on the CPU, this can take a very long time (a small standalone illustration follows this list)
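For a sense of scale, here is a standalone illustration (not transformers code, and the tensor size is an arbitrary stand-in) of how slow random init on CPU gets:

```python
# Random init of a few GB of fp32 on CPU already takes noticeable time,
# and it scales roughly linearly with the parameter count.
import time
import torch

n = 500_000_000  # ~2 GB of fp32; a 30B-parameter model is ~60x larger
t = torch.empty(n, dtype=torch.float32, device="cpu")

start = time.perf_counter()
torch.nn.init.normal_(t, mean=0.0, std=0.02)
print(f"random init of {n / 1e6:.0f}M params took {time.perf_counter() - start:.2f}s")
```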
The diff below makes both of these issues go away for me. Setting _is_hf_initialized on the new tensors seems like it might be the intended way of handling the second. I'm not sure about the first; I just used a kludge that skips all of the loading entirely on the non-zero ranks. This works but causes a lot of loud warnings about missing keys. I also tried changing get_device() to return "meta", but that crashed quickly and I didn't investigate further.
```diff
--- modeling_utils.py.orig	2026-02-04 06:59:59.107266669 +0000
+++ modeling_utils.py	2026-02-02 05:50:43.453427661 +0000
@@ -497,10 +497,11 @@
 def _load_parameter_into_model(model: "PreTrainedModel", param_name: str, tensor: torch.Tensor):
     """Cast a single parameter or buffer `param_name` into the `model`, with value `tensor`."""
     parent, param_type = get_module_from_name(model, param_name)
     if param_type in parent._parameters and not isinstance(tensor, nn.Parameter):
         tensor = nn.Parameter(tensor, requires_grad=tensor.is_floating_point())
+    tensor._is_hf_initialized = True
     # We need to use setattr here, as we set non-persistent buffers as well with this function (`load_state_dict`
     # does not allow to do it)
     setattr(parent, param_type, tensor)
--- core_model_loading.py.orig	2026-02-04 07:03:09.751218636 +0000
+++ core_model_loading.py	2026-02-04 07:02:54.937066957 +0000
@@ -1134,10 +1134,15 @@
     pattern_to_converter = {k: converter for converter in converters for k in converter.source_patterns}
     state_dict = sorted(state_dict.items(), key=lambda kv: dot_natural_key(kv[0]))
+    from .integrations import is_fsdp_enabled
+    from .modeling_utils import is_local_dist_rank_0
+    if is_fsdp_enabled() and not is_local_dist_rank_0() and hf_quantizer is None:
+        state_dict = []
+
     for original_key, tensor in state_dict:
         # 1. Rename the key according to all renaming pattern and optional weight converter patterns
         renamed_key, source_pattern = rename_source_key(
             original_key, renamings, converters, prefix, meta_model_state_dict
         )
```
Expected behavior
- Only rank 0 allocates a large amount of CPU RAM (a small check is sketched below)
- The system moves quickly from weights loading to the rest of the initialization
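For reference, this is roughly how I'd verify the first point (a sketch assuming psutil is available; drop it right after from_pretrained() returns in the repro script):

```python
# Print each rank's resident set size after loading; with the fix only
# rank 0 should report a model-sized number.
import os
import psutil

rank = int(os.environ.get("RANK", "0"))
rss_gb = psutil.Process().memory_info().rss / 1e9
print(f"rank {rank}: RSS after loading = {rss_gb:.1f} GB")
```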