Loading .safetensors files requires double memory on DGX Spark #10896

@phaserblast

Description

Updated: A custom loader node for ComfyUI is now available:

https://github.com/phaserblast/ComfyUI-DGXSparkSafetensorsLoader

Your question

Loading a single .safetensors file on DGX Spark causes a problem because of the mmap strategy used by the model loader, and it happens even when the --disable-mmap option is used. I have been testing FLUX.2-dev and cannot load the FP16 model, despite the DGX Spark having plenty of RAM for both the model and the text encoder. mmap appears to interact badly with the DGX Spark's coherent (unified) memory: the safetensors loader effectively loads the model twice, first into "RAM" and then as a copy into "VRAM." Since there is no separate RAM/VRAM pool on this machine, the duplicate allocation runs the system out of memory on large models.

This is also a problem with llama-server and LM Studio. With mmap enabled, llama-server tries to load GGUF models into "RAM" first, then copy to "VRAM." Disabling mmap solves the problem, and models can be loaded directly into memory once without the additional move from "RAM" to "VRAM." A similar workaround with the safetensors model loader would be great, and would save time.

Update:

Everything works perfectly and as expected with the BF16 GGUF version of FLUX.2-dev from here:
https://huggingface.co/city96/FLUX.2-dev-gguf
The model loads without doubling the RAM/VRAM requirement, holding just under 90 GB with the text encoder also loaded. So the problem lies somewhere in the .safetensors loader.

Update 2:

I figured out a way to prevent the ballooning memory when loading a .safetensors file the normal way. It requires an edit to ComfyUI/comfy/utils.py:

```python
if DISABLE_MMAP:  # TODO: Not sure if this is the best way to bypass the mmap issues
    tensor = tensor.to(device=device, copy=True)
```

Changing copy=True to copy=False allows the model to load without the machine running out of memory:

```python
if DISABLE_MMAP:
    tensor = tensor.to(device=device, copy=False)  # For DGX Spark
```
Total memory usage is a bit higher than the GGUF version, but inference runs slightly faster. Both models generate identical output.
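Why the one-word change helps comes down to Tensor.to semantics: with copy=True, PyTorch always allocates a fresh tensor even when the source already lives on the target device, which is exactly the duplicate buffer that blows up a unified-memory machine; with copy=False, the call is a no-op when no device move or dtype conversion is needed. A small demo of that behavior (shown on CPU so it runs anywhere):

```python
import torch

t = torch.ones(3)

# copy=True always allocates a new tensor, even when the source is
# already on the target device -- this is the extra allocation that
# doubles memory use when "RAM" and "VRAM" are the same physical pool.
a = t.to(device="cpu", copy=True)
assert a.data_ptr() != t.data_ptr()

# copy=False returns the original tensor unchanged when nothing needs
# to move or convert, so no second buffer is allocated.
b = t.to(device="cpu", copy=False)
assert b.data_ptr() == t.data_ptr()
```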

Remember to launch comfy with the --disable-mmap option.

