Description
Updated: A custom loader node for ComfyUI is now available:
https://github.com/phaserblast/ComfyUI-DGXSparkSafetensorsLoader
Your question
Loading a single .safetensors file on DGX Spark causes a problem because of the mmap strategy used by the model loader. This happens even when the --disable-mmap option is used. I have been testing FLUX.2-dev and cannot load the FP16 model, despite the DGX Spark having plenty of RAM for both the model and the text encoder. mmap appears to be a disaster on DGX Spark because of its coherent (unified) memory implementation: the safetensors loader effectively loads the model twice, first into "RAM" and then as a copy into "VRAM," which obviously fails since there is no separate RAM/VRAM, so we run out of memory loading large models.
This is also a problem with llama-server and LM Studio: with mmap enabled, llama-server first loads GGUF models into "RAM" and then copies them to "VRAM." Disabling mmap solves the problem; models are loaded directly into memory once, without the extra move from "RAM" to "VRAM." A similar workaround in the safetensors model loader would be great and would save time.
Update:
Everything works perfectly and as expected with the BF16 GGUF version of FLUX.2-dev from here:
https://huggingface.co/city96/FLUX.2-dev-gguf
The model loads without doubling the RAM/VRAM requirement, holding just under 90GB with the text encoder also loaded. So the problem is somewhere in the .safetensors loader.
Update 2:
I figured out a way to prevent the memory ballooning when loading a .safetensors file the normal way. It requires an edit to ComfyUI/comfy/utils.py. The relevant lines are:

```python
if DISABLE_MMAP: # TODO: Not sure if this is the best way to bypass the mmap issues
    tensor = tensor.to(device=device, copy=True)
```

Changing copy=True to copy=False allows the model to load without the machine running out of memory:

```python
    tensor = tensor.to(device=device, copy=False) # For DGX Spark
```
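The effect of that one-word change can be seen in plain PyTorch, independent of ComfyUI: Tensor.to with copy=True always allocates fresh storage, while copy=False returns the original tensor untouched whenever no device or dtype conversion is needed. A minimal illustration:

```python
import torch

t = torch.ones(1024)

# copy=False: no conversion is needed, so the original storage is reused.
aliased = t.to(device=t.device, copy=False)

# copy=True: forces a fresh allocation even though nothing changed.
# On a unified-memory machine, this is the duplicate that exhausts RAM.
duplicated = t.to(device=t.device, copy=True)

print(aliased.data_ptr() == t.data_ptr())      # True: same storage
print(duplicated.data_ptr() == t.data_ptr())   # False: new allocation
```

Note this only avoids the copy when source and destination resolve to the same memory, which is presumably why it works on DGX Spark's coherent memory but is not the default on machines with discrete VRAM.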
Total memory usage is a bit higher than with the GGUF version, but inference runs slightly faster. Both models generate identical output.
Remember to launch ComfyUI with the --disable-mmap option.
Logs
Other
No response