Unload to mmap when the CPU mem is low #11289
base: master
Conversation
…refine_offload
* allow offload quant
* rm cuda
* refine and pass test
Would it not be better to just drop the memory and reload tensors from the model file on disk that you already have? This looks like a manual implementation of a swap file; I don't think it's really helpful to write things to the disk. If you're going to do that, it will be easier and probably more efficient to just add swap space. I would like it if ComfyUI had the ability to drop tensors from memory when RAM runs low and reload them on demand, but if it involves disk writes, it's not going to be more efficient than OS swapping. I guess if you could load a checkpoint using mmap in a way that allows you to "return" mmapped memory to the disk (optimally letting the OS decide whether it's actually in memory or on disk), that would solve the problem pretty neatly.
Agree. This is the same behaviour as swap. I actually have some work-in-progress for what @asagi4 describes, by just going back to the model file in this scenario.
If I may interject: there is also the scenario in which the model safetensors sit on a disk slower than the one holding the swap file. For example, I keep the swap file on an NVMe SSD while the model is on a SATA SSD, so I prefer to reload the data from the swap file (NVMe speed) rather than from the model (SATA speed). Not to mention that keeping all the data in "RAM" (meaning actual RAM + swap) allows Windows to use memory compression wherever it can apply it (for example, my compressed area reaches about 17+ GB in Wan 2.2 workflows).
Yeah, I've thought about this case too. I think this setup is the exception rather than the rule, though. The majority of users will have either no swap or a same-disk swap, and they have a lot to gain by ditching that write-on-unload completely. Note that even if this change were made, your use case could still be handled by configuring your NVMe as a vanilla disk cache on top of your model library.
This is something I don't know how to do. Can it be done now, or will it be a startup argument in the future? Anyway, even if I can assign a ComfyUI model cache to the fastest drive, it will still be a cache separate from the Windows swap. Maybe I'm wrong or stubborn, but I would still like the option to let Windows manage its "continuous" RAM + swap space if the future modifications don't prove beneficial for my case. I bought the NVMe for the single purpose of making it some sort of "RAM expansion"; it was the best I could think of after I maxed out what my motherboard can accommodate (64 GB of RAM).
Just use --cache-ram x.x. It defaults to 4.0, I think. Let's say you have 32 GB of RAM: 4.0 would mean that if offloading were going to push usage past 28 GB, ComfyUI dumps cached items by priority, and you simply reload them if they're needed again. What you're trying to implement, as others have stated, is essentially yet another page file. That means more wear on SSDs, and it would likely be much slower than just reloading the model again, even counting all the initial processing of loading the model weights. And you'll see your drive's health rapidly decline when you're constantly writing tens of gigabytes.
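For readers unfamiliar with the flag, here is a rough sketch of the headroom logic described above (the names and check below are made up for illustration; this is not ComfyUI's actual implementation):

```python
import psutil

headroom_bytes = 4.0 * 1024**3  # hypothetical: the value passed via --cache-ram

def should_evict_cache() -> bool:
    # With 32 GB of total RAM and --cache-ram 4.0, eviction would start once
    # usage passes ~28 GB, i.e. once available RAM drops below 4 GB.
    return psutil.virtual_memory().available < headroom_bytes
```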
ComfyUI’s unload and load operations work on the parameters and buffers of an nn.Module, and typically only a subset of these tensors is unloaded. Releasing and restoring selected tensors from disk is complex and affects many parts of the existing unload/load implementation. In contrast, a tensor backed by an mmap storage behaves like a regular CPU tensor and can be moved to CPU or CUDA easily using the standard Tensor.to operation. This simplicity is the primary reason for using mmap-backed tensors.
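As a minimal illustration of that point (not the PR's actual code; the file name and sizes are invented), `torch.from_file` yields a CPU tensor backed by a memory-mapped file, and the standard `Tensor.to` moves it to the GPU like any other tensor:

```python
import torch

n = 4 * 1024 * 1024  # hypothetical element count
# shared=True maps the file read-write and creates it if needed.
t = torch.from_file("offload.bin", shared=True, size=n, dtype=torch.float16)

t.fill_(0.5)  # writes land in file-backed pages; the OS pages them in and out

if torch.cuda.is_available():
    on_gpu = t.to("cuda")  # an ordinary host-to-device copy, no special reload path
```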
You are right that both mmap and swap operate on VM pages, but swap is managed by the OS and only handles anonymous memory. In ComfyUI, most model weights are file-backed or explicitly managed tensors. By unloading them manually into a dedicated mmap file, we can control what gets evicted and when, instead of relying on global OS heuristics across all processes. This becomes important because CPU OOM often crashes the entire ComfyUI process. The mmap offload provides a predictable, controlled way to prevent such crashes, which swap alone may not reliably avoid in these scenarios. When CPU RAM approaches its limit, a GPU tensor will be offloaded to a dedicated mmap file on disk instead of causing an OOM in CPU or GPU memory.
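A hedged sketch of that decision, assuming a psutil-based free-RAM check and a per-tensor backing file (the helper name and threshold are illustrative, not the PR's code):

```python
import psutil
import torch

LOW_RAM = 5 * 1024**3  # e.g. the 5 GB threshold mentioned in this PR

def unload(t: torch.Tensor, path: str) -> torch.Tensor:
    """Move a GPU tensor to CPU RAM if there is room, otherwise into a
    dedicated mmap-backed file (hypothetical helper for illustration)."""
    needed = t.numel() * t.element_size()
    if psutil.virtual_memory().available > LOW_RAM + needed:
        return t.cpu()
    # Low RAM: land the copy in file-backed pages rather than anonymous
    # memory, so the OS can drop or write them back without touching swap.
    backing = torch.from_file(path, shared=True, size=t.numel(), dtype=t.dtype)
    mapped = backing.view(t.shape)
    mapped.copy_(t)  # direct device-to-mmap copy
    return mapped
```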
Going back to the original model file is indeed the most straightforward approach for full unloads. However, when only part of the model's parameters or buffers needs to be unloaded, wouldn't selectively loading them back from the model file become complicated?
When CPU memory is limited, unloading a GPU tensor back to CPU RAM may trigger OOM and crash the ComfyUI process.
This PR adds support for unloading models to an MMAP-backed disk file instead of CPU memory. By offloading weights to disk, it prevents CPU OOM conditions and avoids ComfyUI crashes during model unloading.
MMAP-backed tensors can be moved to GPU using the standard `to("cuda")` operation, making them straightforward to reload.

Usage
If available CPU memory falls below the configured threshold (e.g., 5 GB), ComfyUI will offload model weights to an MMAP-backed (disk-based) file rather than CPU RAM during unloading, thereby avoiding CPU memory exhaustion.