
Conversation

@strint strint commented Dec 12, 2025

When CPU memory is limited, unloading a GPU tensor back to CPU RAM may trigger OOM and crash the ComfyUI process.

This PR adds support for unloading models to an MMAP-backed disk file instead of CPU memory. By offloading weights to disk, it prevents CPU OOM conditions and avoids ComfyUI crashes during model unloading.

MMAP-backed tensors can be moved to GPU using the standard to("cuda") operation, making them straightforward to reload.

Usage

python main.py --offload-reserve-ram-gb 5

If the available CPU memory is below this threshold (e.g., 5 GB), ComfyUI will offload model weights to an MMAP (disk-based) file rather than CPU RAM during unloading, thereby avoiding CPU memory exhaustion.
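
For illustration only, here is a minimal sketch of the idea in PyTorch: copy a GPU tensor into an mmap-backed CPU tensor, then reload it with a plain .to("cuda"). This is not the PR's actual implementation; the helper name and file path are placeholders.

import torch

def offload_to_mmap(gpu_tensor: torch.Tensor, path: str) -> torch.Tensor:
    # Create a CPU tensor whose storage is backed by a memory-mapped file.
    # With shared=True the file is created and sized automatically if needed.
    mapped = torch.from_file(
        path,
        shared=True,
        size=gpu_tensor.numel(),
        dtype=gpu_tensor.dtype,
    ).reshape(gpu_tensor.shape)
    mapped.copy_(gpu_tensor)  # one copy from GPU into the mapped pages
    return mapped

if torch.cuda.is_available():
    w = torch.randn(1024, 1024, device="cuda")
    w_cpu = offload_to_mmap(w, "offload_w.bin")  # placeholder path
    del w                       # the GPU copy can now be freed
    w_back = w_cpu.to("cuda")   # reload works like any CPU tensor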

asagi4 commented Dec 12, 2025

Would it not be better to just drop the memory and reload tensors from the model file on disk that you already have? This looks like a manual implementation of a swap file; I don't think it's really helpful to write things to the disk. If you're going to do that, it will be easier and probably more efficient to just add swap space.

I would like it if ComfyUI had the ability to drop tensors from memory when RAM runs low and reload them on demand, but if it involves disk writes, it's not going to be more efficient than OS swapping.

I guess if you could load a checkpoint using mmap in a way that allows you to "return" mmapped memory to the disk (ideally letting the OS decide whether it's actually in memory or on disk), that would solve the problem pretty neatly.
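
To make that concrete, a rough sketch, assuming a checkpoint saved with torch.save (zipfile format) and PyTorch 2.1 or newer; ComfyUI's real loading path is different, and the path below is a placeholder.

import torch

# Map the checkpoint instead of reading it into anonymous memory: clean pages
# can be evicted by the OS under memory pressure and faulted back in from the
# original file, with no extra disk writes.
state_dict = torch.load("model.ckpt", map_location="cpu", mmap=True)

# Moving a tensor to the GPU just reads the relevant pages; the CPU-side copy
# never has to be written anywhere on unload.
if torch.cuda.is_available():
    first = next(iter(state_dict.values()))
    first_gpu = first.to("cuda")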

@rattus128

Agree. This is the same behaviour as swap. I actually have some work-in-progress for what @asagi4 describes by just going back to the model file in this scenario.

jovan2009 commented Dec 12, 2025

Agree. This is the same behaviour as swap. I actually have some work-in-progress for what @asagi4 describes by just going back to the model file in this scenario.

If I may interject. There is also the scenario in which the model safetensors are on a disk slower than the one holding the swap file. For example, I keep the swap file on an NVMe SSD while the model is on a SATA SSD. I prefer to reload the data from the swap file (NVMe speed) rather than from the model (SATA speed). Not to mention that keeping all the data in "RAM" (meaning actual RAM + swap) allows Windows, I think, to apply memory compression where it can (for example, my compressed area reaches about 17+ GB in Wan 2.2 workflows).

rattus128 commented Dec 12, 2025

Agree. This is the same behaviour as swap. I actually have some work-in-progress for what @asagi4 describes by just going back to the model file in this scenario.

If I may interject. There is also the scenario in which the model safetensors are on a disk slower than the one holding the swap file. For example, I keep the swap file on an NVMe SSD while the model is on a SATA SSD. I prefer to reload the data from the swap file (NVMe speed) rather than from the model (SATA speed). Not to mention that keeping all the data in "RAM" (meaning actual RAM + swap) allows Windows, I think, to apply memory compression where it can (for example, my compressed area reaches about 17+ GB in Wan 2.2 workflows).

Yeah, I've thought about this case too. I think this setup is the exception rather than the rule, though. The majority of users will have either no swap or a same-disk swap, and they have a lot to gain from ditching that write-on-unload completely. Note that even if this change were made, your use case could still be handled by configuring your NVMe as a vanilla disk cache on top of your model library.

@jovan2009

Note that even if this change were made, your use case could still be handled by configuring your NVMe as a vanilla disk cache on top of your model library.

This is something I don't know how to do. Can it be done now, or will it be a startup argument in the future?

Anyway, even if I can assign a ComfyUI model cache to the fastest drive, it will still be a cache separate from the Windows swap. Maybe I'm wrong or stubborn, but I would still like to have the option to let Windows manage its "continuous" RAM + swap space if the future modifications don't prove beneficial for my case. I bought the NVMe for the single purpose of making it a sort of "RAM expansion"; it was the best I could think of after maxing out what my motherboard can accommodate (64 GB of RAM).

RandomGitUser321 commented Dec 12, 2025

Just use --cache-ram x.x. It defaults to 4.0, I think. Say you have 32 GB of RAM: with 4.0, if offloading were going to push usage past 28 GB, ComfyUI dumps cached items by priority and then simply reloads them if they're needed again. What you're trying to implement, like others have stated, is essentially yet another page file. That means more wear on SSDs and would likely be much slower than just reloading the model, even counting all the initial processing of loading the model weights.
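
As a usage sketch based on the description above (the value being the headroom to keep free, in GB):

python main.py --cache-ram 4.0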

I bought the NVMe for the single purpose of making it a sort of "RAM expansion"

And you'll see your drive's health decline rapidly when you're constantly writing tens of gigabytes.

@strint strint marked this pull request as draft December 15, 2025 06:58
strint commented Dec 15, 2025

Would it not be better to just drop the memory and reload tensors from the model file on disk that you already have?

ComfyUI’s unload and load operations work on the parameters and buffers of an nn.Module, and typically only a subset of these tensors is unloaded. Releasing and restoring selected tensors from disk is complex and affects many parts of the existing unload/load implementation.

In contrast, a tensor backed by mmap storage behaves like a regular CPU tensor and can be moved to CPU or CUDA easily using the standard Tensor.to operation. This simplicity is the primary reason for using mmap-backed tensors.
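
As a toy illustration of that point (not the PR's code; offload_dir and should_offload are made-up names for the sketch), selected parameters of an nn.Module could be swapped for mmap-backed copies and later moved back through the same Tensor.to path:

import os
import torch
import torch.nn as nn

def offload_selected(module: nn.Module, offload_dir: str, should_offload):
    # Replace only the chosen parameters with mmap-backed CPU copies.
    os.makedirs(offload_dir, exist_ok=True)
    for name, param in module.named_parameters():
        if not should_offload(name, param):
            continue
        path = os.path.join(offload_dir, name.replace(".", "_") + ".bin")
        mapped = torch.from_file(path, shared=True,
                                 size=param.numel(), dtype=param.dtype)
        mapped = mapped.reshape(param.shape)
        mapped.copy_(param.data)   # write the weights into the mapped file
        param.data = mapped        # the module now references the mapped copy

# Reloading a parameter later is an ordinary device move:
#     param.data = param.data.to("cuda")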

This looks like a manual implementation of a swap file; I don't think it's really helpful to write things to the disk. If you're going to do that, it will be easier and probably more efficient to just add swap space.

You are right that both mmap and swap operate on VM pages, but swap is managed by the OS and only handles anonymous memory.

In ComfyUI, most model weights are file-backed or explicitly managed tensors. By unloading them manually into a dedicated mmap file, we can control what gets evicted and when, instead of relying on global OS heuristics across all processes.

This becomes important because CPU OOM often crashes the entire ComfyUI process. The mmap offload provides a predictable, controlled way to prevent such crashes, which swap alone may not reliably avoid in these scenarios.

When CPU RAM approaches its limit, a GPU tensor will be offloaded to a dedicated mmap file on disk instead of causing an OOM in CPU or GPU memory.

@asagi4 @rattus128

strint commented Dec 15, 2025

Agree. This is the same behaviour as swap. I actually have some work-in-progress for what @asagi4 describes by just going back to the model file in this scenario.

Going back to the original model file is indeed the most straightforward approach for full unloads.

However, when only part of the model's parameters or buffers needs to be unloaded, wouldn't selectively loading them back from the model file become complicated?

@strint strint marked this pull request as ready for review December 15, 2025 10:48