Fix VRAM OOM for model upscaling flows #10684
Closed
This fixes a particular VRAM OOM that affects a range of workflows, in particular flows that re-use a model for upscaling.
See below for the root cause and fix.
Example test case:
upscale_oom.json
WAN 128x128x181f > x8 upscale (1024x1024x181f) > Same WAN model
RTX5090
With this fix:
Git commit message:
In some workflows, it's possible for a model to be used twice but with different inference VRAM requirements.
Currently, once a model is loaded at a certain level of offload, it is kept at that level of offload if it is used again. This will OOM if there is a major change in the size of the inference VRAM, which happens in the classic latent upscaling workflow where the same model is used twice, once to generate and once to upscale.
This is very noticeable for WAN in particular.
Fix by making the model VRAM unload a two-pass process: first try with the existing list of idle models, then try again adding the models that are about to be loaded. This implements the partial offload of the hot-in-VRAM model that is needed to make space for the larger inference (see the sketch below).
Also improve the info messages about any unloads that are performed.
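For illustration, here is a minimal sketch of the two-pass idea under stated assumptions: the names `LoadedModel`, `partially_unload`, and `free_vram_for_load` are hypothetical stand-ins, not ComfyUI's actual model-management API.

```python
# Hypothetical sketch of the two-pass VRAM-freeing approach described in the
# commit message. Names and signatures here are illustrative assumptions.

class LoadedModel:
    """Toy stand-in for a model resident in VRAM."""
    def __init__(self, name, vram_bytes):
        self.name = name
        self.vram_bytes = vram_bytes

    def partially_unload(self, max_bytes):
        # Move up to max_bytes of weights off the GPU; return how much was freed.
        freed = min(self.vram_bytes, max_bytes)
        self.vram_bytes -= freed
        return freed


def _free_pass(needed_bytes, candidates):
    """Offload weights from candidate models until needed_bytes is freed."""
    freed = 0
    for model in candidates:
        if freed >= needed_bytes:
            break
        freed += model.partially_unload(needed_bytes - freed)
    return freed


def free_vram_for_load(required_bytes, idle_models, models_to_load):
    """Two-pass free: idle models first, then partially offload hot models."""
    # Pass 1: only touch models that are not about to be used.
    freed = _free_pass(required_bytes, idle_models)
    if freed >= required_bytes:
        return freed
    # Pass 2: still short, so also partially offload the models about to be
    # loaded, letting a model kept hot in VRAM shrink to fit a larger
    # inference footprint (e.g. the x8-upscaled latents in the test case).
    freed += _free_pass(required_bytes - freed, models_to_load)
    return freed
```

The second pass is what makes the upscale case work: the same WAN model stays loaded, but enough of its weights are offloaded to leave room for the much larger inference activations.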