Description
What API design would you like to have changed or added to the library? Why?
Most people expect `diffusers` and `transformers` models to be unloadable from the GPU on demand, so that a pipeline too big for their VRAM can still run end to end.
In other words: author a mixin that tracks all Hugging Face hierarchy objects whose weights are loaded onto the GPU, and, whenever `forward` is called on any tracked object, moves the weights of the other tracked objects to ordinary RAM. Essentially, this is sequential CPU offload at a scope larger than a single Hugging Face hierarchy object.
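A minimal sketch of what such a mixin could look like, assuming plain PyTorch forward pre-hooks; `GroupOffloadMixin`, `track`, and the hook body are hypothetical names for illustration, not existing `diffusers` API:

```python
import torch


class GroupOffloadMixin:
    """Tracks a group of modules so that only one of them holds its
    weights on the GPU at a time; the rest are parked in CPU RAM."""

    _tracked: list[torch.nn.Module] = []

    @classmethod
    def track(cls, module: torch.nn.Module, device: str = "cuda") -> None:
        cls._tracked.append(module)

        def offload_siblings(mod, args, kwargs):
            # Evict every other tracked module to ordinary RAM ...
            for other in cls._tracked:
                if other is not mod:
                    other.to("cpu")
            # ... then load the active module and its inputs onto the GPU.
            mod.to(device)
            args = tuple(a.to(device) if torch.is_tensor(a) else a
                         for a in args)
            kwargs = {k: v.to(device) if torch.is_tensor(v) else v
                      for k, v in kwargs.items()}
            return args, kwargs

        # with_kwargs=True so keyword tensor inputs are moved as well.
        module.register_forward_pre_hook(offload_siblings, with_kwargs=True)
```

A production version would also have to deal with modules that share parameters, pinned memory for faster transfers, and restoring the original device placement, but the core mechanism fits in a forward pre-hook.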
What use case would this enable or better enable? Can you give us a code example?
The number of issues filed about GPU RAM usage scales linearly with adoption, and triaging them is a constant drain on the maintainers.
Separately, it would eliminate the main source of toil for people who integrate `diffusers` into other products such as ComfyUI.
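To make the intended behavior concrete, here is a hypothetical usage of the `GroupOffloadMixin` sketched above on two toy modules (requires a CUDA device); in a real pipeline one would track components such as `pipe.text_encoder`, `pipe.unet`, and `pipe.vae` instead:

```python
import torch

encoder = torch.nn.Linear(4096, 4096)
decoder = torch.nn.Linear(4096, 4096)
GroupOffloadMixin.track(encoder)
GroupOffloadMixin.track(decoder)

x = torch.randn(1, 4096)
h = encoder(x)  # encoder's weights move to the GPU; decoder stays in RAM
y = decoder(h)  # decoder moves to the GPU; encoder is evicted to RAM

# Only one module occupies VRAM at any point during the two calls.
print(next(encoder.parameters()).device)  # cpu
print(next(decoder.parameters()).device)  # cuda:0
```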