Description
What API design would you like to have changed or added to the library? Why?
Most people expect `diffusers` and `transformers` models to be unloadable from the GPU on demand, so that a pipeline too big for their VRAM can still run end to end.
In other words: author a mixin that tracks all Hugging Face hierarchy objects whose weights are loaded onto the GPU, and, whenever `forward` is called on any tracked object, moves the weights of the other tracked objects to ordinary RAM. Essentially, this is sequential CPU offload at a scope larger than a single Hugging Face hierarchy object.
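A minimal sketch of what such a mixin could look like, assuming plain PyTorch forward pre-hooks; `GroupOffloadMixin`, `track`, and the hook body are hypothetical names for illustration, not existing `diffusers` API:

```python
import torch


class GroupOffloadMixin:
    """Tracks a group of modules so that only one of them holds its
    weights on the GPU at a time; the rest are parked in CPU RAM."""

    _tracked: list[torch.nn.Module] = []

    @classmethod
    def track(cls, module: torch.nn.Module, device: str = "cuda") -> None:
        cls._tracked.append(module)

        def offload_siblings(mod, args, kwargs):
            # Evict every other tracked module to ordinary RAM ...
            for other in cls._tracked:
                if other is not mod:
                    other.to("cpu")
            # ... then load the active module and its inputs onto the GPU.
            mod.to(device)
            args = tuple(a.to(device) if torch.is_tensor(a) else a
                         for a in args)
            kwargs = {k: v.to(device) if torch.is_tensor(v) else v
                      for k, v in kwargs.items()}
            return args, kwargs

        # with_kwargs=True so keyword tensor inputs are moved as well.
        module.register_forward_pre_hook(offload_siblings, with_kwargs=True)
```

A production version would also have to deal with modules that share parameters, pinned memory for faster transfers, and restoring the original device placement, but the core mechanism fits in a forward pre-hook.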
What use case would this enable or better enable? Can you give us a code example?
The number of issues filed about GPU RAM usage scales linearly with adoption, and triaging them is a constant drain on the maintainers.
Separately, it would eliminate the main source of toil for people who integrate `diffusers` into other products such as ComfyUI.
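To make the intended behavior concrete, here is a hypothetical usage of the `GroupOffloadMixin` sketched above on two toy modules (requires a CUDA device); in a real pipeline one would track components such as `pipe.text_encoder`, `pipe.unet`, and `pipe.vae` instead:

```python
import torch

encoder = torch.nn.Linear(4096, 4096)
decoder = torch.nn.Linear(4096, 4096)
GroupOffloadMixin.track(encoder)
GroupOffloadMixin.track(decoder)

x = torch.randn(1, 4096)
h = encoder(x)  # encoder's weights move to the GPU; decoder stays in RAM
y = decoder(h)  # decoder moves to the GPU; encoder is evicted to RAM

# Only one module occupies VRAM at any point during the two calls.
print(next(encoder.parameters()).device)  # cpu
print(next(decoder.parameters()).device)  # cuda:0
```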