
global, eager model weight GPU unloading #8605

Open
@doctorpangloss

Description


What API design would you like to have changed or added to the library? Why?

Most people expect diffusers and transformers models to be unloaded from the GPU automatically, so that a big pipeline "just runs" and fits within their available VRAM.

In other words: author a mixin that tracks all weights in Hugging Face hierarchy objects loaded onto the GPU; when forward is called on any tracked object, it moves the weights of every other tracked object to ordinary RAM. Essentially, this is sequential CPU offload at a scope larger than a single Hugging Face hierarchy object.
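A minimal sketch of the proposed behavior, assuming plain `torch.nn.Module` components; the `GlobalOffloadRegistry` name and its API are hypothetical illustrations, not existing diffusers API:

```python
import torch
from torch import nn


class GlobalOffloadRegistry:
    """Tracks modules and eagerly offloads all but the active one to CPU."""

    def __init__(self, device: torch.device):
        self.device = device
        self.modules: list[nn.Module] = []

    def register(self, module: nn.Module) -> nn.Module:
        self.modules.append(module)
        # Before any tracked module runs forward, evict every other
        # tracked module's weights to ordinary RAM.
        module.register_forward_pre_hook(self._make_hook(module))
        return module

    def _make_hook(self, active: nn.Module):
        def hook(module, args):
            for other in self.modules:
                if other is not active:
                    other.to("cpu")
            # No-op if the active module's weights are already on the GPU.
            active.to(self.device)
            return None

        return hook


# Hypothetical usage: register each pipeline component once; forward calls
# then swap weights between GPU and CPU automatically.
registry = GlobalOffloadRegistry(torch.device("cuda"))
# text_encoder = registry.register(text_encoder)
# unet = registry.register(unet)
# vae = registry.register(vae)
```

The difference from `enable_sequential_cpu_offload` is scope: the registry is global and cross-pipeline, so components held by different pipelines (or outside any pipeline) evict each other rather than only their siblings.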

What use case would this enable or better enable? Can you give us a code example?

The number of issues about GPU memory usage scales linearly with adoption; triaging them is a constant burden on the maintainers and pollutes the Issues tracker.

Separately, it would eliminate the main source of toil for people who integrate diffusers into other products like ComfyUI.
