Support for Stable Diffusion 3.5 Large #2574

Open
super-fun-surf opened this issue Oct 25, 2024 · 7 comments

@super-fun-surf

I tried updating the hf repo to 3.5 Large but it's not working.

Error: cannot find tensor text_encoders.clip_l.transformer.text_model.embeddings.token_embedding.weight
@LaurentMazare
Collaborator

See #2578

@super-fun-surf
Author

This is working on an A100, but it takes too much memory for an RTX 4000 with 20GB.

I see there is a quantized gguf
https://huggingface.co/city96/stable-diffusion-3.5-large-gguf

Is it currently possible to use this GGUF-quantized model?
Also, is it possible to use safetensors-style quantized models?

Thanks.
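
For what it's worth, here is a rough sketch of how a GGUF file can be opened and inspected with candle's quantized loader, assuming the candle_core::quantized::gguf_file API; the file path is a placeholder and nothing here is wired into the SD3 pipeline, so treat it as a starting point rather than working SD 3.5 support:

use candle_core::quantized::gguf_file;
use candle_core::Device;

fn main() -> candle_core::Result<()> {
    // Placeholder path to a GGUF checkpoint such as the city96 SD 3.5 files.
    let path = "sd3.5_large-Q4_0.gguf";
    let mut file = std::fs::File::open(path)?;

    // Read the GGUF header: metadata plus the list of quantized tensors.
    let content = gguf_file::Content::read(&mut file)?;
    for (name, info) in content.tensor_infos.iter() {
        println!("{name}: {:?} {:?}", info.shape, info.ggml_dtype);
    }

    // Individual tensors can be loaded as QTensor and dequantized on demand.
    let device = Device::Cpu;
    if let Some((name, _)) = content.tensor_infos.iter().next() {
        let qtensor = content.tensor(&mut file, name, &device)?;
        println!("{name} dequantized to {:?}", qtensor.dequantize(&device)?.dims());
    }
    Ok(())
}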

@super-fun-surf
Author

I see in the SD3 readme there is a benchmark run on an RTX 3090 Ti. How much memory does that card have? It seems like 3.5 takes 40+ GB to run in candle...

@LaurentMazare
Collaborator

I made a few tweaks in #2581 and #2582, and with those it seems to use 20.9GB of memory. FWIW, a 3090 Ti has 24GB, so it should run there.

@super-fun-surf
Author

super-fun-surf commented Oct 30, 2024

Great work.
It's running great on the A100.
There seems to be a memory spike when it loads the T5 into F32, which pushes it over the limit for the 20GB RTX 4000 on my desktop.
I got the nsys profiler working and I am learning how to track this a bit.
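
In case it helps for tracking the spike, a typical invocation that records a CUDA memory timeline with nsys looks roughly like this (the binary name and arguments are placeholders for however you run the SD 3.5 example):

nsys profile --stats=true --cuda-memory-usage=true -o sd35-trace ./your-sd35-binary --prompt "..."
nsys stats sd35-trace.nsys-rep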

@LaurentMazare
Collaborator

I've pushed some further changes in #2589 so that the f32 conversion is done on the fly rather than upfront, so that we can benefit from the reduced memory usage while retaining full precision. After this, the memory usage I get from nsys during the text encoding step is down to ~10.5GB. That said, I still see the memory usage getting to ~20GB while running the mmdit, so it's not that likely to fit on a 20GB GPU.
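
To illustrate the idea with a minimal sketch (not the actual code from #2589): the weights stay in F16 on the device and are only widened to F32 inside the forward pass, so a full-precision copy of the whole model never has to exist at once.

use candle_core::{DType, Module, Result, Tensor};

// Hypothetical linear layer whose weight is kept in F16.
struct OnTheFlyLinear {
    weight: Tensor, // stored as F16 on the device
}

impl Module for OnTheFlyLinear {
    fn forward(&self, xs: &Tensor) -> Result<Tensor> {
        // Widen to F32 only for the duration of this matmul,
        // instead of converting every weight upfront.
        let w = self.weight.to_dtype(DType::F32)?;
        xs.to_dtype(DType::F32)?.matmul(&w.t()?)
    }
}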

@super-fun-surf
Author

Amazing how much space is saved with the F16 on the T5. And it's only using 17GB during sampling!
And I'm happy to report that it is working on the RTX 4000 with 20GB.
Sampling done. 28 steps. 79.60s. Average rate: 0.35 iter/s
It's also working on M3, though very slowly. The same image takes around 20 minutes on an M3 with 36GB.
Rad work! Thanks.
