
Compatible with 8GB VRAM? #17

Open
MarsEverythingTech opened this issue Dec 4, 2024 · 22 comments

@MarsEverythingTech

Hello,

I have an RTX 3070 Ti GPU with 8GB of VRAM, and I'm wondering if it can run this.

Thanks in advance.

@Omarsmsm

Omarsmsm commented Dec 4, 2024

I doubt it, but the Q2 GGUF that someone will drop in a few weeks will run on an electric toothbrush soon enough.

@kijai
Owner

kijai commented Dec 4, 2024

If you have enough RAM, maybe with the block_swap feature and very low resolution and short clips. The model can produce a working clip even at something like 336x192.
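(For anyone wondering what block swapping does in practice, here is a minimal, hypothetical PyTorch sketch of the general idea; the class and parameters below are illustrative and are not the wrapper's actual implementation.)

```python
# Hypothetical sketch of the block-swap idea (illustration only, not the wrapper's real code):
# keep the transformer blocks in system RAM and move each one to the GPU only for its
# forward pass, so peak VRAM holds roughly one block plus activations.
import torch
import torch.nn as nn

class BlockSwapRunner(nn.Module):
    def __init__(self, blocks: nn.ModuleList, device: str):
        super().__init__()
        self.blocks = blocks        # stays on CPU between uses
        self.device = device

    def forward(self, x):
        x = x.to(self.device)
        for block in self.blocks:
            block.to(self.device)   # swap the block into VRAM
            x = block(x)
            block.to("cpu")         # swap it back out before the next block
        return x

device = "cuda" if torch.cuda.is_available() else "cpu"
blocks = nn.ModuleList(
    nn.TransformerEncoderLayer(d_model=128, nhead=4, batch_first=True) for _ in range(8)
)
out = BlockSwapRunner(blocks, device)(torch.randn(1, 16, 128))
```

The repeated host-to-device copies are why this trades speed for memory.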

@MarsEverythingTech
Author

If you have enough RAM, maybe with the block_swap feature and very low resolution and short clips. The model can produce a working clip even at something like 336x192.

I have 16GB RAM.

@kijai
Owner

kijai commented Dec 5, 2024

If you have enough RAM, maybe with the block_swap feature and very low resolution and short clips. The model can produce a working clip even at something like 336x192.

I have 16GB RAM.

I'm afraid that's not really going to cut it then.

@nitinmukesh

nitinmukesh commented Dec 5, 2024

Hello Kijai,

I am getting an OOM error while using this workflow. Is there a setting in the workflow, or does it have to do with ComfyUI?

ComfyUI is not using shared memory and gives an OOM once VRAM is full. How do I make it use shared memory?

@dasilva333

dasilva333 commented Dec 6, 2024

Hey guys, I am here with all the answers. Let me just say upfront that yes, 8GB will 100% work with Hunyuan video. A few notes about compatibility:
@MarsEverythingTech @nitinmukesh @Omarsmsm @kijai

  • The last version that works for me is this commit:
    e834402
  • If Kijai wants, I can step forward one commit at a time, retesting to find which commit broke compatibility.
  • Basically, you know you're on the right version if the TextEncoder node (not the load one) has the extra fields for the CLIP model (clip_prompt and clip_negative_prompt).
  • On my 8GB GPU, HunyuanVideo supports 'full' resolution for only 13 frames, that is 512x320 @ 13 frames.
  • If you're willing to go smaller, I can get 45 frames at 256x160; set that to 15 fps and I get a decent 3 seconds out of it.
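(Quick arithmetic on those clip lengths, for anyone converting between frames and seconds: duration is simply frame count divided by playback fps.)

```python
# Clip duration = frames / playback fps, for the configurations mentioned in this thread.
for frames, fps in [(45, 15), (53, 12)]:
    print(f"{frames} frames @ {fps} fps -> {frames / fps:.1f} s")
# 45 frames @ 15 fps -> 3.0 s
# 53 frames @ 12 fps -> 4.4 s
```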

My settings are as follows:
[model node]
fp8 model
fp16 precision
offload_device
cpu offload true
flash_attn

[load text encoder node]
fp16

[sampler node]
256 width
160 height
45 frames
50 steps
4 guidance
force offload true

[vae loader]
bf16 model with fp16 precision

My 4080 can generate a video in about 420-450 seconds, which is not bad. I've even seen cases where it gets down to the 300s, but I'm talking averages. Also, it goes without saying, but please make sure you're starting ComfyUI in lowvram mode by editing the .bat file to include the flag --lowvram.
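(For context, on a portable Windows install that usually means editing run_nvidia_gpu.bat so the launch line ends with the flag, e.g. something like `.\python_embeded\python.exe -s ComfyUI\main.py --windows-standalone-build --lowvram`; exact file names and paths vary per install, so treat this as a sketch rather than the canonical line.)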

@nitinmukesh

@dasilva333 Thank you.

@kozer

kozer commented Dec 7, 2024

@dasilva333 how much RAM does the model use in your case? I have 32GB, but it runs out of memory trying to load llama2-textencoder. Do you have more than 32GB of RAM? (Not VRAM, but RAM.)

@JamesIV4

JamesIV4 commented Dec 7, 2024

Hey guys, I am here with all the answers. Let me just say upfront that yes, 8GB will 100% work with Hunyuan video.

@dasilva333 Which ComfyUI workflow are you using?

I got it working. I had to turn off "--lowvram" to get it to work on my 12 GB 2060, using the commit hash you mentioned and an older ComfyUI workflow from around when you made your post. I think the main difference is the "load device" setting: it's set to "main_device" in the older workflows, and in the newer ones it's on "offload_device". I think that's it at least; I need to test more.

@RhaoG

RhaoG commented Dec 8, 2024

Thanks @dasilva333.

Can confirm this works on a 4070 8GB VRAM laptop with 64GB RAM on the latest version:

commit 6655642 (HEAD -> main, origin/main, origin/HEAD)
Author: kijai <40791699+kijai@users.noreply.github.com>
Date: Fri Dec 6 18:58:02 2024 +0200

With the following settings:

vae on fp16
model on fp16 base precision, fp8 quantization, main device load device, sageattn
text encoder on fp16 precision, quantization nf4

160x256, 45 frames, 50 steps takes a bit less than 400 secs.

Still testing the main_device part on the model - seems like this needs switching in between generations (or not?).

Also, when it comes to decode, you kind of have to press queue again to let it proceed.

@dasilva333

@dasilva333 how much RAM does the model use in your case? I have 32GB, but it runs out of memory trying to load llama2-textencoder. Do you have more than 32GB of RAM? (Not VRAM, but RAM.)

I have 32GB of RAM. Like I said, you MUST use the exact commit provided in my message; that is, if you do a pull you must then run an additional command to check out the specific version provided. In my own experience I also saw it run out of memory loading the LLM with the newer versions.

@JamesIV4 Yeah, I'm using the older workflow that was part of the commit at that time. I'm also going to test it myself, because I'm using offload_device with the older commit, and when set to main_device it doesn't work for me.

@RhaoG
Great, I'm glad it worked for you too. It's important to use fp16 over bf16, because I observed a drastic reduction in quality when using bf16.

Fun side fact: with further testing since my last post, I managed to squeeze 4 seconds out of my GPU, that is 160x256 @ 53 frames & 12 fps and 50 steps.

@dasilva333

dasilva333 commented Dec 8, 2024

I just did a git pull to see if the latest version is still broken, and I can happily confirm that the latest version works perfectly. Many thanks to @kijai and his amazing work.

For the record, his latest commit at the time of this post is this one: 9a4abbc

My settings are as follows:
[model loader]
fp8 model
fp16 precision
offload_device
flash_attn_varlen

[textencoder loader]
fp16
2
bnb_nf4 (it also worked set to disabled)

[sampler]
256 width
160 height
53 frames
50 steps
6 guidance
9 flow shift

[vae loader]
bf16 model
fp16 precision

[text encode node]
offload_model - true

On the old commit it usually took about 8 to 9 minutes to make the 53 frames; on the new commit it's now taking 48 minutes, which seems like an extraordinarily long time.

If @kijai wants to suggest any settings to try instead of the ones provided above, I could take some time to try them out. I might just end up going back to the old commit, as I see no benefit in staying on the latest one given there are no new features and significantly slower performance. I'll keep testing and provide updates if I can get it back to the faster performance of the older commits.

Update:
Setting quant to Disabled works better and faster for me. I also added the block swap args into the input of the model loader with the default settings of 20/0.
Now I'm getting 7-8 minute videos again using 53 frames/50 steps, but it's not as coherent as before. I wonder what the flow shift was set to internally before it was exposed as a field; I'm going to try changing it from 9 to 3 to see if it improves coherence like the older commit did.

Update 2:
Setting it from offload_device to main_device causes ComfyUI to crash 100% of the time. I'm also observing 140-160 second times for subsequent runs.
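(On the flow shift question above: I can't say what the old internal default was, but in flow-matching samplers the shift value typically remaps the sigma/timestep schedule; the formula below is the commonly used one and is an assumption here, not confirmed to be exactly what the wrapper does.)

```python
# Assumed flow-matching timestep shift: sigma' = shift * sigma / (1 + (shift - 1) * sigma).
# A higher shift pushes more of the schedule toward the high-noise end of sampling.
def shift_sigma(sigma: float, shift: float) -> float:
    return shift * sigma / (1 + (shift - 1) * sigma)

schedule = [i / 10 for i in range(11)]                 # uniform 0..1 schedule
for shift in (3.0, 9.0):
    print(f"shift={shift}:", [round(shift_sigma(s, shift), 2) for s in schedule])
```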

@JamesIV4

JamesIV4 commented Dec 9, 2024

@dasilva333 that's great news! I thought the recent commits looked similar. Thanks for confirming!

@RhaoG

RhaoG commented Dec 9, 2024

29 frames at 512x360 works with the block swap arg (output coherence is kind of questionable, though).
13 frames at 512x360 with STG-R (compatible with sageattention) works with the block swap arg and torch compile.

  • Getting longer outputs now with the added nodes.
  • STG-R improves quality but I guess it takes up some memory too, hence it doesn't run with the 29 frames (on model offload_device).
  • Also not sure about coherence problems with block swap args and increased frames.

Update:
STG-R on 29 frames worked with main_device (instead of offload_device).

@kijai
Owner

kijai commented Dec 9, 2024


Block swapping etc. doesn't affect the output quality itself, just the sampling speed and memory use. STG does increase memory use and slow down the process; this is why I introduced the start/end percent setting for it. Usually running it for just a few steps gives most of the benefit without slowing the whole thing down too much. I don't know what the best block to choose is yet, though.

Resolution and frame count definitely affect the motion quality as well; the model is much better the higher you can go. Other than that, it's mostly about the prompt and, honestly, just luck with the seed.
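(To illustrate the point about the start/end percent setting, here is a hypothetical sketch of the gating idea; the function and parameter names are made up and are not the actual node's code.)

```python
# Hypothetical sketch: only run the extra STG pass on steps whose position in the
# schedule falls inside the [start_percent, end_percent] window, so most steps
# keep the normal speed and memory use.
def stg_active(step: int, total_steps: int, start_percent: float, end_percent: float) -> bool:
    progress = step / max(total_steps - 1, 1)
    return start_percent <= progress <= end_percent

total = 50
active_steps = [s for s in range(total) if stg_active(s, total, 0.0, 0.2)]
print(active_steps)  # with start=0.0, end=0.2: only the first ~10 of 50 steps pay the STG cost
```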

@kozer

kozer commented Dec 9, 2024


For me, unfortunately, it is not working... both the latest and the previous one. It seems that either clip-vit-large-patch14 or llava-llama-3-8b-text-encoder-tokenizer doesn't fit in my RAM. And the fun thing is that I have a 4090. :D
I'm using Linux, btw. Does anyone use something else?
cc: @dasilva333, any proposals? I have tried your settings etc., with no luck.

@kijai
Owner

kijai commented Dec 9, 2024


It does take a lot of RAM. Did you try using the bitsandbytes NF4 quant for the text encoder? Also, in the latest updates I added the ability to use the Comfy CLIP model instead of the original, which saves some memory as well; to use that, you'd disable clip_l in the text encoder loader and plug in the normal Comfy CLIP loader with the clip_l model selected.
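(For anyone curious what the NF4 text encoder quantization amounts to outside the node graph, here is a rough sketch using plain transformers + bitsandbytes; the repo id and loader class are assumptions for illustration, since the wrapper handles this through its own loader node.)

```python
# Sketch of loading an LLM text encoder in 4-bit NF4 via bitsandbytes; the wrapper's
# "bnb_nf4" option does something along these lines internally (assumption).
import torch
from transformers import AutoModel, BitsAndBytesConfig

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",           # normal-float 4-bit
    bnb_4bit_compute_dtype=torch.float16,
)

text_encoder = AutoModel.from_pretrained(
    "Kijai/llava-llama-3-8b-text-encoder-tokenizer",  # repo named in this thread; treat the exact id as an assumption
    quantization_config=bnb_config,
    device_map="auto",                   # lets layers spill to CPU RAM if VRAM runs out
)
```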

@RhaoG

RhaoG commented Dec 9, 2024

Surprisingly, with the swap blocks I was also able to get 544 by 960 resolution to run with a short prompt, still at 29 frames with STG. OOM seems so random with this model. (About 959 secs.)

@JamesIV4

JamesIV4 commented Dec 9, 2024

@kijai I got everything working except for OOM issues on the latest version. I did not have OOM issues on commit e834402.

I have 16 GB of RAM and 12 GB of VRAM.

I noticed a couple of things on the latest commit:

  • Load device set to "offload_device" will only fill up shared memory for the GPU (which is 8 GB on my PC), leading to a very quick OOM. The 12 GB of dedicated memory isn't used at all.
  • Load device set to "main_device" fills the dedicated VRAM first and then dips into the shared memory.
  • However, once inference starts (the steps are beginning), memory usage drops off and then starts filling up ONLY the shared memory again, leading to another OOM.

This is with nf4 on, sageattention on, and torch compiler settings on, and only running a 160x256 video generation for 17 frames. That should definitely be doable. (Note my screenshot doesn't show the torch compiler connected, but I tried many times with it on too).

I think the code might need to be tweaked to make sure inference loads on the dedicated VRAM first and then goes to the shared memory.

Screenshot of that last scenario here. Note the Task Manager showing the shared memory usage is the only one filling up when the steps started:
[Screenshot 2024-12-09 162401]
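(If it helps with debugging cases like this, here is a small torch-only helper for checking what PyTorch has actually allocated on the dedicated card, independent of Task Manager's shared-memory graph; it is generic PyTorch, not part of the wrapper.)

```python
# Report dedicated-VRAM usage as PyTorch sees it, to verify where the weights landed.
import torch

def vram_report(device: int = 0) -> str:
    props = torch.cuda.get_device_properties(device)
    gib = 2**30
    return (f"{props.name}: "
            f"{torch.cuda.memory_allocated(device) / gib:.2f} GiB allocated, "
            f"{torch.cuda.memory_reserved(device) / gib:.2f} GiB reserved, "
            f"{props.total_memory / gib:.2f} GiB total")

if torch.cuda.is_available():
    print(vram_report())
```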

@dasilva333

@RhaoG do you mind posting the workflow where you got STG to work? I'd love to try it; I can't seem to get any coherent results at all with the latest version.

@Johnreidsilver

The current champion of low VRAM is LTX Video. I can run the example workflow with STG on a 6GB VRAM laptop 3060 and 32GB RAM. I don't know how they compare in final quality, but higher resolution is doable at the same VRAM levels.

@RhaoG

RhaoG commented Dec 10, 2024

@RhaoG do you mind posting the workflow where you got STG to work? I'd love to try it; I can't seem to get any coherent results at all with the latest version.

workflow

@dasilva333 Here you go. I don't understand the torch compile node, though.
