Compatible with 8GB VRAM? #17
Comments
I doubt it, but the Q2 GGUF that someone will drop in a few weeks will run on an electric toothbrush soon enough |
If you have enough RAM, maybe with the block_swap feature and very low resolution and short clips. The model can produce a working clip even at something like 336x192 |
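(For context, "block_swap" here means offloading part of the transformer to system RAM and paging blocks onto the GPU only while they run, which is why RAM matters. A minimal sketch of the idea, with illustrative names only, not the wrapper's actual API:)

```python
import torch
import torch.nn as nn

def forward_with_block_swap(blocks: nn.ModuleList, x: torch.Tensor,
                            blocks_to_swap: int, device: str = "cuda") -> torch.Tensor:
    """Run a stack of transformer blocks while keeping the first
    `blocks_to_swap` of them in system RAM between uses, so only a
    subset of the weights occupies VRAM at any moment."""
    for i, block in enumerate(blocks):
        swapped = i < blocks_to_swap
        if swapped:
            block.to(device)        # move this block's weights to VRAM just in time
        x = block(x)
        if swapped:
            block.to("cpu")         # free that block's VRAM again
            torch.cuda.empty_cache()
    return x
```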
I have 16GB RAM. |
I'm afraid that's not really going to cut it then. |
Hello Kijai, I am getting an OOM error while using this workflow. Is there a setting in the workflow for this, or does it have to do with ComfyUI? ComfyUI is not using shared memory and gives OOM once VRAM is full. How can I make it use shared memory? |
Hey guys, I am here with all the answers. Let me just say upfront that yes, 8GB will 100% work with Hunyuan video. A few notes about compatibility:
My settings are as follows: [load text encoder node] [sampler node] [vae loader] My 4080 can generate a video in about 420-450 seconds, which is not bad. I've even seen cases where it gets down to the 300s, but I'm talking averages. Also, it goes without saying, but please make sure you're starting ComfyUI in lowvram mode by editing the .bat file to include the flag --lowvram |
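(For reference, on the Windows portable build that usually means adding the flag to the launch line of the .bat file you start ComfyUI with; the exact file name and existing arguments vary per install, so treat this as a sketch:)

```bat
.\python_embeded\python.exe -s ComfyUI\main.py --windows-standalone-build --lowvram
pause
```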
@dasilva333 Thank you. |
dasilva333, how much RAM does the model use in your case? I have 32GB but it runs out of memory trying to load the llama2 text encoder. Do you have more than 32GB of RAM? (Not VRAM, but RAM) |
@dasilva333 Which ComfyUI workflow are you using? I got it working; I had to turn off "--lowvram" to get it to work on my 12 GB 2060, using the commit hash you mentioned and an older ComfyUI workflow from around when you made your post. I think the main difference is the "load device" setting: it's set to "main_device" on the older workflows, and on the newer ones it's on "offload_device". I think that's it, at least; I need to test more. |
Thanks @dasilva333, can confirm this works on a 4070 8GB VRAM laptop with 64 GB RAM on the latest version, with the following settings: VAE on fp16, 160x256, 45 frames, 50 steps takes less than 400 secs. Still testing the main_device part on the model - seems like this needs switching in between generations (or not?). Also, when it comes to decode, you kinda have to press queue again to let it proceed. |
I have 32GB of RAM. Like I said, you MUST use the exact commit provided in my message; that is, if you do a pull you must then run an additional command to check out the specific version provided. I think I also saw in my personal experience that it runs out of memory loading the LLM with the newer versions. @JamesIV4 Yeah, I'm using the older workflow that was part of the commit at that time. I'm also going to test it myself, because I'm using offload_device with the older commit, and when set to main_device it doesn't work for me. @RhaoG Fun side fact: with further testing since my last post I managed to squeeze 4 seconds out of my GPU, that is 160x256 @ 53 frames & 12 fps and 50 steps |
I just did a git pull on the latest commit to see if the latest version is still broken, and I can happily confirm the latest version works perfectly. Many thanks to @kijai and his amazing work. For the record, his latest commit at the time of this post is this one: 9a4abbc. My settings are as follows: [textencoder loader] [sampler] [vae loader] [text encode node] On the old commit it usually took about 8 to 9 minutes to make the 53 frames. If @kijai wants to suggest any settings to try instead of the ones provided above, I could take some time to try them out. I might just end up going back to the old commit, as I see no benefit in staying on the latest one given there are no new features and significantly slower performance. I'll keep testing it and provide any updates if I can get it to return to the faster performance of the older commits. Update: Update 2: |
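(For anyone unsure about that step: checking out a specific commit after a pull generally looks like the following. The hash is a placeholder for the one given in the earlier post, and the path depends on where the custom node is installed:)

```shell
cd ComfyUI/custom_nodes/ComfyUI-HunyuanVideoWrapper
git fetch
git checkout <commit-hash>      # placeholder; use the hash from the earlier post
# later, to return to the newest code:
# git checkout main && git pull
```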
@dasilva333 that's great news! I thought the recent commits looked similar. Thanks for confirming! |
29 frames at 512x360 works with the block swap arg. (output coherence is kind of questionable though)
Update: |
Block swapping etc. doesn't affect the output quality itself, just the sampling speed and memory use. STG does increase memory use and slow down the process; this is why I introduced the start/end percent setting for it. Usually running it for just a few steps can give most of the benefit without slowing the whole thing down too much. I don't know what the best block to choose is yet, though. Resolution and frame count definitely affect the motion quality as well; the model is much better the higher you can go. Other than that, it's mostly about the prompt and, honestly, just luck with the seed. |
For me, unfortunately, it is not working... both the latest and the previous one. It seems that either |
It does take a lot of RAM, did you try using the bitsandbytes NF4 quant for the text encoder? Also, in the latest updates I added the ability to use the comfy clip model instead of the original; that saves some memory as well. To use that, you'd disable clip_l in the text encoder loader and plug in the normal comfy clip loader with the clip_l model selected. |
Surprisingly, with the swap blocks I was also able to get 544 by 960 resolution to run with a short prompt and still at 29 frames with STG. OOM seems so random with this model. (about 959 secs) |
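(A rough sketch of what a start/end percent window does during sampling: the extra STG pass is only run while the current step falls inside that fraction of the schedule. Names like `denoise_step` are hypothetical here, not the wrapper's real functions:)

```python
def sample_with_stg_window(model, latents, sigmas,
                           stg_start: float = 0.0, stg_end: float = 0.3):
    """Enable the slower, more memory-hungry STG pass only for the
    steps whose progress lies in [stg_start, stg_end]."""
    total_steps = len(sigmas) - 1
    for step in range(total_steps):
        progress = step / max(total_steps - 1, 1)
        use_stg = stg_start <= progress <= stg_end    # gate the extra pass
        latents = model.denoise_step(latents, sigmas[step],
                                     apply_stg=use_stg)  # hypothetical API
    return latents
```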
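(For anyone wondering what the NF4 route looks like in plain transformers/bitsandbytes terms, here is a sketch; the model path is a placeholder and this is not the wrapper's own loader code:)

```python
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",             # 4-bit NormalFloat quantization
    bnb_4bit_compute_dtype=torch.float16,
)

# Placeholder path; the wrapper ships its own LLM text-encoder loader.
text_encoder = AutoModelForCausalLM.from_pretrained(
    "path/to/llm-text-encoder",
    quantization_config=bnb_config,
    device_map="auto",
)
```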
@kijai I got everything working except for OOM issues on the latest version. I did not have OOM issues on commit e834402. I have 16 GB of RAM and 12 GB of VRAM. I noticed a couple of things on the latest commit:
This is with nf4 on, sageattention on, and torch compile settings on, and only running a 160x256 video generation for 17 frames. That should definitely be doable. (Note my screenshot doesn't show the torch compiler connected, but I tried many times with it on too.) I think the code might need to be tweaked to make sure inference loads into dedicated VRAM first and only then spills into shared memory. Screenshot of that last scenario here. Note the Task Manager shows that shared memory usage is the only one filling up once the steps start: |
@RhaoG do you mind posting your workflow where you got STG to work? I'd love to try it, I can't seem to get any coherent prompts at all with the latest version |
The current champion of low VRAM is LTX video. I can run the example workflow with STG on a 6GB VRAM laptop 3060 and 32GB RAM. I don't know how they compare in final quality, but higher resolution is doable for the same VRAM levels. |
@dasilva333 here you go. I don't understand the torch compile node though |
Hello,
I have an RTX 3070 Ti 8GB VRAM GPU, and am wondering if it can run it.
Thanks in advance.