
Re-enable CFG? #119

Open
tdrussell opened this issue Dec 11, 2024 · 9 comments

@tdrussell (Contributor)

I uncommented some of the code for CFG and negative prompts and have been playing around with it. This model works just fine with CFG as long as embedded_guidance_scale is 1 or a low value. Both in general use and especially with LoRAs, it can improve prompt adherence. I think there's no reason to keep CFG disabled. The model works like Flux dev: you can use CFG as long as the settings are right.

Also, with CFG and negative prompting, it would be nice to have an option to compute the positive and negative noise predictions separately rather than batched together. It might take a bit longer, but I think it wouldn't use more VRAM that way. This could be done later, though.
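To make the trade-off concrete, here is a minimal sketch of batched vs. sequential CFG noise prediction. All names are hypothetical (this is not the repo's actual sampling code), and `model` stands in for the transformer forward pass; plain floats are used instead of tensors to keep the arithmetic visible.

```python
# Hypothetical sketch: batched vs. sequential classifier-free guidance.
# `model(latents, embeddings)` returns one noise prediction per batch item.

def cfg_pred_batched(model, latent, cond_emb, uncond_emb, scale):
    # One forward over a doubled batch: faster wall-clock time,
    # but roughly double the peak activation memory.
    pos, neg = model([latent, latent], [cond_emb, uncond_emb])
    # Standard CFG combination of the two predictions.
    return neg + scale * (pos - neg)

def cfg_pred_sequential(model, latent, cond_emb, uncond_emb, scale):
    # Two forwards at batch size 1: slower, but peak VRAM stays
    # at the same level as sampling without CFG.
    pos = model([latent], [cond_emb])[0]
    neg = model([latent], [uncond_emb])[0]
    return neg + scale * (pos - neg)
```

Both paths produce identical results; only the speed/memory trade-off differs, which is why an option to choose between them could be useful.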

@kijai (Owner)

kijai commented Dec 11, 2024

I did try it at the start (mostly by accident) and the results were underwhelming then, and it's twice as slow to sample. There are two different model configs, and this one is loaded as cfg-distilled by default. So while you can use CFG, the model isn't really meant to be used that way, which is why their demo also uses cfg 1.0 and why it's disabled by default.

What I'm now wondering is how does the LLM deal with negative prompts anyway? There's a default template that wouldn't make much sense for negatives, are you using the video prompt template, something custom or none at all?

Of course optional node wouldn't hurt anyone (besides me for having to maintain that whole sampling path).

@tdrussell (Contributor, Author)

I'm using the video template. I think it makes sense for negatives, since the negative noise prediction process is exactly the same as the positive one, just with a different prompt. I think the reason for the templates in the first place is to "prime" the text encoder. It's an instruction-tuned model, so with that template prefix the LLM "thinks" it's predicting text describing an image or video in detail, and the hidden states therefore include richer information that makes predicting the next word easier. Those hidden states then become the text embeddings. That's why they do it like that.
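The priming idea can be illustrated with a tiny sketch. Note the template text and function name here are made up for illustration; the repo's actual template differs. The point is only that the same prefix wraps both positive and negative prompts before the LLM encoder runs, so negatives get identical treatment.

```python
# Hypothetical illustration of template "priming": the same instruction
# prefix is prepended to every prompt before the LLM text encoder runs.
VIDEO_TEMPLATE = "Describe the video in detail: {prompt}"  # placeholder text

def build_encoder_input(prompt: str) -> str:
    # Both positive and negative prompts go through the same template,
    # so the negative embedding is produced under the same "priming".
    return VIDEO_TEMPLATE.format(prompt=prompt)

positive_in = build_encoder_input("a cat playing piano")
negative_in = build_encoder_input("blurry, low quality")
```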

Also, I would point out that when CFG=1 the code doesn't compute negative noise predictions at all, so it's just as fast. I'm seeing much better results with LoRAs in some cases with something like CFG=4 and embedded_guidance_scale=2, so I personally would like the option. I could always keep the changes locally, but others might want it too. I do understand it might be a bit confusing for users if the default is CFG=1 but there's still a negative prompt text box; I don't know if there's a good way to address that.
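The CFG=1 shortcut can be sketched roughly like this (all names hypothetical; `guidance` stands for the embedded guidance input that the distilled model takes on every call, separate from the CFG scale applied outside the model):

```python
# Hypothetical sketch: combining embedded guidance with optional CFG.
# `model(latent, embedding, guidance)` is a stand-in forward pass.

def guided_pred(model, latent, pos_emb, neg_emb, cfg, embedded_guidance):
    # Embedded guidance is fed into the distilled model itself.
    pos = model(latent, pos_emb, embedded_guidance)
    if cfg == 1.0:
        # Negative pass skipped entirely: same cost as sampling without CFG.
        return pos
    neg = model(latent, neg_emb, embedded_guidance)
    # External classifier-free guidance on top of the embedded guidance.
    return neg + cfg * (pos - neg)
```

With cfg=1.0 the early return makes the two code paths identical in cost, which is the point being made above.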

@kijai (Owner)

kijai commented Dec 11, 2024

Considering the amount of issues and confusion already, that's exactly why I wanted to simplify it in the end. But it could be implemented as optional extra node or something.

Did you try STG though? I didn't really find the best settings for it, but some people posted quite promising comparisons. It has the same speed issue, so I added timestep scheduling for it so that it only runs for a few steps, and that seemed to give most of the benefit. An additional issue with it is increased memory use, so ultimately most people would need to run it with block_swapping, which then also should be scheduled...

Cfg scheduling might also make sense then.

I do understand your point about LoRAs.
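The timestep-scheduling idea mentioned above, applied to CFG as well, could look roughly like this (names and defaults are hypothetical, not the repo's actual implementation):

```python
# Hypothetical sketch: only pay for the extra guidance pass inside a
# chosen window of the denoising schedule; outside it, return 1.0,
# which (per the CFG=1 shortcut) skips the second forward entirely.

def effective_scale(step, total_steps, scale, start_pct=0.0, end_pct=0.3):
    pct = step / total_steps
    return scale if start_pct <= pct <= end_pct else 1.0
```

With, say, end_pct=0.3 on a 20-step schedule, only the first ~6 steps pay the doubled cost, which matches the observation that running guidance for just a few steps keeps most of the benefit.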

@tdrussell (Contributor, Author)

Making it two nodes sounds like it would help with the confusion. You could make them share a common flow in the code (there are very few differences between CFG and non-CFG). I haven't tried STG yet, but I'll give it a shot.

For what it's worth, what I'm seeing with LoRAs and CFG I've also seen with Flux dev. Prompt adherence and the perceived "strength" of the LoRA are much better with CFG, but with Flux you also have to use one of those anti-burn techniques. Luckily, here you don't need anything special; for some reason HunyuanVideo burns much less for me with CFG than Flux does.

@sd2530615
What do STG-A and STG-R do, though?

@4lt3r3go
4lt3r3go commented Dec 12, 2024

I really appreciate this CFG topic and the "making it two nodes" option. Please don't implement this directly in the main node, which would slow the whole thing down; better to have it as an optional feature that can be loaded separately.

@kijai (Owner)

kijai commented Dec 12, 2024

> I really appreciate this CFG topic and the "making it two nodes" option. Please don't implement this directly in the main node, which would slow the whole thing down; better to have it as an optional feature (like in Flux) that can be loaded separately.

What I'm thinking of is just making it extra node such as STG is now, wouldn't change anything to the user who doesn't use it.

@sd2530615
Does CFG work now? I mean, isn't CFG being distilled? I saw you adding back the CFG option.

@4lt3r3go
4lt3r3go commented Dec 14, 2024

[image: screenshot of the settings referred to below]

I have tested this new module a lot. Although I'm not entirely sure it's meant for the negative-prompt purpose, I assume that's the use, even though it's not directly specified.
What I've discovered is that with these values (see image) and a very low final percentage, the additional inference time is negligible, since the effect stops almost immediately at 0.01. Yet the effect is still noticeably visible (at least when generating small videos, which is what I do to quickly test the model before doing vid2vid at higher resolutions). Another thing I've noticed is a sort of bloom/haze effect when the module is active; everything looks a bit blurred and tends towards the light side. I haven't figured out yet whether this depends on the settings or on what is written in the CFG window.
