
Conversation

@naripok commented Nov 30, 2024

Hey!

I've trained a network for use with Flux1.D.

I trained it using the pseudo-camera-10k dataset for 37k steps, at fp16, resolution 512, and batch_size 1 (so it fits on my RTX 3090).

It works reasonably well:
[image]

Since I don't know exactly which split of the COCO dataset you used for validation, I can't add the stats to the README.

I also have a bunch of questions, e.g.: why did you use scaling_factor = 0.13025 everywhere? Is it critical?

I've noticed that my upscaled images are a bit blurrier than the originals before upscaling (without sampling, just vae encode -> upscale -> vae decode). Compared with the results of your SDXL network, the edges are a bit blurrier, but the textures are a bit better. I'm not sure if the network is undertrained or what. Do you have any ideas?
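For reference, the round trip I'm testing is roughly this (a diffusers-style sketch; `resizer` stands in for my trained model, and the names are illustrative, not this repo's actual API):

```python
import torch
from diffusers import AutoencoderKL

# Illustrative round trip: encode -> NN upscale -> decode, no sampling.
vae = AutoencoderKL.from_pretrained("black-forest-labs/FLUX.1-dev", subfolder="vae")

with torch.no_grad():
    latent = vae.encode(image).latent_dist.sample()  # image: (1, 3, H, W) in [-1, 1]
    upscaled = resizer(latent)                       # hypothetical 2x latent resizer
    out = vae.decode(upscaled).sample                # comes out slightly blurrier
```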

Note: I know there are unnecessary changes in the PR (formatting and whatnot). I wasn't expecting it to work this well, so I didn't take a lot of care while messing with it. If you want to take the contribution, I can clean it up before merging.

So, let me know what you think and thank you very much for the node and training code!

Cheers!

@Ttl (Owner) commented Dec 1, 2024

I'm not opposed to merging this, but the model should be good enough to be useful. There seem to be quite large artifacts in the resized image. How much denoising is needed for the second stage to get rid of them?

Scaling factor is for normalizing the latent standard deviation, it's not strictly necessary.
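In other words, the latent is just multiplied to have a standard deviation close to 1 before the resizer sees it, and divided back afterwards. Roughly like this (a sketch of the convention, not the exact code in this repo; Flux's config additionally has a shift_factor that I'm omitting here):

```python
# Scale latents toward unit std before the resizer, unscale before decoding.
# 0.13025 is SDXL's factor; the Flux VAE config uses 0.3611.
latent = vae.encode(image).latent_dist.sample() * scaling_factor  # std ~ 1
upscaled = resizer(latent)                                # resizer sees normalized latents
image_out = vae.decode(upscaled / scaling_factor).sample  # undo before decoding
```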

The SDXL resizer's quality was lower than SD1.5's because the VAE is trained more and compresses more. I guess the same thing happens with Flux: it might be difficult for this small model to resize the latent effectively.

I have a slightly better model locally. I didn't update it here because I didn't want to version the models. I pushed some training changes to the dev branch. The biggest change was making the model predict the original latent from the downsampled latent instead of the opposite. There were also some other loss-function changes; I don't think those were as important.
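The super-resolution direction looks roughly like this (a sketch of the idea, not the exact dev branch code):

```python
import torch.nn.functional as F

# Old direction: predict the downsampled latent from the full-size one.
# New direction: downsample first and train the model to reconstruct the
# original latent, i.e. classic super-resolution in latent space.
low = F.interpolate(latent, scale_factor=0.5, mode="bilinear", antialias=True)
pred = model(low)                # model upsamples back to the original size
loss = F.mse_loss(pred, latent)  # target is the original full-resolution latent
```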

@naripok (Author) commented Dec 3, 2024

Hey, thank you for your answer! And sorry for the long delay on my side, I'm quite swamped with work these days. 🫠

> There seem to be quite large artifacts in the resized image.

Actually, I don't think it's that bad. The model was a bit undertrained, and after training it some more on the COCO dataset it has improved further. But even before that, it wasn't too bad.

My previous screenshot wasn't very clear, so maybe you were looking at the image showing upscaling with bislerp?

This is how it looks now:
[image]

You can see it adds some blurriness (the scaled image is the left side in the image comparer node), but it doesn't destroy the image like scaling with vanilla interpolation does.

It's easier to see in a vid, I suppose:

[video: shrooms.mp4]

> How much denoising is needed for the second stage to get rid of them?

Actually, I'm having a bit of a problem with this, and I have no idea why; I was hoping you might have an answer. I can't get rid of the blurriness by doing a second pass with low denoise... It makes no sense to me, but that's what happens.

If I upscale via interpolation and use a denoise of ~0.6 for 10 steps (euler, beta), I get a sharp image, albeit a bit changed. But if I scale via the NN and then try to denoise in the second pass, the output still comes out blurry, even with a high denoise value (~0.6).

I have tried injecting noise back into the latent, with the hypothesis that without it the denoiser would have nothing to work with... But then I just end up with a noisy blurry image... Even using stuff like detail daemon, lying sigma sampler, or whatever only gives me blurry, noisy images...
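What I tried is roughly this (a sketch, not my exact workflow; rectified-flow style re-noising):

```python
import torch

# Re-noise the NN-upscaled latent to the level matching the denoise strength,
# hoping the second pass has something to work with (it didn't help).
t = 0.6  # denoise strength
noise = torch.randn_like(upscaled_latent)
noised = (1.0 - t) * upscaled_latent + t * noise  # flow-matching interpolation
```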

Obviously, if I crank up the denoise, it will change the whole image and give me a sharp image back. But then there's no point...

Do you have any idea at all why this is happening? I'm up for running some tests if you need more information. This is driving me crazy lol

> Scaling factor is for normalizing the latent standard deviation, it's not strictly necessary.

Ok, cool. In my training I used the value found in the Flux VAE's config, 0.3611. I wasn't sure why, but it seemed like a reasonable thing to do.
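For reference, it's the `scaling_factor` field in the VAE config:

```python
from diffusers import AutoencoderKL

vae = AutoencoderKL.from_pretrained("black-forest-labs/FLUX.1-dev", subfolder="vae")
print(vae.config.scaling_factor)  # 0.3611
```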

> The SDXL resizer's quality was lower than SD1.5's because the VAE is trained more and compresses more. I guess the same thing happens with Flux: it might be difficult for this small model to resize the latent effectively.

Would you recommend increasing the model's parameter count? I suppose we'd just need some more data to scale training along with it? 🤔

> I pushed some training changes to the dev branch.

Nice! Thanks for letting me know. I'll try training a new model with the changes and see if the results improve.

@Ttl (Owner) commented Dec 4, 2024

Slight blurring is probably unavoidable with this NN architecture. It would be easier for the neural net if it only had to learn one or just a few scaling factors. Super-resolution training (predicting the high-resolution latent from a downscaled one) improves it slightly, but doesn't completely get rid of it.

In my opinion, the difficulty in getting rid of the blurriness comes from the diffusion model predicting the low-frequency, large-scale structure first and then refining the details in later steps. If the latent is still blurry by the middle or late steps, the model won't get rid of the blur anymore.
