[Feature Request] Add new way to manage timestep distribution and new loss offset feature #1375
This looks like it could pair really well with an approach that I'm working on and am about to publish (the short version: "synchronize noise std/mean to the underlying model"), which also dramatically improves convergence speed. I'm going to see if I can fold your approach into mine, because I have a theory that mine will flatten the curves you're seeing there, which could mean the two approaches together produce even faster convergence. Thanks!
We've actually been following your development of the new loss approach and its weighting for the last couple of days; we'd love to see this incorporated with your really fancy tech! :D
Can you elaborate a bit on the underlying theory of operation of your approach? A quick read-through seemed to suggest that you're just tracking losses per timestep, then weighting timestep selection towards higher-loss timesteps over the course of the run. Do I have that roughly correct?
Yes, that's the mechanism. We initialize a Loss Map that contains each timestep and its loss. We initialize it with an equal loss of 1 for each, but that can be changed to emulate different starting distributions, like lognorm(0,1). Then we adjust that distribution by a small margin, dampening high-chance timesteps and boosting low-chance ones, so they still have a meaningful chance of being sampled once in a while. That's the mechanism, so you are correct.
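For illustration, a minimal sketch of the mechanism as described (the names and the exact update rule are assumptions, not the repo's actual code; the update rate is tunable):

```python
import torch

NUM_TIMESTEPS = 1000
UPDATE_RATE = 0.1  # assumed EMA-style dampening factor; tunable

# Loss map: one running loss value per timestep, initialized uniformly to 1.
# Initializing with another curve (e.g. lognorm(0, 1)) emulates a different
# starting distribution.
loss_map = torch.ones(NUM_TIMESTEPS)

def sample_timesteps(batch_size: int) -> torch.Tensor:
    # Higher recorded loss -> higher sampling probability; normalizing keeps
    # every timestep at a small but nonzero chance of being drawn.
    probs = loss_map / loss_map.sum()
    return torch.multinomial(probs, batch_size, replacement=True)

def update_loss_map(timesteps: torch.Tensor, losses: torch.Tensor) -> None:
    # Soft update: nudge each sampled timestep's entry toward the newly
    # observed loss, dampening spikes at high-chance timesteps while slowly
    # boosting neglected ones.
    for t, l in zip(timesteps.tolist(), losses.tolist()):
        loss_map[t] = (1 - UPDATE_RATE) * loss_map[t] + UPDATE_RATE * l
```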
Different datasets led to meaningfully different graphs, so I think a dynamic approach is currently the best way to handle this.
Are you using min_snr in your examples there? Based on my experiments, I would not expect that sharp downward trend in loss observations in the earliest timesteps, unless you're using a mechanism that is artificially dampening it. Loss should be highest in the earliest timesteps (particularly for SD1.5), as that's where the standard deviation between true noise and predicted noise is highest. You might be able to simplify things by just using a Gaussian kernel over the raw observations, which would give you a configurable bandwidth for tuning the resolution/smoothness of the histogram.
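A sketch of that Gaussian-kernel idea (names are assumed; `raw_losses` stands in for a per-timestep vector of observed losses), with the bandwidth as the smoothing knob:

```python
import torch
import torch.nn.functional as F

def gaussian_smooth(raw_losses: torch.Tensor, bandwidth: float = 25.0) -> torch.Tensor:
    # bandwidth (sigma, in timesteps) trades resolution for smoothness of
    # the resulting histogram.
    radius = int(3 * bandwidth)
    x = torch.arange(-radius, radius + 1, dtype=torch.float32)
    kernel = torch.exp(-0.5 * (x / bandwidth) ** 2)
    kernel /= kernel.sum()
    # conv1d expects (batch, channels, length). Note that zero ("same")
    # padding slightly dampens the bins near timesteps 0 and 999.
    return F.conv1d(raw_losses.view(1, 1, -1), kernel.view(1, 1, -1),
                    padding="same").view(-1)
```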
I do indeed use a min_snr of 8 in training. I don't really want to drop it, though in hindsight it would be better to record the loss before min_snr is applied. I'm not sure simplifying would be the goal here, since I wasn't aiming for a smooth timestep distribution, but rather for a balance where we can still see meaningful differences in sampling chance between relatively close timesteps, if they reliably produce higher loss. As a note, when I was testing with some amount of Huber loss mixed in, the trend was more even. I should move the saving of per-timestep loss to before min_snr and see how it behaves.
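A sketch of that ordering, reusing the hypothetical `update_loss_map` from above (`snr_by_ts` is an assumed precomputed per-timestep SNR table; the weight is the standard min-SNR-gamma formula for epsilon prediction, not necessarily how the trainer implements it):

```python
import torch
import torch.nn.functional as F

def loss_with_min_snr(model_pred, target, timesteps, snr_by_ts, gamma=8.0):
    # Per-sample MSE over 4D latents, reduced over all but the batch dim.
    per_sample = F.mse_loss(model_pred, target, reduction="none").mean(dim=(1, 2, 3))
    # Record the *raw* loss into the timestep loss map first, so the map
    # reflects what the model actually produces at each timestep...
    update_loss_map(timesteps, per_sample.detach())
    # ...then apply the min-SNR weight, min(SNR, gamma)/SNR, only to the
    # loss that is backpropagated.
    snr = snr_by_ts[timesteps]
    weight = torch.clamp(snr, max=gamma) / snr
    return (per_sample * weight).mean()
```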
Was this an SD1.5 model you generated the graphs on? Absent the min_snr depression, this looks an awful lot like my observed std() distributions from my latest post in the loss thread, which makes me wonder if I could just use something like:

```python
# Turn the per-timestep std measurements into sampling probabilities
# (higher std -> lower probability) and draw timesteps from them.
prob = torch.nn.functional.softmax(-std_by_ts.mean(dim=1).reshape(-1), dim=0)
cat = torch.distributions.Categorical(probs=prob.float())
timesteps = cat.sample([b_size]).to(device=latents.device)
```

From a quick first test, it seems VERY promising. I'll run some more comprehensive tests, but it might be that the std discrepancy accounts for enough of the overall loss that we could just sample timesteps with it directly.

Edit: This is my result on my SD1.5 test harness after only 20 epochs. I'd normally need to get to 40-50 epochs to get this kind of result before. This isn't dynamically learning the loss map like you are, but it might jump right to a good-enough approximation using the std measurements from the underlying model. It's worth noting that if you're using a mechanism like this to compensate for the loss distribution imbalance, min_snr becomes useless-to-harmful. My example there is without any kind of min_snr, noise offset, or otherwise, and it seems to be working splendidly.
Fair enough! Suffice it to say that both model quality and fidelity are way better than in previous approaches. I'll need to generate stats for PonyV6. It's a very different model from most and is probably going to exhibit some characteristics not found in other models. My SD1.5 result is conclusive enough that I think there's something here very worth pursuing!
As you said, SDXL behaves very differently. In fact, we came to the conclusion that "SDXL is very rigid": even with extreme values it didn't reach unusable results the way SD1.5 did. Still, we found that this approach yielded better and faster results; some things even started to gain more detail when using a lower LoRA learning rate than what we used to use with PonyV6. Sadly, we couldn't test fine-tuning much, since we only have x090-class hardware, but results on rented A6000 and A100 cards showed the same tendency of "being better". We lack a lot of knowledge regarding ML in general, so hopefully whatever you cook up with this plus your noise shenanigans ends up being something very, very nice for the community.
I think it's the way you spread the loss weights: values at the edges only receive updates from one side. Maybe use a 1D convolution with padding set to "replicate".
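Concretely, the suggestion amounts to padding the loss map with copies of its edge values before convolving, so timesteps 0 and 999 receive full-weight updates (a sketch with assumed names; `kernel` is any odd-length smoothing kernel):

```python
import torch
import torch.nn.functional as F

def smooth_with_replicate(loss_map: torch.Tensor, kernel: torch.Tensor) -> torch.Tensor:
    radius = kernel.numel() // 2
    # "replicate" repeats the boundary values, so the edge bins are averaged
    # against themselves; "circular" padding would instead wrap 999 around
    # to be updated by 0 and vice versa.
    padded = F.pad(loss_map.view(1, 1, -1), (radius, radius), mode="replicate")
    return F.conv1d(padded, kernel.view(1, 1, -1)).view(-1)
```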
You mean let 999 be updated by 1 and vice versa? Might be an interesting idea. Tests from some people suggest that setting the curve from the get-go, instead of letting it learn from uniform, leads to worse performance in small training runs, so I think there is some grounding that needs to happen by learning the earlier timesteps, and this would likely slightly increase hits on them.
We have been working for some time on a way to create a timestep distribution specific to the particular dataset used in training.
We tested it over the last couple of months using the Derrian Easy Scripts trainer with hardcoded changes, but we're not familiar enough with the sd-scripts structure to fully integrate it as a toggleable feature on our own at this point in time, hence this feature request.
https://github.com/Anzhc/Timestep-Attention-and-other-shenanigans
Here we provide the code for the functions, structured roughly correctly for use with sd-scripts variables. Examples of how we implemented it in a hardcoded way are also included in the repo (see the "Kohya file examples" folder).
It builds a distribution based on the loss values received over the course of training, so the resulting curve is unique to each dataset.
In the majority of cases it improved convergence speed and the quality of the outcome in our tests.
We also provide code for a Loss Offset Curve feature that works with said distribution and can be scaled and scheduled; it is also supposed to improve convergence speed.
Some experimental loss functions are included too, but those are really just experiments.