Everything you know about loss is a LIE! #294
-
Could it be that the loss is actually "loss per timestep", or maybe the average of the loss for each timestep? I think this would be different from accumulating the loss across all timesteps (not sure if what I said even makes sense).
-
@swfsql it has to do with the signal-to-noise ratio: low timesteps have very little noise added to them, so when the trainer takes a sample (a single step down the timestep chain, IIUC) and predicts the noise, it gets a much larger error. There's a lot more math involved, but basically you can think of it as 5/4 being a lot bigger than 500/499. There's some more info here: #308, because Min-SNR is designed precisely to counteract some of this high loss from low timesteps.
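For reference, a rough sketch of what that looks like in code - the per-timestep SNR from the scheduler's alphas_cumprod, and the Min-SNR-gamma weight used to tame the low-timestep loss. The function and argument names here are illustrative, not sd-scripts' exact API:

```python
import torch
import torch.nn.functional as F

def min_snr_weighted_loss(noise_pred, noise, timesteps, alphas_cumprod, gamma=5.0):
    """Per-sample MSE weighted by min(SNR, gamma) / SNR (epsilon-prediction form)."""
    # SNR(t) = alpha_bar_t / (1 - alpha_bar_t): almost no noise at low t => huge SNR.
    snr = alphas_cumprod[timesteps] / (1.0 - alphas_cumprod[timesteps])
    weights = snr.clamp(max=gamma) / snr  # Min-SNR-gamma: cap the influence of low timesteps
    loss = F.mse_loss(noise_pred.float(), noise.float(), reduction="none")
    return (loss.mean(dim=(1, 2, 3)) * weights).mean()
```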
-
I took this idea and tinkered a bit. I set the timestep to a fixed value of (1000 - global_step), and it produced a very clean U-curve, bottoming out at around min_timestep 400. Observationally, setting the timestep range higher does a better job of absorbing "broad" details, but misses the fine details. Once the timestep dropped under 400, the results got rapidly garbled and nonsensical; much worse than just normal overtraining. Taking this information, I wondered how training would be affected if the lower timesteps weren't selected as frequently. I reimplemented the random timestep generation to use a clipped standard normal distribution, where -6 sigma maps to min_timestep and +6 sigma maps to max_timestep (sigma = 6 in the snippet below), so the mean of the distribution lands at the middle of the range:
timesteps = ((torch.randn((b_size,), device=latents.device).clip(-sigma, sigma) + sigma) / (2*sigma)) * (max_timestep - min_timestep) + min_timestep

Here's a comparison of loss between the standard implementation and my normal distribution (teal is the modified routine). In both cases, MinSNR=5.

Observationally, knocking out the lower timesteps results in significantly faster improvements to the samples over time. Setting min_timestep too high causes the model to not learn the finer details of the subject, so a balance is needed. A 6-sigma normal distribution using [100..1000] as my range (which should give an average timestep of 550) results in the model learning significantly faster - in the dataset I'm working on, I typically get pretty decent results after 1500-2000 steps, but with this change I've been seeing it approach the same level of fidelity by ~400-500 steps, with significantly less overtraining "damage" to the underlying model.

Here's a comparison of one of my training images, the standard timestep selection routine (min/max range of [0..1000]), and my timestep selection routine with bounds set at [100..1000]. The two generated images were produced after 500 steps of training. All other parameters besides the timestep range and random generation routine were held constant.

For what it's worth, I'm using the Prodigy optimizer with a CosineAnnealingLR scheduler; I suppose tests should also be run with the more standard Adam8bit and a constant learning rate, but the results were a significant enough improvement that I felt the observation bore sharing. I have no theoretical basis for any of this; I'm just experimenting and found that this had a massive impact.

Intuitively, one thing that might be worth trying is some kind of combination of global step, learning rate, and timestep range scheduling - tilting the timestep range higher early on or when the learning rate is higher, and then reducing the lower bound of the range over time - to see if it can balance learning the finer details without overtraining on all that extra error from the low timesteps.
-
It looks like #889 (and the linked paper) may have addressed this problem as well. I'll be running some tests, but if it's really that simple, then all the better!
-
I think I might have stumbled into something extraordinary, and want to throw this out there to get other brains on it.

One of my observations in my experiments is that forward-noising applies noise as noisy_latents = sqrt(alphas_cumprod[t]) * latents + sqrt(1 - alphas_cumprod[t]) * noise, and the sum of those two coefficients peaks in the middle of the timestep range. This shape has shown up in other experiments (you'll notice it in my post above in the loss explorations), which has made me wonder if there's a fundamental bias in the forward noising mechanism - there will be much more total information (sample + noise) in the latent at step 400 than there is at steps 10 or 800. The pixel values will always have their largest magnitude at the ~400ish step peak. In the abstract, maybe this doesn't matter, because the unet should learn to predict noise regardless of the function used to generate the noise, so long as that function is differentiable, right? But what if it's actually a source of bias in training?

At first I thought that this wouldn't change things that much, but that we might be able to debias it by dividing both noise and target by that coefficient sum, sqrt(alphas_cumprod) + sqrt(1 - alphas_cumprod):

if args.apply_noise_compensation:
    noise_comp = (noise_scheduler.alphas_cumprod.sqrt() + (1 - noise_scheduler.alphas_cumprod).sqrt()).to(device=accelerator.device)
    noise_pred = noise_pred / noise_comp[timesteps].view(-1, 1, 1, 1)  # .sqrt() -- reshaped so the per-sample factor broadcasts over (C, H, W)
loss = torch.nn.functional.mse_loss(noise_pred.float(), target.float(), reduction="none")

My first samples were blurry. REALLY blurry. But they "felt" much more like my subject than samples at the same step were with the standard routine. Furthermore, as the epochs progressed, the samples retained significantly more of my subject's ineffable quality, but they sharpened up! The longer the training ran, the clearer my samples became. Around epoch 30 or so, they were essentially back to full resolution, but without the mutations, distortions, or burnout that is characteristic of overtraining in that range, and the samples had astonishingly preserved most of the "shape" from the underlying model. This smells, in principle, much like what the

When I then pulled the LoRA into SD and tried it against various models other than the one I trained it on (Realistic Vision) and in combination with other LoRAs, it feels like it generalizes FAR better than anything I've accomplished before. This felt like a fundamentally superior output in terms of fidelity AND flexibility.

Training details: AdamW8bit (modified to use the AdaBelief term), constant LR of 8e-6.

What I'm now wondering is a) why does this work? and b) whether we can bias the model towards "level of detail" by intentionally selecting the curve by which noise_pred is modified before loss calculation. What I'm envisioning is some kind of "10-band EQ" where you could drag the learning rate of certain LODs up or down.

Here's a video of my training run over 39 epochs: interpolated.mp4

As you can see, it learns the "large outline" of my features early on, but then improves incrementally on the level of detail in the image WITHOUT losing those "large features". In much of my previous experimentation, I've found that both large and small features got learned at the same time, and that frequently I'd end up with models where either there was too much "large detail" (and the shape of the outputs got distorted and nonsensical) or there wasn't enough "small detail" (and the fidelity of the subject didn't feel right). This feels like it sidestepped that issue entirely and gave me an unexpectedly flexible, high-fidelity output. I would appreciate any insight into what might be happening here, and how it might be understood to improve the ability to better control training.
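To make the "10-band EQ" idea concrete, here's a toy sketch of one way it could look - a per-band gain applied to each sample's loss based on its timestep. Nothing here is implemented anywhere yet; the band count and gain values are just placeholders:

```python
import torch

def band_weighted_loss(per_sample_loss, timesteps, band_gains, num_train_timesteps=1000):
    """Scale each sample's loss by the gain of the timestep band it falls into.

    per_sample_loss: [B]; timesteps: [B] ints in [0, num_train_timesteps); band_gains: [num_bands].
    """
    num_bands = band_gains.shape[0]
    band_idx = (timesteps * num_bands) // num_train_timesteps  # which "EQ slider" each sample falls under
    return per_sample_loss * band_gains.to(per_sample_loss.device)[band_idx]

# Example: damp the lowest-timestep band, boost the middle bands slightly.
gains = torch.tensor([0.25, 0.75, 1.0, 1.25, 1.25, 1.25, 1.0, 1.0, 1.0, 1.0])
```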
-
Hi @cheald, hope you are well. Thank you for the info about the timesteps.
-
Hi @cheald, I have also been experimenting with this problem, and I believe I have the true ideal solution, which should end the need to tune timestep weightings.

The traditional solution to this would be a multi-task loss structured in such a way that timesteps get extra weight depending on their difficulty (e.g. more difficult timesteps are weighted higher, less difficult ones lower), so that all timesteps contribute the same amount towards the image. The weights would be trained as a tensor with one element for each timestep. The main problem with this is that it is not very stable at the smaller batch sizes that anyone who isn't training a foundation model would use - instead of optimizing the parameters, what you'd end up doing is effectively smacking them across the room every now and then, since you're not going to touch every timestep on every training step. It is not likely to converge properly unless using a batch size of at least 256 or so (a very conservative estimate; it probably needs more), and it is inefficient for learning, since we would expect nearby timesteps to have similar difficulty and this doesn't capture that.

The EDM2 paper (repo: https://github.com/NVlabs/edm2) includes a different form of this multi-task loss: a single-layer MLP with no activation function that takes the noise level (sigma) as input. While this was designed to be used with a continuous noise schedule like the EDM models use, I have found that it works extremely well on discrete timestep schedules, and, most importantly for us, it allows multi-task loss to work efficiently at smaller batch sizes. From my testing (still ongoing), I have found that this drastically improved results over the debias schedule you noted that you used before, and, interestingly, the weightings it chose did not look too much like other weightings I had been recommended before. I have also had wonderful results with scheduled pseudo-huber loss in combination with this.

One remaining problem with the learned timestep weightings, though, is that they will most likely take longer to converge than most short training runs allow. My tests so far have been with a full finetune at a virtual batch size of 64. I do get good, fast convergence with this method when I use the recently released Schedule Free optimizer (https://github.com/facebookresearch/schedule_free/tree/main), where I get fairly close to the final schedule within about 500 steps at an LR of 0.005. Regardless of whether it is viable for all training run durations in your use cases, I am sure that you could use it on a longer training run for the purpose of discovering better timestep weight schedules.

The MLP is also formulated in a way that it accepts a "baseline" timestep weighting of sorts (noted in the paper's formulas as
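For anyone who wants to experiment, here is a rough sketch of one way the EDM2-style weighting can be adapted to discrete timesteps: a single linear layer (no activation) on Fourier features of the normalized timestep predicts a log-variance u, and the weighted loss is loss / exp(u) + u. This is illustrative, not the exact code used in these tests:

```python
import torch
import torch.nn as nn

class LearnedTimestepWeighting(nn.Module):
    """EDM2-style uncertainty weighting u(t): weighted_loss = loss / exp(u) + u."""
    def __init__(self, num_freqs=16):
        super().__init__()
        self.register_buffer("freqs", 2.0 ** torch.arange(num_freqs))
        self.linear = nn.Linear(2 * num_freqs, 1)  # single layer, no activation
        nn.init.zeros_(self.linear.weight)
        nn.init.zeros_(self.linear.bias)

    def forward(self, per_sample_loss, timesteps, num_train_timesteps=1000):
        t = timesteps.float() / num_train_timesteps          # [B] in [0, 1)
        feats = t[:, None] * self.freqs[None, :]             # Fourier features of the timestep
        feats = torch.cat([feats.sin(), feats.cos()], dim=-1)
        u = self.linear(feats).squeeze(-1)                   # learned log-variance per sample
        return per_sample_loss / u.exp() + u                 # down-weights hard timesteps
```

The weighting module's parameters just get added to the same optimizer as the network being trained; once it has converged, exp(u(t)) is itself a readable estimate of how hard each timestep is.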
-
I've developed a way to do posthoc analysis of a lora to see WHERE it improved loss on your training set. The basic idea is simple: take a training dataset, a lora, and a model. Load each training sample, noise it at every 50th timestep, do noise prediction, and take the loss as the standard MSE.

Here are the samples at the 20th epoch. Observations: fine details, very warm overall tone, blurry and low-detail backgrounds. The color depth feels a bit flat, but the textures are decent.

And here's the loss ratio plot. What I've found is that the stock training regime is good at reducing loss at higher timesteps, but has a much harder time with lower timesteps. Forgive the lack of labels; the X axis is "timestep / 50" (ranging from 0-20, which expands to 0-1000), and the Y axis is the ratio of baseline loss to lora loss (higher means the lora reduced loss more). (Edit: I realized this morning that I was using baseline/lora rather than lora/baseline, so that changes my interpretations, which I've updated.)

Additionally, I can plot statistics PER SAMPLE to find pathological samples in my dataset which are not converging! The red line is a ratio of 1.0, and the box plot shows the loss reductions across all 20 sampled timesteps, with the typical mean, median, and 1 SD. This is really useful for finding samples which the optimizer has outsized trouble converging on.

By comparison, here's a run where I experimented with using

Observations from these samples: much more neutral colors, better dynamic range, but the likeness isn't quite as good. The teeth are better (and this has held through my experiments; the teeth overtrain first, but with this technique they remain fine the whole time).

And here are the loss plots. The loss on the high end has tangibly improved. If I let this training run go for 60 epochs, it does a GREAT job at learning structure and form, but doesn't quite get details. What's interesting here is that the tail end flipped, but the loss change as a percentage on the low end didn't change much at all. This might be due to the lower absolute values on the high end, but it's interesting that the first part of the curve didn't change much.

Here's the notebook. It should go in your sd-scripts directory, as it uses a few utility functions from sd-scripts to ease model loading. Right now it only works with SD1.5, but it shouldn't be hard to extend it to SDXL or whatnot. My hope is that lessons learned in SD1.5 land can be applied to SDXL, since SD1.5 is a lot faster to run experiments with.
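For anyone who'd rather not pull the notebook apart, here's a bare-bones sketch of the core loop. unet_base, unet_lora, noise_scheduler, latents, and conds stand in for objects the notebook builds via sd-scripts utilities; this is an illustration, not the notebook's exact code:

```python
import torch
import torch.nn.functional as F

@torch.no_grad()
def loss_ratio_by_timestep(unet_base, unet_lora, noise_scheduler, latents, conds, step=50):
    """Return lora_loss / baseline_loss at every `step`-th timestep.

    latents: encoded training samples stacked into one batch; conds: matching text embeddings.
    """
    device = latents.device
    ratios = {}
    for t in range(0, noise_scheduler.config.num_train_timesteps, step):
        timesteps = torch.full((latents.shape[0],), t, device=device, dtype=torch.long)
        noise = torch.randn_like(latents)
        noisy = noise_scheduler.add_noise(latents, noise, timesteps)
        base_pred = unet_base(noisy, timesteps, conds).sample
        lora_pred = unet_lora(noisy, timesteps, conds).sample
        base_loss = F.mse_loss(base_pred, noise).item()
        lora_loss = F.mse_loss(lora_pred, noise).item()
        ratios[t] = lora_loss / base_loss  # < 1 means the lora reduced loss at this timestep
    return ratios
```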
-
I am pretty sure that I've directly identified a cause of the original observation in this issue. Essentially: given a static noise, forward-noising a latent with that noise, and then predicting the noise from that noised latent, earlier timesteps consistently end up with a lower overall magnitude of predicted noise.

with torch.no_grad():
    timesteps = torch.arange(0, 1000, 25, device=dev)
    latent = encode_path_sd15(image, 128)
    latent = latent.expand(timesteps.shape[0], *latent.shape[1:])
    # One static noise, reused for every timestep
    noise = torch.randn((1, *latent.shape[1:]), device=latent.device).expand(latent.shape)
    _, text_embeddings = prompt_to_cond(["man"], latent.unsqueeze(0))
    noisy_latents = noise_scheduler.add_noise(latent, noise, timesteps)
    noise_pred = unet(noisy_latents, timesteps, text_embeddings.cuda()).sample
    # Overlay true vs. predicted noise histograms, one panel per timestep
    fig = plt.figure(figsize=(20, 50))
    for i, timestep in enumerate(timesteps):
        ax = plt.subplot(10, 4, i + 1)
        ax.set_title(f"Timestep {timestep}")
        n = noise[i].flatten().cpu()
        ax.hist(n, alpha=0.5, bins=100, color="red")
        ax.hist(noise_pred[i].flatten().cpu(), alpha=0.5, bins=100)
    plt.show()

Red is the true noise (held constant), and blue is the predicted noise. You'll notice that at t=0 the predicted noise histogram has a significantly narrower distribution, and the magnitude of the predicted noise increases as the timestep increases. This will plainly lead to significant differences in loss for lower timesteps.

If we plot the gap directly, it is essentially the difference in magnitude between the true noise and the predicted noise at each timestep for a given static noise. I regressed this curve to a rough closed-form approximation.

I tried scaling noise_pred directly first, and this does something very interesting: it causes what feels like a relative "hyperfocus" on detail, resulting in over-sharpened (one might even say "overtrained") images. This is after only 4 epochs, but the pattern holds.
But, okay, if the issue is just magnitude, we can directly normalize noise_pred to the standard deviation of the noise:

noise_pred = noise_pred / noise_pred.std() * noise.std()  # noise.std() should be pretty close to 1 here, so maybe unnecessary?

This results in a significantly higher level of subject detail (and feels like perhaps the most photoreal result I've achieved yet), but the background basically entirely disappeared in all my samples. My first thought was that it might be related to masked training, but I wouldn't think so, since noising and noise_pred normalization are both applied without respect to masks. At any rate, there's a dial here to play with. I suspect there's more to the nature of noise vs noise_pred than just the difference in magnitude, but the noise is definitely a good clue.

Perplexingly, normalizing noise to noise_pred's std (which should produce similar loss values, I think?) does NOT produce similar results:

noise_pred = unet.call(...)
noise = noise / noise.std() * noise_pred.std()

I don't have an explanation for this, so there's clearly something else in play here that I'm missing.
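One compact way to look at the same gap - reusing the noise, noise_pred, and timesteps tensors from the histogram snippet above (before any of the normalization experiments) - is to reduce each histogram pair to a single std ratio per timestep:

```python
import torch
import matplotlib.pyplot as plt

# Per-timestep std of the prediction vs. the (constant) true noise, as one curve.
with torch.no_grad():
    pred_std = noise_pred.flatten(start_dim=1).std(dim=1).cpu()
    true_std = noise.flatten(start_dim=1).std(dim=1).cpu()

plt.plot(timesteps.cpu().numpy(), (pred_std / true_std).numpy(), marker="o")
plt.xlabel("timestep")
plt.ylabel("noise_pred std / noise std")
plt.show()
```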
-
Okay, so initial runs show extreme promise. I've added multiple additional loss objectives:

- std loss: match the per-sample standard deviation of noise_pred to that of the true noise
- KL divergence loss: match the histogram of noise_pred values to the histogram of the true noise
- skewness loss: match the skew of noise_pred to that of the true noise
- kurtosis loss: match the kurtosis of noise_pred to that of the true noise

(Edit: I'm doing more testing, and it might be that the kl_div loss alone is sufficient for this effect; it keeps us in the right "neighborhood" but allows more flexibility.)

They can be tested individually or combined. Each has a weight, and the individual objectives are summed and added to the overall loss. The results are really, REALLY good.

# in train_utils.py
def noise_stats(noise):
    # Skewness and excess kurtosis of the noise tensor, computed over the whole batch
    mean = noise.mean(dim=(1,2,3)).view(-1, 1, 1, 1)
    std = noise.std(dim=(1,2,3)).view(-1, 1, 1, 1)
    skew = torch.sum((noise - mean)**3 / std**3) / (noise.numel() / noise.shape[0])
    kurt = torch.sum((noise - mean)**4 / std**4) / (noise.numel() / noise.shape[0]) - 3
    return skew, kurt

def stat_losses(noise, noise_pred, std_loss_weight=0.5, kl_loss_weight=3e-3, skew_loss_weight=0, kurtosis_loss_weight=0):
    # Penalize differences in per-sample std between the prediction and the true noise
    std_loss = torch.nn.functional.mse_loss(
        noise_pred.std(dim=(1,2,3)),
        noise.std(dim=(1,2,3)),
        reduction="none") * std_loss_weight
    skew_pred, kurt_pred = noise_stats(noise_pred)
    skew_true, kurt_true = noise_stats(noise)
    skew_loss = torch.nn.functional.mse_loss(skew_pred, skew_true, reduction="none") * skew_loss_weight
    kurt_loss = torch.nn.functional.mse_loss(kurt_pred, kurt_true, reduction="none") * kurtosis_loss_weight
    # Per-sample histograms of predicted vs. true noise values, compared via KL divergence
    # (torch.histc is not differentiable; see the edit below about a soft-histogram replacement)
    p1s = []
    p2s = []
    for i, v in enumerate(noise_pred):
        n = noise[i]
        p1s.append(torch.histc(v.float(), bins=500, min=n.min(), max=n.max()) + 1e-6)
        p2s.append(torch.histc(n.float(), bins=500) + 1e-6)
    p1 = torch.stack(p1s)
    p2 = torch.stack(p2s)
    kl_loss = torch.nn.functional.kl_div(p1.log(), p2, reduction="none").mean(dim=1) * kl_loss_weight
    return std_loss, skew_loss, kurt_loss, kl_loss

# in train_network.py
std_loss, skew_loss, kurt_loss, kl_loss = train_util.stat_losses(noise, noise_pred)
loss = loss + std_loss + kl_loss + skew_loss + kurt_loss
loss = loss.mean()  # this is a mean, so no need to divide by batch_size

Charting all those metrics - std, skew, kurtosis, and kl_div - shows that despite the classic loss objective improving, various other metrics go wonky as training continues. But we KNOW that the desired noise target has a consistent std, skew, and kurtosis. My hunch here is that models overtrain by learning noise predictions which do not resemble IID gaussian noise, and that using metrics like std, skew, and kurtosis as objectives keeps them from getting too far afield.
(Edit: I'm using torch.histc for the divergence, which isn't differentiable, so it's not working quite correctly WRT the backwards pass. I'm reimplementing it with a soft histogram instead.)

This DOES unfortunately add four (!) new hyperparameters - weights for each new loss type - but so far the values I've used are producing astonishingly good results. After 20 epochs, I'm getting remarkably good samples, with no sign of overtraining yet. I'm still running a training run, and don't have the VRAM to run analytics on a trained lora while it's going, but I'll get some timestep error distribution graphs once this run completes. I would very much like others to give this a go and let me know what kinds of results you end up with.

Edit: Here's the timestep error graph after 36 epochs. It's still a little funny at the extremes, but this is a SIGNIFICANT improvement overall.
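On the differentiability point in the edit above, here's one possible shape for a soft histogram that does pass gradients (Gaussian-kernel binning); the bin count, range, and bandwidth are illustrative, not necessarily what ends up in the branch:

```python
import torch

def soft_histogram(x, bins=500, min_val=-4.0, max_val=4.0, bandwidth=None):
    """Differentiable histogram: each value contributes a Gaussian bump to nearby bins."""
    centers = torch.linspace(min_val, max_val, bins, device=x.device)
    if bandwidth is None:
        bandwidth = (max_val - min_val) / bins
    # [N, bins] soft assignments, summed into a [bins] histogram.
    weights = torch.exp(-0.5 * ((x.reshape(-1, 1) - centers) / bandwidth) ** 2)
    hist = weights.sum(dim=0)
    return hist / (hist.sum() + 1e-6)  # normalize so it can feed kl_div as a distribution
```

Feeding soft_histogram(noise_pred).log() and soft_histogram(noise) into torch.nn.functional.kl_div then behaves like the histc version, but backpropagates into noise_pred.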
-
Hey, you might want to check out this recent NVIDIA paper: https://research.nvidia.com/labs/toronto-ai/AlignYourSteps/

It looks like it could effectively be a way to handle this issue on the inference end -- or at the very least, you could gain some useful insights relevant to your problem by reading the paper and experimenting with the schedule they provide (keeping in mind that it is dataset- and model-dependent). They unfortunately haven't released the "training" code (it isn't really training, it's a zeroth-order optimization), but I've collaborated with someone to replicate what I am pretty sure is a valid and correct implementation, and I am experimenting with it.
-
@cheald thanks for your diligent work on this avenue of research.

Based on my empirical results and my interpretation of what is being trained, in particular by LoRA fine-tunes, masked loss does not ignore backgrounds. It will still learn something for them - i.e., for a region of noisy pixels, a function that makes that region trend towards "gray". Another interpretation is that error is increasing for the backgrounds after every iteration and simply being ignored. My hypothesis for why masked training works well is that for many of the subjects trained by the community, the number of steps needed to achieve decent results does not add "too much" error to the backgrounds. Specifically, conditional U-Net LoRAs for a face that already looks like a celebrity need only small changes from identity (aka 0s) for good performance, and near small untrained / random values the amount of "damage" done to the weights computing backgrounds is relatively small. If you use masked training for many other subjects, it tends to blow up the backgrounds, simply because the parameters - via whatever fine-tuning method - have to "actually" learn something.
-
Observations from tonight's tests:
I've updated my branch with most of those changes, and am very pleased with the results. As an aside: charting the effects of
-
I am trying to understand: how does the masking, which is in 2D pixel space, get expressed in the latent space?
-
Some observations from a few training runs (SDXL LoRA) with a very high number of steps, using the proposed changes:
At this stage I am unsure of the root causes; it could be a poor choice of hyper-parameters, or these could be over-training artifacts beyond what the proposed changes are able to compensate for.
-
Here is an example, taken with the RealVisXL_4.0 model (I get the same artifacts with SDXL base). Same prompt, without the Lora:
-
I've tried different runs with

I also wonder if it would make sense to have some kind of schedule for this parameter, using very low values at low timesteps.
-
My further experimentation is turning up that:
For reasons I don't quite understand yet, increasing the variance of the noise (which is what the first loss term does) causes a loss of detail. This is somewhat confusing to me, because I'd intuitively guess that a wider range of noise would result in a more diverse set of colors - and essentially detail. However, it improves subject fidelity, by my eye. The second term keeps the variance of any individual channel in the latent in check, which should help clamp down on "wild outlier" values. By itself, it does a really, really good job of causing the model to preserve detail, but hurts in terms of subject fidelity. (Thought: perhaps a wider range of negative values is leading to that color flattening? It might be interesting to play with tweaking the positive and negative sides of the noise separately!)

Armed with that information, it might be that we could schedule some combination or crossfade of those terms, with the model learning to push noise_pred variance towards noise.std() early on, and then switching to minimizing channel variance later on; a toy sketch of that kind of schedule follows below.

Compressing the noise_pred variance too much results in "hyper-detailed" outputs, but extending it too far (even towards 1, which is the true noise variance) results in the model losing too much detail and becoming very flat. The reason for this is not immediately clear to me; intuitively, I would expect a noise_pred most like the true noise to produce the highest-fidelity results, but it seems that this is not the case. If anyone has a theoretical explanation for that, I'd be very interested in exploring what's going on there. This does kind of explain the noise variance across timesteps, though: at high timesteps, the learned variance is closer to true and produces "low detail" results, but as you get closer to t=0, the noise_pred variance drops, resulting in increased detail in the output.

Alternately, scheduling the various loss factors by some timestep-variant scheduler might result in better overall results, too.
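As a placeholder for what that kind of schedule could look like (purely illustrative - the linear ramp, and the choice of driving it by training progress rather than by timestep, are both arbitrary):

```python
def crossfade_weights(global_step, total_steps):
    """Crossfade two loss-term weights over the course of training.

    Early on, favor pushing noise_pred std toward noise.std(); later, shift the
    weight onto the per-channel variance term. The same ramp could be driven by
    timestep instead of global step.
    """
    frac = min(global_step / max(total_steps, 1), 1.0)
    w_std_match = 1.0 - frac
    w_channel_var = frac
    return w_std_match, w_channel_var
```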
-
I've got an alternate loss form which is working remarkably well for me, and which I'd like some input on:

alphas_cumprod = noise_scheduler.alphas_cumprod.to(accelerator.device)
ac = alphas_cumprod[timesteps].view(-1, 1, 1, 1)  # reshaped so the per-sample factor broadcasts over (C, H, W)
mae_loss = F.l1_loss(noise_pred, target, reduction="none")
base_loss = 1/-mae_loss.exp() + 1  # equivalent to 1 - exp(-MAE): a saturating loss in [0, 1)
loss = base_loss.mean(dim=(2, 3), keepdims=True) * ac
loss = loss + base_loss.std(dim=(2, 3), keepdims=True) * (1 - ac)

The basic idea here is:
The operating theories here are:
My overnight test results with this are really interesting. I tried both the "variance at low t" and "raw loss at low t" forms, and I think the "raw loss at low t" form is generally better, though both are impressive. The suggested formulation here DOES seem to learn my actual underlying dataset more easily, which suggests better convergence (and perhaps the need to actually reduce my lora rank!).

Results after 50 epochs: normally I'd see severe overtraining by now - visible in the wrinkles around the eyes and/or the teeth - or a loss of detail and dynamic range. Both seem to be preserved.
-
What I don't get is how I can have average-to-fantastic loss YET nothing (lora/locon/loha/dreambooth) learns the data?
-
Do you want to try this method?
-
I'm back, and I think I have some really cool stuff to share.

(SD 1.5 samples, 20 steps, euler a)

I've been tinkering with this over the last few weeks, and I think I've gained some genuine insight into the problem that has massive implications for training fidelity and convergence time. TL;DR: match the per-step standard deviations and means of your input noise, per channel, to the model you're trying to train, and things happen.

A huge portion of the variant loss is actually legit - it's caused by the fact that we generate noise at (mean=0, std=1), but for whatever reason the model learns to predict noise which consistently has a std other than 1, especially closer to t=0. If we actually measure the per-channel std and mean of noise predictions made by the unmodified model, some obvious patterns appear, and they vary per model and per architecture.

Realistic Vision 5.1 (SD1.5):

SDXL models are wildly different - DreamshaperXL:

It was clear that the standard training mechanism isn't fully correct - in particular, we know that Stable Diffusion, left unmodified, learns a mean of 0 for images, resulting in the need for the famous "noise offset" regularization scheme. This has been corrected in downstream models (like RV), which include noise offsets, but we still train with a "blind" noise offset, which may or may not match the underlying model. Worse yet, the actual true mean and std vary per channel and per timestep. The good news is that they appear to broadly follow a curve. This means that we can measure and interpolate, and use those observed values to affect training. I'm playing with this at two locations:

Rule of thumb: higher std weight = more detail; too much = "oversharp". Higher mean weight = more light/shadow depth; too much = color imbalance and contrast blowout.

I've got this implemented in my autostats branch if you want to try it. The important part is the addition of a couple of parameters, which indicate that the process should collect model noise statistics and persist them to a file. This is done so that subsequent runs can just reuse the collected stats. (This could probably be extracted to a separate utility script with a minimum of fuss, too.)

This currently runs 16 inferences at 64 (non-uniform) steps per inference, which is probably higher resolution than is needed, but provides robust numbers. It only takes a couple of minutes on SD1.5, but takes 15-20 minutes for SDXL models on my RTX 3090. However, it only has to be done once per model. More inferences and more varied prompts (particularly prompting for various levels of detail and light/shadow) may help improve stats collection, but I haven't played with it too much.

@recris This series of experiments did very clearly identify the cause of the halftone pattern. In SDXL models, the lower you drag the mean of channel 3 of the latent, the more detail you get, but you ALSO get the halftoning. Lowering channel 3 (the last one - the channels are numbered [0, 1, 2, 3]) manifests largely as a "sharpen" slider, and too sharp results in imagined detail like what you're seeing in your examples. It is easily corrected by dampening how far channel 3 is pulled from a mean of 0.

Also in SDXL, channel 0 is largely "luminosity"; increasing the mean brightens the image, and reducing it darkens it. As you get more extreme, this tends to have "over-contrasting" effects. If you just want to increase color depth, jittering the mean of channel 0 noise

Here's an example of an SDXL training (1 epoch) with mean weights at

Kazam_screencast_00007.mp4

I have successfully mitigated this in SDXL by dampening the effects of the early-timestep mean offsets on channel 3:

if self.is_sdxl:
    ts = 500
    mean_target_by_ts[:ts, 3] = mean_target_by_ts[:ts, 3] * torch.arange(0, 1.0, 1 / ts, device=mean_target_by_ts.device).view(-1, 1, 1)

This very much does seem to prevent the halftone over-sharpening while maintaining the majority of the detail. However, this is SDXL-specific, and I didn't love adding it as a general term. I might still add it as an SDXL-specific term, but it feels like a hack that wallpapers over some deeper understanding of what's happening.
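For the curious, here's a conceptual sketch of what the stats-collection pass amounts to: a plain denoising loop over the base model that records the per-channel mean/std of noise_pred at every step and averages over several runs. It assumes diffusers-style scheduler/unet objects, and it is not the actual autostats implementation, which differs in detail:

```python
import torch

@torch.no_grad()
def collect_noise_pred_stats(unet, noise_scheduler, text_conds, latent_shape,
                             num_runs=16, num_steps=64, device="cuda"):
    """Record per-channel mean/std of the base model's noise predictions per timestep."""
    noise_scheduler.set_timesteps(num_steps, device=device)
    stats = {}  # timestep -> list of (mean[C], std[C])
    for _ in range(num_runs):
        latents = torch.randn(latent_shape, device=device)
        for t in noise_scheduler.timesteps:
            noise_pred = unet(latents, t, text_conds).sample
            mean = noise_pred.mean(dim=(0, 2, 3)).cpu()  # per channel
            std = noise_pred.std(dim=(0, 2, 3)).cpu()
            stats.setdefault(int(t), []).append((mean, std))
            latents = noise_scheduler.step(noise_pred, t, latents).prev_sample
    # Average over runs; interpolate across the remaining timesteps afterwards.
    return {t: (torch.stack([m for m, _ in v]).mean(0),
                torch.stack([s for _, s in v]).mean(0))
            for t, v in stats.items()}
```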
-
Potentially of interest here
-
I've gone through a whole host of experiments this weekend, but the most promising is this: adding a loss function for the norm of the text encoder conds REALLY helps.

I'm experimenting with SD1.5 (Realistic Vision 5.1, specifically). I've been chasing a whole bunch of various manipulations of the noise, but I've been unable to come up with fixes which generalize. Then I started messing with the text encoder, and things clicked. For some conceptual overview:

Okay, so with that theory understood: during training, we embed a given caption, then we get a loss value from the unet that tells us how far away we were from correctly predicting the noise. The unet's prediction is conditioned on the text encoder output, so if you're training the text encoder (or a Lora which modifies it), the training loop changes the embedding that the text encoder produces, to try to give the unet updated conditioning which improves its guess next time.

If left unconstrained, the text encoder can learn to improve loss by pushing the embedding for a given caption out of the "highly populated" concept space into a more unique part of the embedding space which the unet can more easily learn. The problem with this, though, is that SD seems to have a fairly narrow "aesthetic" range of embeddings, clustered around vectors of a certain length. By learning an embedding further out of the "normal" range, we can more easily reduce loss (because it's learning an uncontested part of the embedding space with less concept bleed), but it does so by moving the embedding further away from our learned concepts.

Here are some examples using a simple custom ComfyUI node that I can use to extend or shrink the length of an embedding:

At "natural" embedding length:
Embedding * 0.5 (same direction - pointing at the same concepts - but only half the length):
Embedding * 0.75:
Embedding * 1.25:
Embedding * 1.5:

You can see that we're keeping all the same concepts, but changing the vector length has marked impacts on both prompt cohesion and aesthetic quality. I think this is a large part of the problem with training subjects which don't look like existing subjects in the model. The trainer learns:
This is desirable behavior when training a new model, or when fine-tuning on a lot of novel data, but it is less desirable when just trying to integrate a new subject. But we can easily tell SD to "learn this subject, but keep it aesthetic" by just constraining the TE norm!

# Prior to the training loop
def embed_caption(captions):
    return get_weighted_text_embeddings(
        tokenizer,
        text_encoder,
        captions,
        accelerator.device,
        args.max_token_length // 75 if args.max_token_length else 1,
        clip_skip=args.clip_skip,
    )

all_caps = []
with torch.no_grad(), tqdm(total=10 * len(train_dataloader), desc="Collecting text encoder stats") as pbar:
    for i in range(0, 10):
        for batch in train_dataloader:
            caps = embed_caption(batch["captions"])
            pbar.update(1)
            all_caps.append(caps)
cap_norm_mean = torch.cat(all_caps).norm(p=2, dim=(-1)).mean(dim=0).unsqueeze(0)

# Inside the training loop
deadzone = 0.0
te_loss_weight = 1.0
te_nrm_loss = (F.mse_loss(text_encoder_conds.norm(dim=(-1)), cap_norm_mean, reduction="none") - cap_norm_mean*deadzone**2).clamp(min=0).mean(dim=1).view(-1, 1, 1, 1)
loss = loss + te_nrm_loss * te_loss_weight

This graphs the length of the vector between the original embedding of "chris" and the learned embedding. This will include both angular distance (change in concept relatedness) and length. Here, the red line was te_loss_weight=1.0, and it indicates that the embedding stabilizes at a distance of ~8 from the original. If you think of the embedding space as a sphere, this means that the "radius" of the sphere is being held constant, and the trainer is finding a new spot on that sphere. The pink line is a weight of 0.1, and you can see that this distance is divergent (and keeps diverging!) - this weight is probably too low. It's substantially improved the aesthetic quality of my training, but I can probably bring it up a bit. With a high enough weight (keeping the embedding norm constrained), you could read that metric as a "how good your captions are" metric, too.

There might also be some gains to be had by combining this with noise std prediction loss terms, but for the time being, I think this shows significantly more promise for resolving training on stubborn datasets. I think this might substantially change the recommendations for text encoder/unet learning rate ratios. By restraining the embedding norm, once it's at the right "angle", continued noise losses (due to the unet still learning) won't have the effect of altering the embedding length, and should result in both faster convergence and better aesthetics.
-
Until we get this into the Kohya trainer, this is all moot, sadly - and I highly doubt Kohya will do much with the trainer these days, given the level of work this requires.
-
A couple of new techniques to try:

Rank estimation via SVD of the model layers

def get_rank(w, cutoff=0.3):
    U, S, V = torch.svd(w.flatten(start_dim=1).to(device="cuda", dtype=torch.float32))
    cumsum = S.cumsum(0) / S.sum()
    rank = (cumsum > cutoff).nonzero()[0].item()
    del S, V, U, cumsum
    return max(rank, 4)

class LoRAModule(torch.nn.Module):
    # ...
    self.lora_dim = get_rank(org_module.weight, target_pct)

Here, cutoff is the fraction of the cumulative singular-value mass at which the rank is cut off. At cutoff=0.35, I'm getting (as a quick small subset of layers):

These are ~900MB float32 checkpoints (from an SD1.5 model), so obviously on the larger side for a typical LoRA, but the cutoff value could be moved up or down easily enough.

Dynamic alpha and pre-computed per-layer alphas

After the separate layer-scale experiments, I've moved back to just training the alpha parameter. If we think of this in terms of a ratio of the lora_dim, then the general algorithm is:
This gives us a way to actually estimate relative importance per layer. After running for 24 epochs, I get something like this. The blue bar is the final alpha ratio for that layer (essentially, what you would multiply lora_dim by to get the actual alpha for that layer). Orange is the shift from the initial value (1.0, in this case).

This is interesting for a couple of reasons. First, it's obvious that not all layers are contributing equally to the learning task. Layers under a given final alpha could probably be dropped to conserve the number of parameters trained; some experimentation is warranted, but it's likely that we could slim down the set of layers trained, or reallocate their parameters to the more impactful layers.

Second, and WAY more interestingly, this essentially gives me an estimate of the optimal LR per layer. Remember, my initial LRs were 1e-4 and the initial alpha was 1.0, and alpha * lr is approximable as the effective learning rate. What this technique does is effectively perform per-layer automatic LR adjustment! For example, I can see that my TE layers broadly settle on an alpha of 0.05-0.1, which suggests that the 1e-4 learning rate is 10-20x the ideal LR for these layers relative to the learning rate of the other layers. However, I think this will let me essentially select whatever global LR I want, and automatically scale layer LRs relative to each other.

By running multiple 24-epoch trainings, taking the checkpoint from the 24th epoch, computing these alphas, and then using them as the initial alphas for the next run, I get dramatic improvements in training quality without adjusting any other hyperparameters. Additionally, the network gets "good" much, much faster with each iterative adjustment of alphas, and the offsets from the initial alphas drop, suggesting that there is in fact an ideal set of alphas per layer.

The general idea here is just:

class LoRAModule:
    # ...
    alpha = ALPHAS.get(self.lora_name, 1.0) * self.lora_dim
    self.alpha = torch.nn.Parameter(torch.tensor(alpha).float())

I'm loading ALPHAS with the per-layer ratios computed on the previous run. I did have to crank the LR for the alphas way up - 1000-5000x the base rate (which works out to 0.1-0.5 for my use case). Despite that, the alphas DO converge. After 24 epochs on the second run:

It's worth noting that this is without training the LayerNorm/GroupNorm layers, or the linear/conv2d bias layers. I am training the text embedding layer here, but I'm going to run some tests without it, too.

Here are some quick examples of samples from training. First, two of my ground truth images:

And here are samples from two runs, using the learned alphas from run 1 as the initial alphas for run 2. No other parameters were changed between runs. Columns are samples at epochs 6 and 24; rows are runs 1/2.

As with most of these experiments, this introduces a potential confounder for all the other lessons learned; if various training problems are caused by certain layers over- or under-training relative to each other, then learning alphas first might mitigate many of them.
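A sketch of how the learned alphas could be pulled back out of a saved checkpoint and turned into the ALPHAS table for the next run; it assumes the usual `<name>.alpha` / `<name>.lora_down.weight` key layout of these LoRA files, and the JSON round-trip is just illustrative:

```python
import json
from safetensors.torch import load_file

def extract_alpha_ratios(lora_path, out_path="alphas.json"):
    """Compute alpha / lora_dim per layer from a trained LoRA checkpoint."""
    sd = load_file(lora_path)
    ratios = {}
    for key, value in sd.items():
        if key.endswith(".alpha"):
            name = key[: -len(".alpha")]
            down = sd.get(f"{name}.lora_down.weight")
            if down is not None:
                lora_dim = down.shape[0]                # rank of this layer
                ratios[name] = float(value) / lora_dim  # ratio to feed back in as ALPHAS
    with open(out_path, "w") as f:
        json.dump(ratios, f, indent=2)
    return ratios
```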
-
I've got one particularly interesting note for people to play with here. tl;dr: I think that LoRA training is fundamentally flawed, and biases heavily towards doing most of its learning in the largest (by element count) layers of the network.

LoRAs consist of, per layer, a project-down matrix and a project-up matrix, of shapes [in, lora_dim] and [lora_dim, out], respectively. In Kohya, the project-down matrix is initialized with

However, I think we have a much larger problem: under the current training mechanism, effective learning rates are inconsistent across layers because of the disparity in the sizes of the up and down matrices between layers. This is made worse by the fact that rank selection mutates the effective learning rate of the layer relative to other layers, making it more difficult to evaluate rank selection independently. Let me explain.

First, set aside Adam's learning rate adaptation for the moment (I think that's a whole 'nother mess of trouble). When we fine-tune a model with full weights, we start with a weight matrix and step it directly by its gradient (scaled by the learning rate). However, when we learn a decomposed approximation of that matrix as a project-up/project-down pair, we step each of the two factors by its own gradient instead. The equivalent update to the full matrix is implied by the updates applied to the two factors (in the example below, the product of the two scaled gradients), and its norm can look wildly different. Here's an exaggerated example:

import torch
from torch.nn import functional as F
torch.manual_seed(1)
input_t = torch.randn(960, dtype=torch.float64)
A = torch.randn(64, 960, requires_grad=True, dtype=torch.float64)
B = torch.randn(320, 64, requires_grad=True, dtype=torch.float64)
t = (B@A).clone().detach().requires_grad_(True)
lora_output = F.linear(F.linear(input_t, A), B)
full_output = t @ input_t
print("Matrices match?", torch.allclose(lora_output, full_output))
lora_output.mean().backward()
full_output.mean().backward()
with torch.no_grad():
    lr = 100.0
    t_u = t.grad * lr
    a_u = A.grad * lr
    b_u = B.grad * lr
    print("t_u.norm", t_u.norm())
    print("A_u @ B_u.norm", (b_u @ a_u).norm())

# Matrices match? True
# t_u.norm tensor(176.6249, dtype=torch.float64)
# A_u @ B_u.norm tensor(130243.2076, dtype=torch.float64)

You can see that the implied weight update from the LoRA approach is massively larger than the fine-tune update. This is, in short, because we're stepping both A and B by the learning rate times their own gradients. In isolation, this isn't a problem - you just pick a new lr that fits your problem domain. However, this is a very big problem for practical LoRA training, because we're training a whole bunch of layers with different geometry and norms. The effect of this is that the matrices which produce gradients with larger norms will change the output of the model at a significantly faster rate - orders of magnitude faster, perhaps - than the smaller layers. This essentially guarantees that LoRA training will concentrate most of the learning in those large layers, and will overtrain long before the small layers can begin to exert any significant influence.

I'm trying to work out how to compensate for this, but I'm running short of ideas. However, my intuition is that if we can scale the grads of A and B correctly, we can help prevent large layers from dominating training. I've tried just dividing the grads by grad.numel() (and grad.numel().sqrt()), but I don't think those are correct yet. Furthermore, because A and B are different parameters, Adam is going to learn different adaptive learning rates for each of them, which I suspect further muddles the problem. Ideally, we would use an Adam variant which takes its first and second moment estimations from
-
Recently I came across this approach to calculating loss with multiple objectives: https://github.com/TorchJD/torchjd

I wonder if this could improve results when combining MSE loss with std loss, instead of a straight sum.

Reddit discussion: https://www.reddit.com/r/MachineLearning/comments/1fbvuhs/r_training_models_with_multiple_losses/
-
I've been experimenting with different noising strategies, inspired in part by Noise Offset and Pyramid Noise.
This is the standard implementation of timesteps, which tells the noise scheduler how much of the noise to add to the latents.
timesteps = torch.randint(0, noise_scheduler.config.num_train_timesteps, (b_size,), device=latents.device)
This samples from the uniform distribution [0, 1000).
But something very interesting happens when you replace those random timesteps with a constant value: your loss variability becomes almost nonexistent!
(Deterministic training at timestep intervals from [100-900]; note the inverse exponential effect on loss)
Judging by our previous expectations of loss, very little training is expected to have occurred, but that is not the case.
(Timesteps 500 [center] and 600 are closest to my subject, with 200 coming in as a surprising third)
I'm still running tests to see what more I can glean from this, but in general I'm experiencing an unprecedented stability in training that I have a hard time explaining.
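For reference, the deterministic runs only require swapping the random sampling shown above for a constant; the fixed value here is just an example:

```python
# Deterministic variant: every sample in the batch gets the same fixed timestep.
fixed_t = 500  # swept across [100, 900] in the runs described above
timesteps = torch.full((b_size,), fixed_t, device=latents.device, dtype=torch.long)
```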