
Problems with DPO diffusion training: loss spikes and noised sample generation #6565

Closed
artificialguybr opened this issue Jan 13, 2024 · 11 comments

@artificialguybr

Describe the bug

Using the script located in diffusers/examples/research_projects/diffusion_dpo to run DPO on my Stable Diffusion XL model, I've seen a few bugs that vary a bit in the number of steps between one training run and another, but are always present. I ran more than 3 tests.

Generally, training runs normally up to a certain number of steps, with validation looking fine both in the loss graph and in the sample images. Then at some point, either before or, more commonly, after the checkpoint save, a strange bug causes the loss to spike enormously and the samples to become pure noise.

I reinstalled all the dependencies between one training run and the next to make sure it's not a dependency issue.

Here are three public W&B runs with more information:
Bugged before the checkpoint save at step 500:
https://wandb.ai/jvkkfa/diffusion-dpo-lora-sdxl/runs/6gt2xll2?workspace=user-jvkkfa

Bugged after step 350 and before step 500:
https://wandb.ai/jvkkfa/diffusion-dpo-lora-sdxl/runs/z2gem4to?workspace=user-jvkkfa

Latest run, bugged after step 700:
https://wandb.ai/jvkkfa/diffusion-dpo-lora-sdxl/runs/gyyzv1y6?workspace=user-jvkkfa

[Screenshots: loss curves showing the spike and the noised validation samples]
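
One way to narrow down where the spike originates is to log the per-step gradient norm inside the training loop. A minimal sketch of what that could look like follows; the variable names (accelerator, loss, params_to_optimize, optimizer, lr_scheduler, global_step) are assumptions that should roughly match the loop in train_diffusion_dpo_sdxl.py and may need adjusting:

# Minimal sketch: log the total gradient norm of the trainable (LoRA)
# parameters every optimizer step, so a spike in the loss can be matched
# against a spike in the gradients. Names are illustrative.
accelerator.backward(loss)
if accelerator.sync_gradients:
    # In recent accelerate versions clip_grad_norm_ returns the total norm
    # computed before clipping (it may return None on older versions).
    grad_norm = accelerator.clip_grad_norm_(params_to_optimize, max_norm=1.0)
    if grad_norm is not None:
        accelerator.log({"grad_norm": float(grad_norm)}, step=global_step)
optimizer.step()
lr_scheduler.step()
optimizer.zero_grad()

If the gradient norm jumps at the same step as the loss, lowering the learning rate or clipping more aggressively would be the first things to try.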

Reproduction

Normal DPO Run.

!accelerate launch train_diffusion_dpo_sdxl.py \
--pretrained_model_name_or_path="/home/ubuntu/mydrive/auto/stable-diffusion-webui/models/Stable-diffusion/ModelStep3" \
--pretrained_vae_model_name_or_path=madebyollin/sdxl-vae-fp16-fix \
--output_dir="Nebul-Redmond-SDXL" \
--mixed_precision="fp16" \
--dataset_name=kashif/pickascore \
--train_batch_size=8 \
--gradient_accumulation_steps=1 \
--gradient_checkpointing \
--use_8bit_adam \
--rank=8 \
--learning_rate=8e-6 \
--report_to="wandb" \
--lr_scheduler="constant" \
--lr_warmup_steps=0 \
--max_train_steps=2000 \
--checkpointing_steps=500 \
--run_validation --validation_steps=50 \
--seed="0" \
--report_to="wandb" \
--push_to_hub

I also tested other runs with 1e-5.

Logs

No response

System Info

  • diffusers version: 0.26.0.dev0
  • Platform: Linux-5.15.0-91-generic-x86_64-with-glibc2.35
  • Python version: 3.10.12
  • PyTorch version (GPU?): 2.1.2+cu121 (True)
  • Huggingface_hub version: 0.20.2
  • Transformers version: 4.36.2
  • Accelerate version: 0.24.0
  • xFormers version: 0.0.23
  • Using GPU in script?:
  • Using distributed or parallel set-up in script?:

Who can help?

@sayak

@artificialguybr artificialguybr added the bug Something isn't working label Jan 13, 2024
@patrickvonplaten
Contributor

cc @sayakpaul here

@sayakpaul
Member

I don't think it's a library-specific issue, though. It could very well be a case of exploring better hyperparameters. I think this is better suited as a discussion point, honestly.

@radames did a longer training run and obtained good results.
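
For reference, the Diffusion-DPO objective in the script boils down to roughly the following (a paraphrased sketch, not the exact code; the tensor names and the beta_dpo value are assumptions). Because beta_dpo is large, small differences between the model's and the reference model's MSE terms get heavily amplified inside the log-sigmoid, which is why the learning rate and beta interact strongly and are the first hyperparameters worth revisiting:

import torch.nn.functional as F

# Paraphrased sketch of the Diffusion-DPO loss. model_pred / ref_pred hold the
# noise predictions for the (preferred, rejected) image pair stacked along the
# batch dimension; target is the corresponding noise target.
model_losses = F.mse_loss(model_pred.float(), target.float(), reduction="none").mean(dim=[1, 2, 3])
model_losses_w, model_losses_l = model_losses.chunk(2)   # preferred, rejected
ref_losses = F.mse_loss(ref_pred.float(), target.float(), reduction="none").mean(dim=[1, 2, 3])
ref_losses_w, ref_losses_l = ref_losses.chunk(2)

model_diff = model_losses_w - model_losses_l
ref_diff = ref_losses_w - ref_losses_l

beta_dpo = 5000.0  # assumed default; large values amplify tiny MSE differences
inside_term = -0.5 * beta_dpo * (model_diff - ref_diff)
loss = -F.logsigmoid(inside_term).mean()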

@artificialguybr
Author

> I don't think it's a library-specific issue, though. It could very well be a case of exploring better hyperparameters. I think this is better suited as a discussion point, honestly.
>
> @radames did a longer training run and obtained good results.

It's strange because the problem happens even with the default 1e-5 config and other settings.

@sayakpaul
Member

It could very well be because of the funky low-data regime on which the scripts were tested (a disclaimer about that is already mentioned in the README).

I will let @radames comment here a bit, since he conducted a longer training run with a larger dataset.

@artificialguybr
Author

I'm available if you guys need more information or want more tests run.

@radames
Contributor

radames commented Jan 15, 2024

Hi @jvkap, here was my setup. The only big difference is that I was using the larger yuvalkirstain/pickapic_v2 dataset and the stabilityai/stable-diffusion-xl-base-1.0 base model. One note: the loss curve is a bit crazy, but I didn't get that spike. Also, at a larger number of steps the validation samples became full of texture artifacts, which I didn't like at all, so I ended up choosing the 5k-step checkpoint as the best result.

Are you using a fine-tuned model as your base model?

train_diffusion_dpo_sdxl.py \
--pretrained_model_name_or_path=stabilityai/stable-diffusion-xl-base-1.0 \
--pretrained_vae_model_name_or_path=madebyollin/sdxl-vae-fp16-fix \
--output_dir=diffusion-sdxl-dpo --mixed_precision=fp16 \
--dataset_name=yuvalkirstain/pickapic_v2 \
--dataset_split_name=train \
--train_batch_size=8 \
--gradient_accumulation_steps=2 \
--gradient_checkpointing \
--use_8bit_adam \
--rank=8 \
--learning_rate=1e-5 \
--report_to=wandb \
--lr_scheduler=constant \
--lr_warmup_steps=0 \
--max_train_steps=10000 \
--checkpointing_steps=500 \
--run_validation \
--validation_steps=50 \
--seed=0 \
--report_to=wandb \
--push_to_hub

[Screenshots: W&B loss curve and validation samples]

@Youngon

Youngon commented Jan 17, 2024

> yuvalkirstain/pickapic_v2

What format should the yuvalkirstain/pickapic_v2 dataset be processed into for training?

@sayakpaul
Member

You might have to use WebDataset here, for efficiency.
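
If converting to WebDataset shards isn't practical, the Hub dataset can also be streamed so that training starts without downloading every shard first. A rough sketch, assuming the pickapic-style columns (jpg_0, jpg_1, label_0, caption) that the DPO script reads; double-check them against the dataset card:

from io import BytesIO

from datasets import load_dataset
from PIL import Image

# Stream yuvalkirstain/pickapic_v2 instead of materializing it on disk.
ds = load_dataset("yuvalkirstain/pickapic_v2", split="train", streaming=True)

for example in ds.take(2):
    # jpg_0 / jpg_1 are assumed to be raw JPEG bytes, and label_0 to encode
    # which image was preferred (with 0.5 marking a tie).
    img_0 = Image.open(BytesIO(example["jpg_0"]))
    img_1 = Image.open(BytesIO(example["jpg_1"]))
    print(example["caption"], example["label_0"], img_0.size, img_1.size)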

@asrlhhh

asrlhhh commented Jan 18, 2024

I'm facing a similar issue, but it's in the opposite direction for me. I've noticed that the loss suddenly decreases to almost zero after some iterations. In my case, I suspect it might be related to my data. Have you come across any other bugs in the script that you've already addressed and the rest of us should be mindful of?
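
If the data is the suspect, a quick look at the preference-label distribution might help rule out trivial cases before blaming the script. A rough sketch, assuming pickapic-style columns where label_0 is 1.0 / 0.0 / 0.5 for preferred / rejected / tie (substitute your own dataset and column names):

from collections import Counter

from datasets import load_dataset

# Sample the first 1000 examples and count the preference labels.
ds = load_dataset("yuvalkirstain/pickapic_v2", split="train", streaming=True)
counts = Counter(round(ex["label_0"], 1) for ex in ds.take(1000))
print(counts)
# A heavy skew here (e.g. mostly ties) is worth ruling out before looking for
# a bug in the training code.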

@sayakpaul
Member

If there's no problem with the script, I'd prefer to move this to Discussions.


This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.

Please note that issues that do not follow the contributing guidelines are likely to be ignored.

@github-actions github-actions bot added the stale Issues that haven't received updates label Feb 13, 2024