Problems with DPO diffusion training: loss spikes and noisy sample generation #6565
Comments
cc @sayakpaul here
I don't think it's a library-specific issue, though. It could very well be a case of exploring better hyperparameters. I think this is better suited as a discussion point, honestly. @radames did a longer training run and obtained good results.
It's strange because the problem happens even with the default 1e-5 config and other settings.
It could very well be because of the funky low-data regime on which the scripts were tested (a disclaimer about that is already mentioned in the README). I will let @radames comment here a bit, since he conducted a longer training run with a larger dataset.
I'm open here if you guys need more information or more tests.
In what format should the yuvalkirstain/pickapic_v2 dataset be processed for training?
You might have to use WebDataset for this for efficiency.
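To illustrate that suggestion, here is a minimal sketch (not part of the original thread) of streaming preference pairs with the webdataset library. It assumes the yuvalkirstain/pickapic_v2 pairs have already been repacked into .tar shards; the shard pattern, the key names (jpg_0, jpg_1, txt, cls), and the image resolution below are placeholders, not the actual layout of the dataset on the Hub.

```python
import io

import webdataset as wds
from PIL import Image
from torch.utils.data import DataLoader
from torchvision import transforms

# Placeholder shard pattern: assumes the preference pairs were repacked into tar shards.
SHARDS = "pickapic_v2/shard-{000000..000099}.tar"

preprocess = transforms.Compose([
    transforms.Resize((1024, 1024)),  # SDXL training resolution; adjust as needed
    transforms.ToTensor(),
])


def to_sample(item):
    # Without a .decode() stage, every tar member arrives as raw bytes keyed by its extension.
    img_0 = preprocess(Image.open(io.BytesIO(item["jpg_0"])).convert("RGB"))
    img_1 = preprocess(Image.open(io.BytesIO(item["jpg_1"])).convert("RGB"))
    caption = item["txt"].decode("utf-8")
    label = int(item["cls"].decode("utf-8"))  # 0 -> first image preferred, 1 -> second
    return img_0, img_1, caption, label


dataset = (
    wds.WebDataset(SHARDS)
    .shuffle(1000)   # shuffle within a rolling in-memory buffer
    .map(to_sample)
)
loader = DataLoader(dataset, batch_size=8, num_workers=4)
```

Reading sequentially from tar shards keeps memory usage bounded and avoids random access over millions of small files, which is the efficiency argument behind the WebDataset suggestion.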
I'm facing a similar issue, but it's in the opposite direction for me. I've noticed that the loss suddenly decreases to almost zero after some iterations. In my case, I suspect it might be related to my data. Have you come across any other bugs in the script that you've already addressed and the rest of us should be mindful of?
If there's no problem with the script, I'd prefer to move this to Discussions.
This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread. Please note that issues that do not follow the contributing guidelines are likely to be ignored.
Describe the bug
Using the script located in diffusers/examples/research_projects/diffusion_dpo to run DPO on my Stable Diffusion XL model, I've seen a few bugs that vary a bit in the number of steps from one training run to another but are always present. I ran more than 3 tests.
Generally, the training runs normally up to a certain number of steps, with validation looking fine in both the loss graph and the sample images. Then at some point, either before or more commonly after the checkpoint save, a strange bug causes the loss to spike enormously and the samples to become noisy.
I reinstalled all the dependencies between training runs to make sure it's not a dependency bug.
Here are three public WandB runs with more information:
Bugged before the checkpoint save at step 500:
https://wandb.ai/jvkkfa/diffusion-dpo-lora-sdxl/runs/6gt2xll2?workspace=user-jvkkfa
Bugged after step 350 and before 500:
https://wandb.ai/jvkkfa/diffusion-dpo-lora-sdxl/runs/z2gem4to?workspace=user-jvkkfa
Latest run, bugged after step 700:
https://wandb.ai/jvkkfa/diffusion-dpo-lora-sdxl/runs/gyyzv1y6?workspace=user-jvkkfa
Reproduction
Normal DPO run:
!accelerate launch train_diffusion_dpo_sdxl.py \
  --pretrained_model_name_or_path="/home/ubuntu/mydrive/auto/stable-diffusion-webui/models/Stable-diffusion/ModelStep3" \
  --pretrained_vae_model_name_or_path=madebyollin/sdxl-vae-fp16-fix \
  --output_dir="Nebul-Redmond-SDXL" \
  --mixed_precision="fp16" \
  --dataset_name=kashif/pickascore \
  --train_batch_size=8 \
  --gradient_accumulation_steps=1 \
  --gradient_checkpointing \
  --use_8bit_adam \
  --rank=8 \
  --learning_rate=8e-6 \
  --report_to="wandb" \
  --lr_scheduler="constant" \
  --lr_warmup_steps=0 \
  --max_train_steps=2000 \
  --checkpointing_steps=500 \
  --run_validation --validation_steps=50 \
  --seed="0" \
  --report_to="wandb" \
  --push_to_hub
Other runs were also tested with a learning rate of 1e-5.
Logs
No response
System Info
diffusers version: 0.26.0.dev0
Who can help?
@sayakpaul