
Problems with DPO diffusion training: loss spikes and noised sample generation #6565

Closed
artificialguybr opened this issue Jan 13, 2024 · 11 comments

@artificialguybr

Describe the bug

Using the script located in diffusers/examples/research_projects/diffusion_dpo to run DPO on my Stable Diffusion XL model, I've seen a few bugs that vary a bit in the number of steps between one training run and another, but are always present. I ran more than 3 tests.

Generally, training runs normally up to a certain number of steps, with validation looking fine both in the loss graph and in the sample images. Then at some point, either before or, more commonly, after the checkpoint save, a strange bug causes the loss to spike enormously and the samples to become pure noise.

I reinstalled all the dependencies between one training run and the next to make sure it's not a dependency issue.

Here are three public W&B runs with more information:
Bugged before the checkpoint save at step 500:
https://wandb.ai/jvkkfa/diffusion-dpo-lora-sdxl/runs/6gt2xll2?workspace=user-jvkkfa

Bugged after step 350 and before step 500:
https://wandb.ai/jvkkfa/diffusion-dpo-lora-sdxl/runs/z2gem4to?workspace=user-jvkkfa

Latest run, bugged after step 700:
https://wandb.ai/jvkkfa/diffusion-dpo-lora-sdxl/runs/gyyzv1y6?workspace=user-jvkkfa

[Screenshots: loss curves showing the spike and the noised validation samples]
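
One way to narrow down where the spike originates is to log the per-step gradient norm inside the training loop. A minimal sketch of what that could look like follows; the variable names (accelerator, loss, params_to_optimize, optimizer, lr_scheduler, global_step) are assumptions that should roughly match the loop in train_diffusion_dpo_sdxl.py and may need adjusting:

# Minimal sketch: log the total gradient norm of the trainable (LoRA)
# parameters every optimizer step, so a spike in the loss can be matched
# against a spike in the gradients. Names are illustrative.
accelerator.backward(loss)
if accelerator.sync_gradients:
    # In recent accelerate versions clip_grad_norm_ returns the total norm
    # computed before clipping (it may return None on older versions).
    grad_norm = accelerator.clip_grad_norm_(params_to_optimize, max_norm=1.0)
    if grad_norm is not None:
        accelerator.log({"grad_norm": float(grad_norm)}, step=global_step)
optimizer.step()
lr_scheduler.step()
optimizer.zero_grad()

If the gradient norm jumps at the same step as the loss, lowering the learning rate or clipping more aggressively would be the first things to try.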

Reproduction

Normal DPO Run.

!accelerate launch train_diffusion_dpo_sdxl.py \
--pretrained_model_name_or_path="/home/ubuntu/mydrive/auto/stable-diffusion-webui/models/Stable-diffusion/ModelStep3" \
--pretrained_vae_model_name_or_path=madebyollin/sdxl-vae-fp16-fix \
--output_dir="Nebul-Redmond-SDXL" \
--mixed_precision="fp16" \
--dataset_name=kashif/pickascore \
--train_batch_size=8 \
--gradient_accumulation_steps=1 \
--gradient_checkpointing \
--use_8bit_adam \
--rank=8 \
--learning_rate=8e-6 \
--report_to="wandb" \
--lr_scheduler="constant" \
--lr_warmup_steps=0 \
--max_train_steps=2000 \
--checkpointing_steps=500 \
--run_validation --validation_steps=50 \
--seed="0" \
--report_to="wandb" \
--push_to_hub

I also tested other runs with 1e-5.

Logs

No response

System Info

  • diffusers version: 0.26.0.dev0
  • Platform: Linux-5.15.0-91-generic-x86_64-with-glibc2.35
  • Python version: 3.10.12
  • PyTorch version (GPU?): 2.1.2+cu121 (True)
  • Huggingface_hub version: 0.20.2
  • Transformers version: 4.36.2
  • Accelerate version: 0.24.0
  • xFormers version: 0.0.23
  • Using GPU in script?:
  • Using distributed or parallel set-up in script?:

Who can help?

@sayak

@artificialguybr artificialguybr added the bug Something isn't working label Jan 13, 2024
@patrickvonplaten
Contributor

cc @sayakpaul here

@sayakpaul
Member

I don't think it's a library-specific issue, though. It could very well be a case of exploring better hyperparameters. I think this is better suited as a discussion point, honestly.

@radames did a longer training run and obtained good results.
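
For reference, the Diffusion-DPO objective in the script boils down to roughly the following (a paraphrased sketch, not the exact code; the tensor names and the beta_dpo value are assumptions). Because beta_dpo is large, small differences between the model's and the reference model's MSE terms get heavily amplified inside the log-sigmoid, which is why the learning rate and beta interact strongly and are the first hyperparameters worth revisiting:

import torch.nn.functional as F

# Paraphrased sketch of the Diffusion-DPO loss. model_pred / ref_pred hold the
# noise predictions for the (preferred, rejected) image pair stacked along the
# batch dimension; target is the corresponding noise target.
model_losses = F.mse_loss(model_pred.float(), target.float(), reduction="none").mean(dim=[1, 2, 3])
model_losses_w, model_losses_l = model_losses.chunk(2)   # preferred, rejected
ref_losses = F.mse_loss(ref_pred.float(), target.float(), reduction="none").mean(dim=[1, 2, 3])
ref_losses_w, ref_losses_l = ref_losses.chunk(2)

model_diff = model_losses_w - model_losses_l
ref_diff = ref_losses_w - ref_losses_l

beta_dpo = 5000.0  # assumed default; large values amplify tiny MSE differences
inside_term = -0.5 * beta_dpo * (model_diff - ref_diff)
loss = -F.logsigmoid(inside_term).mean()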

@artificialguybr
Author

> I don't think it's a library-specific issue, though. It could very well be a case of exploring better hyperparameters. I think this is better suited as a discussion point, honestly.
>
> @radames did a longer training run and obtained good results.

It's strange because the problem happens even with the default 1e-5 config and other settings.

@sayakpaul
Member

It could very well be because of the funky low-data regime on which the scripts were tested (a disclaimer about that is already mentioned in the README).

I will let @radames comment here a bit, since he conducted a longer training run with a larger dataset.

@artificialguybr
Author

I'm available if you guys need more information or want more tests run.

@radames
Contributor

radames commented Jan 15, 2024

Hi @jvkap, here was my setup. The only big difference is that I was using the larger yuvalkirstain/pickapic_v2 dataset and the stabilityai/stable-diffusion-xl-base-1.0 base model. One note: the loss curve is a bit crazy, but I didn't get that spike. Also, at a larger number of steps the validation samples became full of texture artifacts, which I didn't like at all, so I ended up choosing the 5k-step checkpoint as the best result.

Are you using a fine-tuned model as your base model?

train_diffusion_dpo_sdxl.py \
--pretrained_model_name_or_path=stabilityai/stable-diffusion-xl-base-1.0 \
--pretrained_vae_model_name_or_path=madebyollin/sdxl-vae-fp16-fix \
--output_dir=diffusion-sdxl-dpo --mixed_precision=fp16 \
--dataset_name=yuvalkirstain/pickapic_v2 \
--dataset_split_name=train \
--train_batch_size=8 \
--gradient_accumulation_steps=2 \
--gradient_checkpointing \
--use_8bit_adam \
--rank=8 \
--learning_rate=1e-5 \
--report_to=wandb \
--lr_scheduler=constant \
--lr_warmup_steps=0 \
--max_train_steps=10000 \
--checkpointing_steps=500 \
--run_validation \
--validation_steps=50 \
--seed=0 \
--report_to=wandb \
--push_to_hub

[Screenshots: W&B loss curve and validation samples]

@Youngon

Youngon commented Jan 17, 2024

> yuvalkirstain/pickapic_v2

What format should the yuvalkirstain/pickapic_v2 dataset be processed into for training?

@sayakpaul
Member

You might have to use WebDataset here, for efficiency.
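
If converting to WebDataset shards isn't practical, the Hub dataset can also be streamed so that training starts without downloading every shard first. A rough sketch, assuming the pickapic-style columns (jpg_0, jpg_1, label_0, caption) that the DPO script reads; double-check them against the dataset card:

from io import BytesIO

from datasets import load_dataset
from PIL import Image

# Stream yuvalkirstain/pickapic_v2 instead of materializing it on disk.
ds = load_dataset("yuvalkirstain/pickapic_v2", split="train", streaming=True)

for example in ds.take(2):
    # jpg_0 / jpg_1 are assumed to be raw JPEG bytes, and label_0 to encode
    # which image was preferred (with 0.5 marking a tie).
    img_0 = Image.open(BytesIO(example["jpg_0"]))
    img_1 = Image.open(BytesIO(example["jpg_1"]))
    print(example["caption"], example["label_0"], img_0.size, img_1.size)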

@asrlhhh

asrlhhh commented Jan 18, 2024

I'm facing a similar issue, but it's in the opposite direction for me. I've noticed that the loss suddenly decreases to almost zero after some iterations. In my case, I suspect it might be related to my data. Have you come across any other bugs in the script that you've already addressed and the rest of us should be mindful of?
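
If the data is the suspect, a quick look at the preference-label distribution might help rule out trivial cases before blaming the script. A rough sketch, assuming pickapic-style columns where label_0 is 1.0 / 0.0 / 0.5 for preferred / rejected / tie (substitute your own dataset and column names):

from collections import Counter

from datasets import load_dataset

# Sample the first 1000 examples and count the preference labels.
ds = load_dataset("yuvalkirstain/pickapic_v2", split="train", streaming=True)
counts = Counter(round(ex["label_0"], 1) for ex in ds.take(1000))
print(counts)
# A heavy skew here (e.g. mostly ties) is worth ruling out before looking for
# a bug in the training code.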

@sayakpaul
Member

If there's no problem with the script, I'd prefer to move this to Discussions.


This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.

Please note that issues that do not follow the contributing guidelines are likely to be ignored.

@github-actions github-actions bot added the stale Issues that haven't received updates label Feb 13, 2024