sdxl-trainig: fixed ci, changed gated dataset, fixes for non-square datasets #1038

Merged
merged 7 commits into from
Jun 14, 2024

Conversation

imangohari1
Contributor

What does this PR do?

Mirrors the changes here huggingface/diffusers@8edaf3b

@regisss @dsocek @libinta
Hi team,
Opening this PR to fix some SDXL training issues related to the gated dataset. Tests are complete, and their output is provided below.

Fixes # (issue)
gated dataset for sdxl training

Before submitting

  • This PR fixes a typo or improves the docs (you can dismiss the other checks if that's the case).
  • Did you make sure to update the documentation with your changes?
  • Did you write any new necessary tests?

Tests

1x HPU

------------------------------------------------------------------------------
{'attention_type', 'reverse_transformer_layers_per_block', 'dropout'} was not found in config. Values will be initialized to default values.
Repo card metadata block was not found. Setting CardData to empty.
06/04/2024 15:04:42 - WARNING - huggingface_hub.repocard - Repo card metadata block was not found. Setting CardData to empty.
Map: 100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 1221/1221 [01:03<00:00, 19.16 examples/s]
06/04/2024 15:05:49 - INFO - __main__ - ***** Running training *****
06/04/2024 15:05:49 - INFO - __main__ -   Num examples = 1221
06/04/2024 15:05:49 - INFO - __main__ -   Num Epochs = 33
06/04/2024 15:05:49 - INFO - __main__ -   Instantaneous batch size per device = 16
06/04/2024 15:05:49 - INFO - __main__ -   Total train batch size (w. parallel, distributed & accumulation) = 16
06/04/2024 15:05:49 - INFO - __main__ -   Gradient Accumulation steps = 1
06/04/2024 15:05:49 - INFO - __main__ -   Total optimization steps = 2500
|██████████| 2500/2500 [37:34<00:00,  2.11it/s, lr=1e-5, mem_used=91.7, step_loss=0.0766]
06/04/2024 15:43:23 - INFO - accelerate.accelerator - Saving current state to sdxl-pokemon-model/checkpoint-2500
Configuration saved in sdxl-pokemon-model/checkpoint-2500/unet/config.json
Model weights saved in sdxl-pokemon-model/checkpoint-2500/unet/diffusion_pytorch_model.safetensors

8x HPU

Map: 100%|██████████| 1221/1221 [00:02<00:00, 470.95 examples/s]
Map: 100%|██████████| 1221/1221 [00:02<00:00, 467.38 examples/s]
Map: 100%|██████████| 1221/1221 [00:02<00:00, 462.32 examples/s]
MediaPipe device GAUDI2 device_type GAUDI2 device_id 0 pipe_name SDXLMediaPipe:0
MediaPipe device GAUDI2 device_type GAUDI2 device_id 0 pipe_name SDXLMediaPipe:0
MediaPipe device GAUDI2 device_type GAUDI2 device_id 0 pipe_name SDXLMediaPipe:0
MediaPipe device GAUDI2 device_type GAUDI2 device_id 0 pipe_name SDXLMediaPipe:0
MediaPipe device GAUDI2 device_type GAUDI2 device_id 0 pipe_name SDXLMediaPipe:0
MediaPipe device GAUDI2 device_type GAUDI2 device_id 0 pipe_name SDXLMediaPipe:0
MediaPipe device GAUDI2 device_type GAUDI2 device_id 0 pipe_name SDXLMediaPipe:0
MediaPipe device GAUDI2 device_type GAUDI2 device_id 0 pipe_name SDXLMediaPipe:0
06/04/2024 17:12:23 - INFO - media_pipe_imgdir - Finding largest file ...
06/04/2024 17:12:23 - INFO - __main__ - ***** Running training *****
06/04/2024 17:12:23 - INFO - __main__ -   Num examples = 1221
06/04/2024 17:12:23 - INFO - __main__ -   Num Epochs = 38
06/04/2024 17:12:23 - INFO - __main__ -   Instantaneous batch size per device = 16
06/04/2024 17:12:23 - INFO - __main__ -   Total train batch size (w. parallel, distributed & accumulation) = 128
06/04/2024 17:12:23 - INFO - __main__ -   Gradient Accumulation steps = 1
06/04/2024 17:12:23 - INFO - __main__ -   Total optimization steps = 336
Steps:   0%|          | 0/336 [00:00<?, ?it/s]
Warning: Decoder updated User configured Interpolation from Bilinear to Bicubic
06/04/2024 17:12:24 - INFO - media_pipe_imgdir - Finding largest file ...
06/04/2024 17:12:24 - INFO - media_pipe_imgdir - Finding largest file ...
Warning: Decoder updated User configured Interpolation from Bilinear to Bicubic
Warning: Decoder updated User configured Interpolation from Bilinear to Bicubic
06/04/2024 17:12:25 - INFO - media_pipe_imgdir - Finding largest file ...
Warning: Decoder updated User configured Interpolation from Bilinear to Bicubic
06/04/2024 17:12:26 - INFO - media_pipe_imgdir - Finding largest file ...
Warning: Decoder updated User configured Interpolation from Bilinear to Bicubic
06/04/2024 17:12:29 - INFO - media_pipe_imgdir - Finding largest file ...
Warning: Decoder updated User configured Interpolation from Bilinear to Bicubic
06/04/2024 17:12:31 - INFO - media_pipe_imgdir - Finding largest file ...
Warning: Decoder updated User configured Interpolation from Bilinear to Bicubic
06/04/2024 17:12:37 - INFO - media_pipe_imgdir - Finding largest file ...
Warning: Decoder updated User configured Interpolation from Bilinear to Bicubic
Steps: 100%|██████████| 336/336 [22:55<00:00,  1.92it/s, lr=1e-5, mem_used=86.2, step_loss=0.101]
06/04/2024 17:35:18 - INFO - accelerate.accelerator - Saving current state to sdxl_model_output/checkpoint-336
Configuration saved in sdxl_model_output/checkpoint-336/unet/config.json
Model weights saved in sdxl_model_output/checkpoint-336/unet/diffusion_pytorch_model.safetensors
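
The batch-size and epoch figures in the 1x HPU log above can be reproduced with a little arithmetic. The sketch below is an illustration only: the helper names are hypothetical, and the epoch formula (ceiling division of steps by batches per epoch) is an assumption checked here against the single-device log only. The 8x HPU run reports 38 epochs, which this simple formula does not try to explain.

```python
import math


def total_train_batch_size(per_device_batch, num_devices, grad_accum_steps=1):
    """Effective batch size across devices and accumulation steps."""
    return per_device_batch * num_devices * grad_accum_steps


def num_epochs_single_device(max_train_steps, num_examples, per_device_batch,
                             grad_accum_steps=1):
    """Epochs needed to reach max_train_steps (assumed ceiling-division rule)."""
    batches_per_epoch = math.ceil(num_examples / per_device_batch)
    updates_per_epoch = math.ceil(batches_per_epoch / grad_accum_steps)
    return math.ceil(max_train_steps / updates_per_epoch)


# Values taken from the logs above.
print(total_train_batch_size(16, 1))               # 1x HPU run
print(total_train_batch_size(16, 8))               # 8x HPU run
print(num_epochs_single_device(2500, 1221, 16))    # 1x HPU run
```

With the logged values this gives 16 and 128 for the total train batch size and 33 epochs for the 1x HPU run, in line with the "Running training" summaries.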

CI

You are using a model of type clip_text_model to instantiate a model of type . This is not supported for all configurations of models and can yield errors.
You are using a model of type clip_text_model to instantiate a model of type . This is not supported for all configurations of models and can yield errors.
{'thresholding', 'dynamic_thresholding_ratio', 'clip_sample_range', 'variance_type', 'rescale_betas_zero_snr'} was not found in config. Values will be initialized to default values.
============================= HABANA PT BRIDGE CONFIGURATION ===========================
 PT_HPU_LAZY_MODE = 1
 PT_RECIPE_CACHE_PATH =
 PT_CACHE_FOLDER_DELETE = 0
 PT_HPU_RECIPE_CACHE_CONFIG =
 PT_HPU_MAX_COMPOUND_OP_SIZE = 9223372036854775807
 PT_HPU_LAZY_ACC_PAR_MODE = 1
 PT_HPU_ENABLE_REFINE_DYNAMIC_SHAPES = 0
---------------------------: System Configuration :---------------------------
Num CPU Cores : 160
CPU RAM       : 1056375232 KB
------------------------------------------------------------------------------
{'attention_type', 'reverse_transformer_layers_per_block', 'dropout'} was not found in config. Values will be initialized to default values.
Repo card metadata block was not found. Setting CardData to empty.
06/04/2024 18:05:04 - WARNING - huggingface_hub.repocard - Repo card metadata block was not found. Setting CardData to empty.
Map: 100%|██████████| 1221/1221 [00:56<00:00, 21.59 examples/s]
06/04/2024 18:06:03 - INFO - __main__ - ***** Running training *****
06/04/2024 18:06:03 - INFO - __main__ -   Num examples = 1221
06/04/2024 18:06:03 - INFO - __main__ -   Num Epochs = 1
06/04/2024 18:06:03 - INFO - __main__ -   Instantaneous batch size per device = 16
06/04/2024 18:06:03 - INFO - __main__ -   Total train batch size (w. parallel, distributed & accumulation) = 16
06/04/2024 18:06:03 - INFO - __main__ -   Gradient Accumulation steps = 1
06/04/2024 18:06:03 - INFO - __main__ -   Total optimization steps = 2
Steps: 100%|██████████| 2/2 [06:20<00:00, 192.97s/it, lr=1e-5, mem_used=25, step_loss=0.342]
06/04/2024 18:12:23 - INFO - accelerate.accelerator - Saving current state to /tmp/tmp2m2qr9t8/checkpoint-2
Configuration saved in /tmp/tmp2m2qr9t8/checkpoint-2/unet/config.json
Model weights saved in /tmp/tmp2m2qr9t8/checkpoint-2/unet/diffusion_pytorch_model.safetensors
06/04/2024 18:17:37 - INFO - accelerate.checkpointing - Optimizer state saved in /tmp/tmp2m2qr9t8/checkpoint-2/optimizer.bin
06/04/2024 18:17:37 - INFO - accelerate.checkpointing - Scheduler state saved in /tmp/tmp2m2qr9t8/checkpoint-2/scheduler.bin
06/04/2024 18:17:37 - INFO - accelerate.checkpointing - Sampler state for dataloader 0 saved in /tmp/tmp2m2qr9t8/checkpoint-2/sampler.bin
06/04/2024 18:17:37 - INFO - accelerate.checkpointing - Random states saved in /tmp/tmp2m2qr9t8/checkpoint-2/random_states_0.pkl
06/04/2024 18:17:37 - INFO - __main__ - Saved state to /tmp/tmp2m2qr9t8/checkpoint-2
Steps: 100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 2/2 [11:34<00:00, 347.22s/it, lr=1e-5, mem_used=25.1, step_loss=0.387]
PASSED

========================================================================================================== warnings summary ===========================================================================================================
../../usr/local/lib/python3.10/dist-packages/diffusers/utils/outputs.py:63
../../usr/local/lib/python3.10/dist-packages/diffusers/utils/outputs.py:63
  /usr/local/lib/python3.10/dist-packages/diffusers/utils/outputs.py:63: UserWarning: torch.utils._pytree._register_pytree_node is deprecated. Please use torch.utils._pytree.register_pytree_node instead.
    torch.utils._pytree._register_pytree_node(

../../usr/local/lib/python3.10/dist-packages/lightning_utilities/core/imports.py:14
  /usr/local/lib/python3.10/dist-packages/lightning_utilities/core/imports.py:14: DeprecationWarning: pkg_resources is deprecated as an API. See https://setuptools.pypa.io/en/latest/pkg_resources.html
    import pkg_resources

../../usr/local/lib/python3.10/dist-packages/pkg_resources/__init__.py:2825
  /usr/local/lib/python3.10/dist-packages/pkg_resources/__init__.py:2825: DeprecationWarning: Deprecated call to `pkg_resources.declare_namespace('google')`.
  Implementing implicit namespace packages (as specified in PEP 420) is preferred to `pkg_resources.declare_namespace`. See https://setuptools.pypa.io/en/latest/references/keywords.html#keyword-namespace-packages
    declare_namespace(pkg)

-- Docs: https://docs.pytest.org/en/stable/how-to/capture-warnings.html
====================================================================================== 2 passed, 60 deselected, 4 warnings in 777.32s (0:12:57) =======================================================================================

@imangohari1 imangohari1 requested a review from regisss as a code owner June 4, 2024 19:30
Contributor

@dsocek dsocek left a comment

LGTM. Diffusers has a similar update as well. The switch to the fp16-fix VAE is also needed.

--gaudi_config_name Habana/stable-diffusion \
--throughput_warmup_steps 3 \
--dataloader_num_workers 8 \
--bf16 \
--use_hpu_graphs_for_training \
--use_hpu_graphs_for_inference \
--validation_prompt="a robotic cat with wings" \
--validation_prompt="a cute Sundar Pichai creature" \
Contributor

Hmm, should we use a more neutral prompt here, for example "a cute dragon creature"? Not sure whether Google would get mad about this one.

Contributor Author

Fixed: 36ecac0

@imangohari1 imangohari1 changed the title from "sdxl-trainig: fixed ci and changed gated dataset" to "sdxl-trainig: fixed ci, changed gated dataset, fixes for non-square datasets" on Jun 5, 2024
@imangohari1
Contributor Author

imangohari1 commented Jun 5, 2024

@regisss @ssarkar2
Re your previous discussion here: #787 (comment)
I changed the torch dataloader to the diffusers one here: https://github.com/huggingface/diffusers/blob/main/examples/text_to_image/train_text_to_image_sdxl.py#L867-L894

This change seems to work fine for both square data (the dataset in README.md) and non-square data such as linoyts/Tuxemon:

python train_text_to_image_sdxl.py \
  --pretrained_model_name_or_path stabilityai/stable-diffusion-xl-base-1.0 \
  --pretrained_vae_model_name_or_path madebyollin/sdxl-vae-fp16-fix \
  --dataset_name linoyts/Tuxemon \
  --resolution 256 \
  --crop_resolution 256 \
  --center_crop \
  --random_flip \
  --proportion_empty_prompts=0.2 \
  --train_batch_size 16 \
  --max_train_steps 2500 \
  --learning_rate 1e-05 \
  --max_grad_norm 1 \
  --lr_scheduler constant \
  --lr_warmup_steps 0 \
  --output_dir sdxl_model_output \
  --gaudi_config_name Habana/stable-diffusion \
  --throughput_warmup_steps 3 \
  --dataloader_num_workers 8 \
  --bf16 \
  --use_hpu_graphs_for_training \
  --use_hpu_graphs_for_inference \
  --validation_prompt="a cute Sundar Pichai creature" \
  --validation_epochs 48 \
  --checkpointing_steps 2500 \
  --logging_step 10 \
  --adjust_throughput \
  --caption_column prompt

.
.
[WARNING|pipeline_stable_diffusion_xl.py:542] 2024-06-04 20:56:58,408 >> The first two iterations are slower so it is recommended to feed more batches.
[INFO|pipeline_stable_diffusion_xl.py:825] 2024-06-04 20:57:01,461 >> Speed metrics: {'generation_runtime': 3.0052, 'generation_samples_per_second': 0.35, 'generation_steps_per_second': 0.35}
[INFO|pipeline_stable_diffusion_xl.py:537] 2024-06-04 20:57:01,588 >> 1 prompt(s) received, 1 generation(s) per prompt, 1 sample(s) per batch, 1 total batch(es).
[WARNING|pipeline_stable_diffusion_xl.py:542] 2024-06-04 20:57:01,588 >> The first two iterations are slower so it is recommended to feed more batches.
[INFO|pipeline_stable_diffusion_xl.py:825] 2024-06-04 20:57:04,276 >> Speed metrics: {'generation_runtime': 2.642, 'generation_samples_per_second': 0.381, 'generation_steps_per_second': 0.381}
[INFO|pipeline_stable_diffusion_xl.py:537] 2024-06-04 20:57:04,398 >> 1 prompt(s) received, 1 generation(s) per prompt, 1 sample(s) per batch, 1 total batch(es).
[WARNING|pipeline_stable_diffusion_xl.py:542] 2024-06-04 20:57:04,398 >> The first two iterations are slower so it is recommended to feed more batches.
[INFO|pipeline_stable_diffusion_xl.py:825] 2024-06-04 20:57:07,083 >> Speed metrics: {'generation_runtime': 2.6403, 'generation_samples_per_second': 0.381, 'generation_steps_per_second': 0.381}
[INFO|pipeline_stable_diffusion_xl.py:537] 2024-06-04 20:57:07,205 >> 1 prompt(s) received, 1 generation(s) per prompt, 1 sample(s) per batch, 1 total batch(es).
[WARNING|pipeline_stable_diffusion_xl.py:542] 2024-06-04 20:57:07,205 >> The first two iterations are slower so it is recommended to feed more batches.
[INFO|pipeline_stable_diffusion_xl.py:825] 2024-06-04 20:57:09,888 >> Speed metrics: {'generation_runtime': 2.6384, 'generation_samples_per_second': 0.381, 'generation_steps_per_second': 0.381}
Steps: 100%|██████████| 2500/2500 [39:21<00:00,  1.63it/s, lr=1e-5, mem_used=61.7, step_loss=0.144]
06/04/2024 20:58:47 - INFO - accelerate.accelerator - Saving current state to sdxl_model_output/checkpoint-2500
Configuration saved in sdxl_model_output/checkpoint-2500/unet/config.json
Model weights saved in sdxl_model_output/checkpoint-2500/unet/diffusion_pytorch_model.safetensors


[INFO|configuration_utils.py:358] 2024-06-04 21:00:27,103 >> GaudiConfig {
  "autocast_bf16_ops": null,
  "autocast_fp32_ops": null,
  "optimum_version": "1.20.0",
  "transformers_version": "4.40.2",
  "use_dynamic_shapes": false,
  "use_fused_adam": true,
  "use_fused_clip_norm": true,
  "use_torch_autocast": true
}

[WARNING|pipeline_utils.py:149] 2024-06-04 21:00:27,103 >> `use_torch_autocast` is True in the given Gaudi configuration but `torch_dtype=torch.bfloat16` was given. Disabling mixed precision and continuing in bf16 only.
Configuration saved in sdxl_model_output/vae/config.json
Model weights saved in sdxl_model_output/vae/diffusion_pytorch_model.safetensors
Configuration saved in sdxl_model_output/unet/config.json
Model weights saved in sdxl_model_output/unet/diffusion_pytorch_model.safetensors
Configuration saved in sdxl_model_output/scheduler/scheduler_config.json
Configuration saved in sdxl_model_output/model_index.json
[INFO|configuration_utils.py:113] 2024-06-04 21:00:35,647 >> Configuration saved in sdxl_model_output/gaudi_config.json
[INFO|pipeline_stable_diffusion_xl.py:537] 2024-06-04 21:00:35,804 >> 1 prompt(s) received, 1 generation(s) per prompt, 1 sample(s) per batch, 1 total batch(es).
[WARNING|pipeline_stable_diffusion_xl.py:542] 2024-06-04 21:00:35,804 >> The first two iterations are slower so it is recommended to feed more batches.
100%|██████████| 1/1 [01:14<00:00, 74.22s/it]
[INFO|pipeline_stable_diffusion_xl.py:825] 2024-06-04 21:01:50,062 >> Speed metrics: {'generation_runtime': 74.2221, 'generation_samples_per_second': 0.143, 'generation_steps_per_second': 0.143}

I've confirmed the same test above with BS=1 and BS=2 as well.
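
For readers unfamiliar with why non-square images need special handling, here is a minimal sketch of the resize-then-center-crop bookkeeping that this kind of preprocessing performs. It is not the code from either repository: `center_crop_box` is a hypothetical helper, and it assumes the shorter side is first resized to the target resolution and that the crop offset is recorded as `(top, left)`.

```python
def center_crop_box(width, height, resolution):
    """Return the resized (width, height) and the (top, left) crop offset.

    Assumes the shorter side is scaled to `resolution` and a square
    `resolution` x `resolution` window is then cut from the center.
    """
    scale = resolution / min(width, height)
    new_w = max(resolution, round(width * scale))
    new_h = max(resolution, round(height * scale))
    top = max(0, int(round((new_h - resolution) / 2.0)))
    left = max(0, int(round((new_w - resolution) / 2.0)))
    return (new_w, new_h), (top, left)


# A square image needs no offset; a landscape image is cropped horizontally.
print(center_crop_box(512, 512, 256))
print(center_crop_box(768, 512, 256))
```

For a square input the offset is `(0, 0)`, while a 768x512 input resized to 384x256 is cropped starting at column 64. SDXL takes the original size and crop offset as conditioning inputs, so they have to be tracked per image once the data is no longer square.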

Another test was run with poloclub/diffusiondb (non-square) and BS=2:

python train_text_to_image_sdxl.py \
  --pretrained_model_name_or_path stabilityai/stable-diffusion-xl-base-1.0 \
  --pretrained_vae_model_name_or_path madebyollin/sdxl-vae-fp16-fix \
  --dataset_name poloclub/diffusiondb \
  --resolution 512 \
  --crop_resolution 512 \
  --center_crop \
  --random_flip \
  --proportion_empty_prompts=0.2 \
  --train_batch_size 2 \
  --max_train_steps 1000 \
  --learning_rate 1e-05 \
  --max_grad_norm 1 \
  --lr_scheduler constant \
  --lr_warmup_steps 0 \
  --output_dir sdxl_model_output \
  --gaudi_config_name Habana/stable-diffusion \
  --throughput_warmup_steps 3 \
  --dataloader_num_workers 8 \
  --bf16 \
  --use_hpu_graphs_for_training \
  --use_hpu_graphs_for_inference \
  --validation_prompt="a cute Sundar Pichai creature" \
  --validation_epochs 48 \
  --checkpointing_steps 1000 \
  --logging_step 10 \
  --adjust_throughput \
  --caption_column prompt
.
.
You are using a model of type clip_text_model to instantiate a model of type . This is not supported for all configurations of models and can yield errors.
You are using a model of type clip_text_model to instantiate a model of type . This is not supported for all configurations of models and can yield errors.
{'thresholding', 'variance_type', 'dynamic_thresholding_ratio', 'rescale_betas_zero_snr', 'clip_sample_range'} was not found in config. Values will be initialized to default values.
============================= HABANA PT BRIDGE CONFIGURATION ===========================
 PT_HPU_LAZY_MODE = 1
 PT_RECIPE_CACHE_PATH =
 PT_CACHE_FOLDER_DELETE = 0
 PT_HPU_RECIPE_CACHE_CONFIG =
 PT_HPU_MAX_COMPOUND_OP_SIZE = 9223372036854775807
 PT_HPU_LAZY_ACC_PAR_MODE = 1
 PT_HPU_ENABLE_REFINE_DYNAMIC_SHAPES = 0
---------------------------: System Configuration :---------------------------
Num CPU Cores : 160
CPU RAM       : 1056375284 KB
------------------------------------------------------------------------------
{'reverse_transformer_layers_per_block', 'dropout', 'attention_type'} was not found in config. Values will be initialized to default values.
/usr/local/lib/python3.10/dist-packages/datasets/load.py:1491: FutureWarning: The repository for poloclub/diffusiondb contains custom code which must be executed to correctly load the dataset. You can inspect the repository content at https://hf.co/datasets/poloclub/diffusiondb
You can avoid this message in future by passing the argument `trust_remote_code=True`.
Passing `trust_remote_code=True` will be mandatory to load this dataset from the next major release of `datasets`.
  warnings.warn(
Map: 100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 1000/1000 [00:29<00:00, 33.35 examples/s]
06/05/2024 02:19:08 - INFO - __main__ - ***** Running training *****
06/05/2024 02:19:08 - INFO - __main__ -   Num examples = 1000
06/05/2024 02:19:08 - INFO - __main__ -   Num Epochs = 2
06/05/2024 02:19:08 - INFO - __main__ -   Instantaneous batch size per device = 2
06/05/2024 02:19:08 - INFO - __main__ -   Total train batch size (w. parallel, distributed & accumulation) = 2
06/05/2024 02:19:08 - INFO - __main__ -   Gradient Accumulation steps = 1
06/05/2024 02:19:08 - INFO - __main__ -   Total optimization steps = 1000
Steps: 100%|██████████| 1000/1000 [17:58<00:00,  3.62it/s, lr=1e-5, mem_used=31.5, step_loss=0.341]
06/05/2024 02:37:06 - INFO - accelerate.accelerator - Saving current state to sdxl_model_output/checkpoint-1000
Configuration saved in sdxl_model_output/checkpoint-1000/unet/config.json
Model weights saved in sdxl_model_output/checkpoint-1000/unet/diffusion_pytorch_model.safetensors
06/05/2024 02:38:45 - INFO - accelerate.checkpointing - Optimizer state saved in sdxl_model_output/checkpoint-1000/optimizer.bin
06/05/2024 02:38:45 - INFO - accelerate.checkpointing - Scheduler state saved in sdxl_model_output/checkpoint-1000/scheduler.bin
06/05/2024 02:38:45 - INFO - accelerate.checkpointing - Sampler state for dataloader 0 saved in sdxl_model_output/checkpoint-1000/sampler.bin
06/05/2024 02:38:45 - INFO - accelerate.checkpointing - Random states saved in sdxl_model_output/checkpoint-1000/random_states_0.pkl
06/05/2024 02:38:45 - INFO - __main__ - Saved state to sdxl_model_output/checkpoint-1000
Steps: 100%|██████████| 1000/1000 [19:37<00:00,  3.62it/s, lr=1e-5, mem_used=31.5, step_loss=0.0351]
06/05/2024 02:38:45 - INFO - __main__ - Throughput = 9.616218221859341 samples/s
06/05/2024 02:38:45 - INFO - __main__ - Train runtime = 207.35802308097482 seconds
06/05/2024 02:38:45 - INFO - __main__ - Total Train runtime = 1177.2033570841886 seconds
{'image_encoder', 'gaudi_config', 'bf16_full_eval', 'feature_extractor', 'use_habana', 'use_hpu_graphs'} was not found in config. Values will be initialized to default values.
Loaded tokenizer_2 as CLIPTokenizer from `tokenizer_2` subfolder of stabilityai/stable-diffusion-xl-base-1.0.
Loaded tokenizer as CLIPTokenizer from `tokenizer` subfolder of stabilityai/stable-diffusion-xl-base-1.0.
Loaded text_encoder as CLIPTextModel from `text_encoder` subfolder of stabilityai/stable-diffusion-xl-base-1.0.
Loaded text_encoder_2 as CLIPTextModelWithProjection from `text_encoder_2` subfolder of stabilityai/stable-diffusion-xl-base-1.0.
Loading pipeline components...: 100%|██████████| 7/7 [00:00<00:00,  9.74it/s]
[INFO|pipeline_utils.py:130] 2024-06-05 02:38:47,002 >> Enabled HPU graphs.
[INFO|configuration_utils.py:305] 2024-06-05 02:38:47,094 >> loading configuration file gaudi_config.json from cache at /root/.cache/huggingface/hub/models--Habana--stable-diffusion/snapshots/60ee357057ec90d2b183de22d0327ddd5d5a6db9/gaudi_config.json
[INFO|configuration_utils.py:358] 2024-06-05 02:38:47,094 >> GaudiConfig {
  "autocast_bf16_ops": null,
  "autocast_fp32_ops": null,
  "optimum_version": "1.20.0",
  "transformers_version": "4.40.2",
  "use_dynamic_shapes": false,
  "use_fused_adam": true,
  "use_fused_clip_norm": true,
  "use_torch_autocast": true
}

[WARNING|pipeline_utils.py:149] 2024-06-05 02:38:47,094 >> `use_torch_autocast` is True in the given Gaudi configuration but `torch_dtype=torch.bfloat16` was given. Disabling mixed precision and continuing in bf16 only.
Configuration saved in sdxl_model_output/vae/config.json
Model weights saved in sdxl_model_output/vae/diffusion_pytorch_model.safetensors
Configuration saved in sdxl_model_output/unet/config.json
Model weights saved in sdxl_model_output/unet/diffusion_pytorch_model.safetensors
Configuration saved in sdxl_model_output/scheduler/scheduler_config.json
Configuration saved in sdxl_model_output/model_index.json
[INFO|configuration_utils.py:113] 2024-06-05 02:38:55,719 >> Configuration saved in sdxl_model_output/gaudi_config.json
[INFO|pipeline_stable_diffusion_xl.py:537] 2024-06-05 02:38:55,905 >> 1 prompt(s) received, 1 generation(s) per prompt, 1 sample(s) per batch, 1 total batch(es).
[WARNING|pipeline_stable_diffusion_xl.py:542] 2024-06-05 02:38:55,905 >> The first two iterations are slower so it is recommended to feed more batches.
100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 1/1 [01:18<00:00, 78.79s/it]
[INFO|pipeline_stable_diffusion_xl.py:825] 2024-06-05 02:40:14,729 >> Speed metrics: {'generation_runtime': 78.7885, 'generation_samples_per_second': 0.139, 'generation_steps_per_second': 0.139}
[INFO|pipeline_stable_diffusion_xl.py:537] 2024-06-05 02:40:26,698 >> 1 prompt(s) received, 1 generation(s) per prompt, 1 sample(s) per batch, 1 total batch(es).
[WARNING|pipeline_stable_diffusion_xl.py:542] 2024-06-05 02:40:26,698 >> The first two iterations are slower so it is recommended to feed more batches.
100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 1/1 [00:06<00:00,  6.18s/it]
[INFO|pipeline_stable_diffusion_xl.py:825] 2024-06-05 02:40:32,921 >> Speed metrics: {'generation_runtime': 6.1851, 'generation_samples_per_second': 0.249, 'generation_steps_per_second': 0.249}
[INFO|pipeline_stable_diffusion_xl.py:537] 2024-06-05 02:40:33,069 >> 1 prompt(s) received, 1 generation(s) per prompt, 1 sample(s) per batch, 1 total batch(es).
[WARNING|pipeline_stable_diffusion_xl.py:542] 2024-06-05 02:40:33,069 >> The first two iterations are slower so it is recommended to feed more batches.
100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 1/1 [00:02<00:00,  2.63s/it]
[INFO|pipeline_stable_diffusion_xl.py:825] 2024-06-05 02:40:35,746 >> Speed metrics: {'generation_runtime': 2.6331, 'generation_samples_per_second': 0.383, 'generation_steps_per_second': 0.383}
[INFO|pipeline_stable_diffusion_xl.py:537] 2024-06-05 02:40:35,847 >> 1 prompt(s) received, 1 generation(s) per prompt, 1 sample(s) per batch, 1 total batch(es).
[WARNING|pipeline_stable_diffusion_xl.py:542] 2024-06-05 02:40:35,848 >> The first two iterations are slower so it is recommended to feed more batches.
100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 1/1 [00:02<00:00,  2.62s/it]
[INFO|pipeline_stable_diffusion_xl.py:825] 2024-06-05 02:40:38,507 >> Speed metrics: {'generation_runtime': 2.6246, 'generation_samples_per_second': 0.383, 'generation_steps_per_second': 0.383}
06/05/2024 02:40:38 - INFO - __main__ - Saving images in /root/optimum-habana/examples/stable-diffusion/training/stable-diffusion-generated-images...
Steps: 100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 1000/1000 [21:34<00:00,  1.29s/it, lr=1e-5, mem_used=31.5, step_loss=0.0351]

-> The above test crashes with an OOM error when run with BS=16.
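
As a sanity check on the reported throughput in the poloclub/diffusiondb run, the logged numbers are consistent with counting only the samples processed after the warmup steps. The sketch below is inferred from the log values alone and is an assumption about how the figure is derived, not a reading of the training script; `throughput` is a hypothetical helper.

```python
def throughput(max_train_steps, warmup_steps, total_batch_size, train_runtime_s):
    """Samples per second, counting only steps after the throughput warmup."""
    measured_samples = (max_train_steps - warmup_steps) * total_batch_size
    return measured_samples / train_runtime_s


# Values from the command and log above:
# --max_train_steps 1000, --throughput_warmup_steps 3, total batch size 2,
# "Train runtime = 207.35802308097482 seconds".
value = throughput(1000, 3, 2, 207.35802308097482)
print(round(value, 4))
```

This evaluates to about 9.6162 samples/s, matching the logged "Throughput = 9.616218221859341 samples/s". The much larger "Total Train runtime" (about 1177 s) is not used in this figure.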

@imangohari1
Contributor Author

@regisss I would appreciate a review of this when you have a chance. Thank you.

@regisss
Collaborator

regisss commented Jun 11, 2024

The Gaudi1 command in the README fails with

terminate called after throwing an instance of 'c10::Error'
  what():  [Rank:0] FATAL ERROR :: MODULE:PT_DEVICE GRAPH:: Capture must end on the same stream it began on.
[Rank:0] Habana exception raised from mark_step at HPUGraph.cpp:138

on Synapse 1.16.
And compilation times are quite long on Gaudi2 with Synapse 1.15.
@imangohari1 Do you also see these behaviors?

@imangohari1
Contributor Author

The Gaudi1 command in the README fails with

terminate called after throwing an instance of 'c10::Error'
  what():  [Rank:0] FATAL ERROR :: MODULE:PT_DEVICE GRAPH:: Capture must end on the same stream it began on.
[Rank:0] Habana exception raised from mark_step at HPUGraph.cpp:138

on Synapse 1.16. And compilation times are quite long on Gaudi2 with Synapse 1.15. @imangohari1 Do you also see these behaviors?

@regisss
wrt G1: I was able to reproduce the error you shared on AWS dl1. Our team originally tested this on an internal system with a different configuration than dl1. When PT_HPU_MAX_COMPOUND_OP_SIZE=5 is dropped, the command leads to an OOM error on dl1.
I was able to get this running on dl1 by dropping the resolution from 512 to 256 (below). We can update this instruction if it is expected to work on dl1.

python train_text_to_image_sdxl.py \
  --pretrained_model_name_or_path stabilityai/stable-diffusion-xl-base-1.0 \
  --pretrained_vae_model_name_or_path madebyollin/sdxl-vae-fp16-fix \
  --dataset_name lambdalabs/naruto-blip-captions \
  --resolution 256 \
  --center_crop \
  --random_flip \
  --proportion_empty_prompts=0.2 \
  --train_batch_size 1 \
  --gradient_accumulation_steps 4 \
  --max_train_steps 3000 \
  --learning_rate 1e-05 \
  --max_grad_norm 1 \
  --lr_scheduler constant \
  --lr_warmup_steps 0 \
  --output_dir sdxl_model_output \
  --gaudi_config_name Habana/stable-diffusion \
  --throughput_warmup_steps 3 \
  --use_hpu_graphs_for_training \
  --use_hpu_graphs_for_inference \
  --bf16

wrt graph compilation time: this will be improved in future releases.
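
As a rough illustration of why lowering --resolution eases the memory pressure, the sketch below compares latent tensor sizes at 512 and 256. It assumes the usual SDXL VAE layout (8x spatial downsampling per side, 4 latent channels), which is not stated in this thread, and the helper names latent_shape and num_elements are hypothetical. Real memory use also depends on activations, optimizer state, and graph capture, so treat this as an order-of-magnitude argument only.

```python
# Rough sketch: how the latent tensor shrinks when --resolution drops from 512 to 256.
# Assumptions (not from this thread): the SDXL VAE downsamples by 8x per side and
# produces 4 latent channels. `latent_shape` and `num_elements` are hypothetical helpers.

VAE_SCALE_FACTOR = 8
LATENT_CHANNELS = 4


def latent_shape(resolution, batch_size=1):
    """Return the (N, C, H, W) shape of the latents for a square input image."""
    side = resolution // VAE_SCALE_FACTOR
    return (batch_size, LATENT_CHANNELS, side, side)


def num_elements(shape):
    """Multiply all dimensions of a shape tuple."""
    total = 1
    for dim in shape:
        total *= dim
    return total


shape_512 = latent_shape(512)
shape_256 = latent_shape(256)
ratio = num_elements(shape_512) / num_elements(shape_256)

print("latents at 512:", shape_512)
print("latents at 256:", shape_256)
print("size ratio:", ratio)
```

Halving the resolution quarters the spatial area the UNet processes per sample, which is consistent with the 256 run fitting on dl1 where the 512 run did not.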

@imangohari1
Contributor Author

@regisss
I ran some experiments on G1 with different hyperparameters. Based on the results and memory usage, @ssarkar2 and I agreed that lowering the resolution in the G1 instructions to 256 is the most straightforward fix.
I made that change in c6b0ee3.

I ran the updated G1 example and it completed:

06/13/2024 01:30:47 - INFO - __main__ - Distributed environment: NO
Num processes: 1
Process index: 0
Local process index: 0
Device: hpu

Mixed precision type: bf16

You are using a model of type clip_text_model to instantiate a model of type . This is not supported for all configurations of models and can yield errors.
You are using a model of type clip_text_model to instantiate a model of type . This is not supported for all configurations of models and can yield errors.
{'dynamic_thresholding_ratio', 'clip_sample_range', 'variance_type', 'rescale_betas_zero_snr', 'thresholding'} was not found in config. Values will be initialized to default values.
============================= HABANA PT BRIDGE CONFIGURATION ===========================
 PT_HPU_LAZY_MODE = 1
 PT_RECIPE_CACHE_PATH =
 PT_CACHE_FOLDER_DELETE = 0
 PT_HPU_RECIPE_CACHE_CONFIG =
 PT_HPU_MAX_COMPOUND_OP_SIZE = 9223372036854775807
 PT_HPU_LAZY_ACC_PAR_MODE = 1
 PT_HPU_ENABLE_REFINE_DYNAMIC_SHAPES = 0
---------------------------: System Configuration :---------------------------
Num CPU Cores : 144
CPU RAM       : 1056455416 KB
------------------------------------------------------------------------------
{'reverse_transformer_layers_per_block', 'attention_type', 'dropout'} was not found in config. Values will be initialized to default values.
Repo card metadata block was not found. Setting CardData to empty.
06/13/2024 01:30:52 - WARNING - huggingface_hub.repocard - Repo card metadata block was not found. Setting CardData to empty.
Map: 100%|██████████| 1221/1221 [01:56<00:00, 10.45 examples/s]
/usr/local/lib/python3.10/dist-packages/habana_frameworks/torch/hpu/__init__.py:158: UserWarning: torch.hpu.setDeterministic is deprecated and will be removed in next release. Please use torch.use_deterministic_algorithms instead.
  warnings.warn(
[2024-06-13 01:32:54,220] [INFO] [real_accelerator.py:178:get_accelerator] Setting ds_accelerator to hpu (auto detect)
06/13/2024 01:32:54 - INFO - __main__ - ***** Running training *****
06/13/2024 01:32:54 - INFO - __main__ -   Num examples = 1221
06/13/2024 01:32:54 - INFO - __main__ -   Num Epochs = 10
06/13/2024 01:32:54 - INFO - __main__ -   Instantaneous batch size per device = 1
06/13/2024 01:32:54 - INFO - __main__ -   Total train batch size (w. parallel, distributed & accumulation) = 4
06/13/2024 01:32:54 - INFO - __main__ -   Gradient Accumulation steps = 4
06/13/2024 01:32:54 - INFO - __main__ -   Total optimization steps = 3000
Steps: 100%|██████████| 3000/3000 [1:00:03<00:00,  1.12it/s, lr=1e-5, mem_used=25.1, step_loss=0.00386]
06/13/2024 02:32:58 - INFO - accelerate.accelerator - Saving current state to sdxl_model_output/checkpoint-3000
Configuration saved in sdxl_model_output/checkpoint-3000/unet/config.json
Model weights saved in sdxl_model_output/checkpoint-3000/unet/diffusion_pytorch_model.safetensors
06/13/2024 02:34:44 - INFO - accelerate.checkpointing - Optimizer state saved in sdxl_model_output/checkpoint-3000/optimizer.bin
06/13/2024 02:34:44 - INFO - accelerate.checkpointing - Scheduler state saved in sdxl_model_output/checkpoint-3000/scheduler.bin
06/13/2024 02:34:44 - INFO - accelerate.checkpointing - Sampler state for dataloader 0 saved in sdxl_model_output/checkpoint-3000/sampler.bin
06/13/2024 02:34:44 - INFO - accelerate.checkpointing - Random states saved in sdxl_model_output/checkpoint-3000/random_states_0.pkl
06/13/2024 02:34:44 - INFO - __main__ - Saved state to sdxl_model_output/checkpoint-3000
Steps: 100%|██████████| 3000/3000 [1:01:49<00:00,  1.12it/s, lr=1e-5, mem_used=25.1, step_loss=0.091]
06/13/2024 02:34:44 - INFO - __main__ - Throughput = 4.167025074619193 samples/s
06/13/2024 02:34:44 - INFO - __main__ - Train runtime = 2876.8725374410024 seconds
06/13/2024 02:34:44 - INFO - __main__ - Total Train runtime = 3709.792234335 seconds
Fetching 14 files: 100%|██████████| 14/14 [00:11<00:00,  1.25it/s]
{'feature_extractor', 'bf16_full_eval', 'image_encoder', 'use_habana', 'gaudi_config', 'use_hpu_graphs'} was not found in config. Values will be initialized to default values.
Loaded text_encoder_2 as CLIPTextModelWithProjection from `text_encoder_2` subfolder of stabilityai/stable-diffusion-xl-base-1.0.
Loading pipeline components...:   0%|          | 0/7 [00:00<?, ?it/s]
Loaded tokenizer as CLIPTokenizer from `tokenizer` subfolder of stabilityai/stable-diffusion-xl-base-1.0.
Loaded tokenizer_2 as CLIPTokenizer from `tokenizer_2` subfolder of stabilityai/stable-diffusion-xl-base-1.0.
Loaded text_encoder as CLIPTextModel from `text_encoder` subfolder of stabilityai/stable-diffusion-xl-base-1.0.
Loading pipeline components...: 100%|██████████| 7/7 [00:00<00:00,  7.93it/s]
[INFO|pipeline_utils.py:130] 2024-06-13 02:34:56,910 >> Enabled HPU graphs.
[INFO|configuration_utils.py:305] 2024-06-13 02:34:57,000 >> loading configuration file gaudi_config.json from cache at /root/.cache/huggingface/hub/models--Habana--stable-diffusion/snapshots/60ee357057ec90d2b183de22d0327ddd5d5a6db9/gaudi_config.json
[INFO|configuration_utils.py:358] 2024-06-13 02:34:57,000 >> GaudiConfig {
  "autocast_bf16_ops": null,
  "autocast_fp32_ops": null,
  "optimum_version": "1.20.0",
  "transformers_version": "4.40.2",
  "use_dynamic_shapes": false,
  "use_fused_adam": true,
  "use_fused_clip_norm": true,
  "use_torch_autocast": true
}

[WARNING|pipeline_utils.py:149] 2024-06-13 02:34:57,000 >> `use_torch_autocast` is True in the given Gaudi configuration but `torch_dtype=torch.bfloat16` was given. Disabling mixed precision and continuing in bf16 only.
Configuration saved in sdxl_model_output/vae/config.json
Model weights saved in sdxl_model_output/vae/diffusion_pytorch_model.safetensors
Configuration saved in sdxl_model_output/unet/config.json
Model weights saved in sdxl_model_output/unet/diffusion_pytorch_model.safetensors
Configuration saved in sdxl_model_output/scheduler/scheduler_config.json
Configuration saved in sdxl_model_output/model_index.json
[INFO|configuration_utils.py:113] 2024-06-13 02:35:45,348 >> Configuration saved in sdxl_model_output/gaudi_config.json
Steps: 100%|██████████| 3000/3000 [1:02:50<00:00,  1.26s/it, lr=1e-5, mem_used=25.1, step_loss=0.091]

The note about the first 2 steps is in the README as well.
image

Could you please review this again?
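
For anyone cross-checking the numbers in the log above, here is a small sketch that recomputes the reported throughput from the other logged values. It assumes the run used --throughput_warmup_steps 3, as in the command posted earlier in this thread, and that warmup steps are excluded from the sample count; both are assumptions about how the script measures throughput rather than something stated in the log.

```python
# Sketch: recompute the logged throughput from the other values in the log.
# Values below are copied from the training log in this comment.
per_device_batch_size = 1
gradient_accumulation_steps = 4
num_processes = 1
total_optimization_steps = 3000
train_runtime_s = 2876.8725374410024
reported_throughput = 4.167025074619193  # samples/s, as logged

# Assumption: taken from the command posted earlier in the thread.
throughput_warmup_steps = 3

# Matches "Total train batch size (w. parallel, distributed & accumulation) = 4".
total_batch_size = per_device_batch_size * num_processes * gradient_accumulation_steps

# Assumption: warmup steps are not counted when measuring throughput.
measured_steps = total_optimization_steps - throughput_warmup_steps
estimated_throughput = measured_steps * total_batch_size / train_runtime_s

print("total batch size:", total_batch_size)
print("estimated throughput (samples/s):", estimated_throughput)
print("difference from log:", abs(estimated_throughput - reported_throughput))
```

Under these assumptions the estimate agrees with the logged value to within rounding, which suggests "Train runtime" covers only the post-warmup steps while "Total Train runtime" includes warmup and checkpointing.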

@HuggingFaceDocBuilderDev

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.

@regisss
Collaborator

regisss commented Jun 13, 2024

LGTM!
@imangohari1 Can you also run make style to have the code style check pass please?

@imangohari1
Contributor Author

Sure, done in 9e45dd4.

I ran python -m pytest tests/test_diffusers.py -v -s -k "test_train_text_to_image_" and it passes:

------------------------------------------------------------------------------
{'attention_type', 'reverse_transformer_layers_per_block', 'dropout'} was not found in config. Values will be initialized to default values.
Repo card metadata block was not found. Setting CardData to empty.
06/13/2024 15:46:06 - WARNING - huggingface_hub.repocard - Repo card metadata block was not found. Setting CardData to empty.
Map: 100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 1221/1221 [01:06<00:00, 18.32 examples/s]
06/13/2024 15:47:20 - INFO - __main__ - ***** Running training *****
06/13/2024 15:47:20 - INFO - __main__ -   Num examples = 1221
06/13/2024 15:47:20 - INFO - __main__ -   Num Epochs = 1
06/13/2024 15:47:20 - INFO - __main__ -   Instantaneous batch size per device = 16
06/13/2024 15:47:20 - INFO - __main__ -   Total train batch size (w. parallel, distributed & accumulation) = 16
06/13/2024 15:47:20 - INFO - __main__ -   Gradient Accumulation steps = 1
06/13/2024 15:47:20 - INFO - __main__ -   Total optimization steps = 2
Steps: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 2/2 [09:41<00:00, 306.10s/it, lr=1e-5, mem_used=27.7, step_loss=0.34]06/13/2024 15:57:01 - INFO - accelerate.accelerator - Saving current state to /tmp/tmp6h_7me7t/checkpoint-2
Configuration saved in /tmp/tmp6h_7me7t/checkpoint-2/unet/config.json
Model weights saved in /tmp/tmp6h_7me7t/checkpoint-2/unet/diffusion_pytorch_model.safetensors
06/13/2024 16:04:24 - INFO - accelerate.checkpointing - Optimizer state saved in /tmp/tmp6h_7me7t/checkpoint-2/optimizer.bin
06/13/2024 16:04:24 - INFO - accelerate.checkpointing - Scheduler state saved in /tmp/tmp6h_7me7t/checkpoint-2/scheduler.bin
06/13/2024 16:04:24 - INFO - accelerate.checkpointing - Sampler state for dataloader 0 saved in /tmp/tmp6h_7me7t/checkpoint-2/sampler.bin
06/13/2024 16:04:24 - INFO - accelerate.checkpointing - Random states saved in /tmp/tmp6h_7me7t/checkpoint-2/random_states_0.pkl
06/13/2024 16:04:24 - INFO - __main__ - Saved state to /tmp/tmp6h_7me7t/checkpoint-2
Steps: 100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 2/2 [17:04<00:00, 512.21s/it, lr=1e-5, mem_used=27.7, step_loss=0.503]
PASSED

========================================================================================================== warnings summary ===========================================================================================================
../../../../../../usr/local/lib/python3.10/dist-packages/transformers/utils/generic.py:485
  /usr/local/lib/python3.10/dist-packages/transformers/utils/generic.py:485: UserWarning: torch.utils._pytree._register_pytree_node is deprecated. Please use torch.utils._pytree.register_pytree_node instead.
    _torch_pytree._register_pytree_node(

../../../../../../usr/local/lib/python3.10/dist-packages/transformers/utils/generic.py:342
  /usr/local/lib/python3.10/dist-packages/transformers/utils/generic.py:342: UserWarning: torch.utils._pytree._register_pytree_node is deprecated. Please use torch.utils._pytree.register_pytree_node instead.
    _torch_pytree._register_pytree_node(

../../../../../../usr/local/lib/python3.10/dist-packages/diffusers/utils/outputs.py:63
../../../../../../usr/local/lib/python3.10/dist-packages/diffusers/utils/outputs.py:63
  /usr/local/lib/python3.10/dist-packages/diffusers/utils/outputs.py:63: UserWarning: torch.utils._pytree._register_pytree_node is deprecated. Please use torch.utils._pytree.register_pytree_node instead.
    torch.utils._pytree._register_pytree_node(

../../../../../../usr/local/lib/python3.10/dist-packages/lightning_utilities/core/imports.py:14
  /usr/local/lib/python3.10/dist-packages/lightning_utilities/core/imports.py:14: DeprecationWarning: pkg_resources is deprecated as an API. See https://setuptools.pypa.io/en/latest/pkg_resources.html
    import pkg_resources

../../../../../../usr/local/lib/python3.10/dist-packages/pkg_resources/__init__.py:2832
  /usr/local/lib/python3.10/dist-packages/pkg_resources/__init__.py:2832: DeprecationWarning: Deprecated call to `pkg_resources.declare_namespace('google')`.
  Implementing implicit namespace packages (as specified in PEP 420) is preferred to `pkg_resources.declare_namespace`. See https://setuptools.pypa.io/en/latest/references/keywords.html#keyword-namespace-packages
    declare_namespace(pkg)

-- Docs: https://docs.pytest.org/en/stable/how-to/capture-warnings.html
====================================================================================== 2 passed, 60 deselected, 6 warnings in 1217.80s (0:20:17) ======================================================================================
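
The "Num Epochs" values in the logs follow from the dataset size, batch size, accumulation steps, and step budget. A minimal sketch of that arithmetic is below; it assumes a single process and a dataloader that keeps the last partial batch, and the function name num_epochs is hypothetical rather than something taken from the training script.

```python
import math

# Sketch: derive "Num Epochs" from the other values printed in the logs.
# Assumptions: single process, last partial batch is kept (no drop_last).
# `num_epochs` is a hypothetical helper, not a function from the training script.


def num_epochs(num_examples, per_device_batch_size, gradient_accumulation_steps, max_train_steps):
    """Return the number of epochs needed to reach max_train_steps optimizer updates."""
    batches_per_epoch = math.ceil(num_examples / per_device_batch_size)
    updates_per_epoch = math.ceil(batches_per_epoch / gradient_accumulation_steps)
    return math.ceil(max_train_steps / updates_per_epoch)


# Test run above: 1221 examples, batch size 16, no accumulation, 2 optimization steps.
test_run_epochs = num_epochs(1221, 16, 1, 2)

# G1 run earlier in the thread: batch size 1, accumulation 4, 3000 optimization steps.
g1_run_epochs = num_epochs(1221, 1, 4, 3000)

print("test run epochs:", test_run_epochs)
print("G1 run epochs:", g1_run_epochs)
```

Both results can be compared against the "Num Epochs" lines in the respective logs.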

@regisss regisss merged commit d08be4b into huggingface:main Jun 14, 2024
2 of 3 checks passed
@imangohari1 imangohari1 deleted the sdxl-update branch August 8, 2024 18:09