@6DammK9 commented Feb 23, 2025

Referring to Issue #1947 and PR #1359.

  • After code inspection, the base (SD1) script fine_tune.py can be merged with the concepts from train_network.py, becoming train_native.py.
  • Some network-exclusive features (--skip_until_initial_step, --validation_split) have been added (see the sketch after this list).
  • Tweaked --mem_eff_attn and --xformers so they apply more aggressive checking (probably still VAE only?).
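
For reference, the step-skipping works by fast-forwarding the data loader without running the forward/backward pass. A minimal sketch of the idea, assuming the usual epoch/step loop of the kohya trainers (not a verbatim excerpt from train_native.py; train_step is a stand-in for the real inner loop):

def run_training(args, train_dataloader, train_step, num_train_epochs):
    # Sketch of --skip_until_initial_step / --initial_step: consume batches so
    # the sampler and RNG state keep advancing, but skip the actual training
    # step until global_step reaches args.initial_step.
    global_step = 0
    for epoch in range(num_train_epochs):
        for batch in train_dataloader:
            if args.skip_until_initial_step and global_step < args.initial_step:
                global_step += 1
                continue
            train_step(batch)  # forward + backward + optimizer step
            global_step += 1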

Tested with SDXL using this CLI command (note: many features enabled):

accelerate launch sdxl_train.py 
  --pretrained_model_name_or_path="/run/media/user/PM863a/stable-diffusion-webui/models/Stable-diffusion/x215c-AstolfoMix-24101101-6e545a3.safetensors" 
  --in_json "/run/media/user/Intel P4510 3/just_astolfo/test_lat_v3.json" 
  --train_data_dir="/run/media/user/Intel P4510 3/just_astolfo/test" 
  --output_dir="/run/media/user/Intel P4510 3/astolfo_xl/just_astolfo/model_out" 
  --log_with=tensorboard --logging_dir="/run/media/user/Intel P4510 3/astolfo_xl/just_astolfo/tensorboard" --log_prefix=just_astolfo_25022301_ 
  --seed=25022301 --save_model_as=safetensors --caption_extension=".txt" --enable_wildcard 
  --use_8bit_adam 
  --learning_rate=1e-6 --train_text_encoder --learning_rate_te1=1e-5 --learning_rate_te2=1e-5 
  --max_train_epochs=4 
  --xformers --mem_eff_attn --torch_compile --dynamo_backend=inductor --gradient_checkpointing 
  --deepspeed --gradient_accumulation_steps=4 --max_grad_norm=0 
  --train_batch_size=1 --full_bf16 --mixed_precision=bf16 --save_precision=fp16 
  --enable_bucket --cache_latents 
  --save_every_n_epochs=2 
  --skip_until_initial_step --initial_step=1 --initial_epoch=1

And the following accelerate config:

accelerate config
----------------------------------------------------------------------------------------------------------------------------------------------------------------------
In which compute environment are you running?
This machine                                                                                                                                                          
----------------------------------------------------------------------------------------------------------------------------------------------------------------------
Which type of machine are you using?                                                                                                                                  
multi-GPU                                                                                                                                                             
How many different machines will you use (use more than 1 for multi-node training)? [1]: 1                                                                            
Should distributed operations be checked while running for errors? This can avoid timeout issues but will be slower. [yes/NO]: NO                                     
Do you wish to optimize your script with torch dynamo?[yes/NO]:yes                                                                                                    
----------------------------------------------------------------------------------------------------------------------------------------------------------------------
Which dynamo backend would you like to use?                                                                                                                           
inductor                                                                                                                                                              
Do you want to customize the defaults sent to torch.compile? [yes/NO]: NO                                                                                             
Do you want to use DeepSpeed? [yes/NO]: yes                                                                                                                           
Do you want to specify a json file to a DeepSpeed config? [yes/NO]: NO                                                                                                
----------------------------------------------------------------------------------------------------------------------------------------------------------------------
What should be your DeepSpeed's ZeRO optimization stage?                                                                                                              
2                                                                                                                                                                     
----------------------------------------------------------------------------------------------------------------------------------------------------------------------
Where to offload optimizer states?                                                                                                                                    
none                                                                                                                                                                  
----------------------------------------------------------------------------------------------------------------------------------------------------------------------
Where to offload parameters?                                                                                                                                          
none                                                                                                                                                                  
How many gradient accumulation steps you're passing in your script? [1]: 4                                                                                            
Do you want to use gradient clipping? [yes/NO]: NO                                                                                                                    
Do you want to enable `deepspeed.zero.Init` when using ZeRO Stage-3 for constructing massive models? [yes/NO]: NO                                                     
Do you want to enable Mixture-of-Experts training (MoE)? [yes/NO]: NO
How many GPU(s) should be used for distributed training? [1]:4
----------------------------------------------------------------------------------------------------------------------------------------------------------------------
Do you wish to use FP16 or BF16 (mixed precision)?
bf16                                                                                                                                                                  
accelerate configuration saved at /home/user/.cache/huggingface/accelerate/default_config.yaml 
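
For reference, the answers above produce a saved default_config.yaml roughly like the following (reconstructed from the prompts, not copied from the file; exact keys vary across accelerate versions):

compute_environment: LOCAL_MACHINE
distributed_type: DEEPSPEED
deepspeed_config:
  gradient_accumulation_steps: 4
  offload_optimizer_device: none
  offload_param_device: none
  zero3_init_flag: false
  zero_stage: 2
dynamo_config:
  dynamo_backend: INDUCTOR
mixed_precision: bf16
num_machines: 1
num_processes: 4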

(A bit off topic) It runs at around 15.5 s/it (4 cards × 4 accumulation steps) on 4x RTX 3090 24GB (X299 DARK, 10980XE, P4510 4TB).

@6DammK9 commented Mar 5, 2025

Expanded --skip_cache_check from #1246 to break the freeze during "multi-node training". It skips all of the os.path.exists checks, which makes loading meta_lat.json roughly 1000x faster (from ~3 hours down to ~30 seconds).
Here "multi-node" does not necessarily mean multiple DGX (8x H100) machines; multi-NUMA machines, i.e. dual-socket CPUs, also fall into this case.
Currently testing with --enable_cpu_affinity in accelerate; the bottleneck turned out to be NUMA / QPI rather than just PCIe bandwidth (i.e. a Supermicro X12DPI with SFF8654 risers is slower than an EVGA X299 Dark). I'm curious how this trainer (or any SDXL trainer) scales, whether with 3090 / 4090 or A100 / H100.
I suspect it only scales well on DGX-tier machines; otherwise multi-GPU is only suitable for sharding large models (Flux / SD3, but not SDXL). Failing that, just freeze some UNet layers or "train network" for maximum efficiency.
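
The speed-up comes from no longer stat()-ing every cached latent file while the dataset is built. A rough sketch of the pattern, with hypothetical names (the real check lives in the dataset code, roughly library/train_util.py):

import os

def latents_are_cached(npz_path: str, skip_cache_check: bool) -> bool:
    # With --skip_cache_check we trust that the .npz latent cache referenced by
    # meta_lat.json is complete, instead of issuing one os.path.exists() call per
    # image, which is what stalls on slow filesystems or NUMA-unbalanced hosts.
    if skip_cache_check:
        return True
    return os.path.exists(npz_path)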

@6DammK9 commented Mar 15, 2025

Included #1985 (which implies #1409 and #837, but not #1468). From the discussion in the GUI repo, there are more memory-efficient optimizers than AdamW8bit. However, from an academic perspective, AdamW is a good reference point for predicting the parameters of other optimizers: it is the most discussed and compared, so you don't have to search hyperparameters exhaustively. You also don't have to modify the secondary parameters (beta, eps, etc.).

accelerate launch sdxl_train_v2.py 
  --pretrained_model_name_or_path="/run/media/user/Intel P4510 3/astolfo_xl/x255c-AstolfoMix-25022801-1458190.safetensors" 
  --in_json "/run/media/user/Intel P4510 3/just_astolfo/meta_lat_v3.json" 
  --train_data_dir="/run/media/user/Intel P4510 3/just_astolfo/kohyas_finetune" 
  --output_dir="/run/media/user/Intel P4510 3/astolfo_xl/just_astolfo/model_out" 
  --log_with=tensorboard --logging_dir="/run/media/user/Intel P4510 3/astolfo_xl/just_astolfo/tensorboard" --log_prefix=just_astolfo_25030801_ 
  --seed=25030801 --save_model_as=safetensors --caption_extension=".txt" --enable_wildcard 
  --optimizer_type "pytorch_optimizer.CAME" --optimizer_args "weight_decay=1e-2" --learning_rate="1e-6" --train_text_encoder --learning_rate_te1="1e-5" --learning_rate_te2="1e-5" 
  --max_train_epochs=10 
  --xformers --gradient_checkpointing --gradient_accumulation_steps=4 --max_grad_norm=0 
  --max_data_loader_n_workers=32 --persistent_data_loader_workers  --pin_memory
  --train_batch_size=1 --full_bf16 --mixed_precision=bf16 --save_precision=fp16 
  --enable_bucket --cache_latents --skip_cache_check --save_every_n_epochs=1
#--deepspeed --mem_eff_attn --torch_compile --dynamo_backend=inductor 
#--skip_until_initial_step --initial_step=1 --initial_epoch=1
#numactl --cpunodebind=1 --membind=1 
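
For context, the --optimizer_type "pytorch_optimizer.CAME" form resolves a dotted class path and forwards --optimizer_args as keyword arguments. A simplified sketch of that resolution (the real logic is get_optimizer in library/train_util.py, which handles many more cases):

import importlib

def build_optimizer(optimizer_type: str, trainable_params, lr: float, **optimizer_kwargs):
    # "pytorch_optimizer.CAME" -> module "pytorch_optimizer", class "CAME"
    module_name, class_name = optimizer_type.rsplit(".", 1)
    optimizer_class = getattr(importlib.import_module(module_name), class_name)
    return optimizer_class(trainable_params, lr=lr, **optimizer_kwargs)

# e.g. build_optimizer("pytorch_optimizer.CAME", params, lr=1e-6, weight_decay=1e-2)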

@6DammK9 commented Apr 6, 2025

Added profiler support:

  • I was stuck with stalls in the all_reduce_training_model stage (HF Accelerate, PyTorch DDP, NCCL) on an EPYC platform (compared with Intel X299); it turned out to be caused by the CCD count.
  • Look for ProfileKwargs and with accelerator.profile() for details (I hardcoded the output to tmpfs, otherwise writing ~400MB of traces per training step would be slow); see the sketch after this list.
  • all_reduce_training_model is required for multi-GPU, multi-model training (SD = UNet + TE + VAE). Among the many generative models, this case is exclusive to SD.
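
A minimal sketch of the profiler hookup, assuming a recent accelerate where ProfileKwargs and accelerator.profile() are available (field names and the dummy model/step are illustrative, not the exact code in this PR):

import torch
from accelerate import Accelerator
from accelerate.utils import ProfileKwargs

# Write traces to tmpfs so the ~400MB-per-step dumps do not slow the run down.
profile_kwargs = ProfileKwargs(
    activities=["cpu", "cuda"],
    profile_memory=True,
    output_trace_dir="/dev/shm/sd-scripts-traces",  # assumed tmpfs mount
)
accelerator = Accelerator(kwargs_handlers=[profile_kwargs])

model = torch.nn.Linear(8, 8)  # stand-in for the U-Net / text encoders
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-6)
model, optimizer = accelerator.prepare(model, optimizer)

with accelerator.profile():
    x = torch.randn(4, 8, device=accelerator.device)
    loss = model(x).pow(2).mean()  # dummy training step
    accelerator.backward(loss)
    optimizer.step()
    optimizer.zero_grad()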

Meanwhile, fixed the initial_step calculation check and made a version of requirements.txt without any version caps. accelerate is now at 1.6.0 instead of 0.33.0; I need the major version bump for multi-GPU training.
