@6DammK9 commented Feb 23, 2025

Referring to Issue #1947 and PR #1359.

  • After code inspection, the base (SD1) script fine_tune.py can be merged with the concepts from train_network.py, becoming train_native.py.
  • Some network-exclusive features (--skip_until_initial_step, --validation_split) have been added (see the sketch after this list).
  • Tweaked --mem_eff_attn and --xformers so they apply more aggressive checking (probably still VAE only?).
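
For reference, the step-skipping works by fast-forwarding the data loader without running the forward/backward pass. A minimal sketch of the idea, assuming the usual epoch/step loop of the kohya trainers (not a verbatim excerpt from train_native.py; train_step is a stand-in for the real inner loop):

def run_training(args, train_dataloader, train_step, num_train_epochs):
    # Sketch of --skip_until_initial_step / --initial_step: consume batches so
    # the sampler and RNG state keep advancing, but skip the actual training
    # step until global_step reaches args.initial_step.
    global_step = 0
    for epoch in range(num_train_epochs):
        for batch in train_dataloader:
            if args.skip_until_initial_step and global_step < args.initial_step:
                global_step += 1
                continue
            train_step(batch)  # forward + backward + optimizer step
            global_step += 1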

Tested with SDXL using this CLI command (note: many features enabled):

accelerate launch sdxl_train.py 
  --pretrained_model_name_or_path="/run/media/user/PM863a/stable-diffusion-webui/models/Stable-diffusion/x215c-AstolfoMix-24101101-6e545a3.safetensors" 
  --in_json "/run/media/user/Intel P4510 3/just_astolfo/test_lat_v3.json" 
  --train_data_dir="/run/media/user/Intel P4510 3/just_astolfo/test" 
  --output_dir="/run/media/user/Intel P4510 3/astolfo_xl/just_astolfo/model_out" 
  --log_with=tensorboard --logging_dir="/run/media/user/Intel P4510 3/astolfo_xl/just_astolfo/tensorboard" --log_prefix=just_astolfo_25022301_ 
  --seed=25022301 --save_model_as=safetensors --caption_extension=".txt" --enable_wildcard 
  --use_8bit_adam 
  --learning_rate=1e-6 --train_text_encoder --learning_rate_te1=1e-5 --learning_rate_te2=1e-5 
  --max_train_epochs=4 
  --xformers --mem_eff_attn --torch_compile --dynamo_backend=inductor --gradient_checkpointing 
  --deepspeed --gradient_accumulation_steps=4 --max_grad_norm=0 
  --train_batch_size=1 --full_bf16 --mixed_precision=bf16 --save_precision=fp16 
  --enable_bucket --cache_latents 
  --save_every_n_epochs=2 
  --skip_until_initial_step --initial_step=1 --initial_epoch=1

And the following accelerate config:

accelerate config
----------------------------------------------------------------------------------------------------------------------------------------------------------------------
In which compute environment are you running?
This machine                                                                                                                                                          
----------------------------------------------------------------------------------------------------------------------------------------------------------------------
Which type of machine are you using?                                                                                                                                  
multi-GPU                                                                                                                                                             
How many different machines will you use (use more than 1 for multi-node training)? [1]: 1                                                                            
Should distributed operations be checked while running for errors? This can avoid timeout issues but will be slower. [yes/NO]: NO                                     
Do you wish to optimize your script with torch dynamo?[yes/NO]:yes                                                                                                    
----------------------------------------------------------------------------------------------------------------------------------------------------------------------
Which dynamo backend would you like to use?                                                                                                                           
inductor                                                                                                                                                              
Do you want to customize the defaults sent to torch.compile? [yes/NO]: NO                                                                                             
Do you want to use DeepSpeed? [yes/NO]: yes                                                                                                                           
Do you want to specify a json file to a DeepSpeed config? [yes/NO]: NO                                                                                                
----------------------------------------------------------------------------------------------------------------------------------------------------------------------
What should be your DeepSpeed's ZeRO optimization stage?                                                                                                              
2                                                                                                                                                                     
----------------------------------------------------------------------------------------------------------------------------------------------------------------------
Where to offload optimizer states?                                                                                                                                    
none                                                                                                                                                                  
----------------------------------------------------------------------------------------------------------------------------------------------------------------------
Where to offload parameters?                                                                                                                                          
none                                                                                                                                                                  
How many gradient accumulation steps you're passing in your script? [1]: 4                                                                                            
Do you want to use gradient clipping? [yes/NO]: NO                                                                                                                    
Do you want to enable `deepspeed.zero.Init` when using ZeRO Stage-3 for constructing massive models? [yes/NO]: NO                                                     
Do you want to enable Mixture-of-Experts training (MoE)? [yes/NO]: NO
How many GPU(s) should be used for distributed training? [1]:4
----------------------------------------------------------------------------------------------------------------------------------------------------------------------
Do you wish to use FP16 or BF16 (mixed precision)?
bf16                                                                                                                                                                  
accelerate configuration saved at /home/user/.cache/huggingface/accelerate/default_config.yaml 
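
For reference, the answers above produce a saved default_config.yaml roughly like the following (reconstructed from the prompts, not copied from the file; exact keys vary across accelerate versions):

compute_environment: LOCAL_MACHINE
distributed_type: DEEPSPEED
deepspeed_config:
  gradient_accumulation_steps: 4
  offload_optimizer_device: none
  offload_param_device: none
  zero3_init_flag: false
  zero_stage: 2
dynamo_config:
  dynamo_backend: INDUCTOR
mixed_precision: bf16
num_machines: 1
num_processes: 4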

(A bit off topic) It runs at around 15.5 s/it (4 cards × 4 accumulation steps) on 4x RTX 3090 24GB (X299 DARK, 10980XE, P4510 4TB).

@6DammK9 commented Mar 5, 2025

Expanded --skip_cache_check from #1246 to break the freeze during "multi-node training". It skips all of the os.path.exists checks, which makes loading meta_lat.json roughly 1000x faster (from ~3 hours down to ~30 seconds).
Here "multi-node" does not necessarily mean multiple DGX (8x H100) machines; multi-NUMA machines, i.e. dual-socket CPUs, also fall into this case.
Currently testing with --enable_cpu_affinity in accelerate; the bottleneck turned out to be NUMA / QPI rather than just PCIe bandwidth (i.e. a Supermicro X12DPI with SFF8654 risers is slower than an EVGA X299 Dark). I'm curious how this trainer (or any SDXL trainer) scales, whether with 3090 / 4090 or A100 / H100.
I suspect it only scales well on DGX-tier machines; otherwise multi-GPU is only suitable for sharding large models (Flux / SD3, but not SDXL). Failing that, just freeze some UNet layers or "train network" for maximum efficiency.
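
The speed-up comes from no longer stat()-ing every cached latent file while the dataset is built. A rough sketch of the pattern, with hypothetical names (the real check lives in the dataset code, roughly library/train_util.py):

import os

def latents_are_cached(npz_path: str, skip_cache_check: bool) -> bool:
    # With --skip_cache_check we trust that the .npz latent cache referenced by
    # meta_lat.json is complete, instead of issuing one os.path.exists() call per
    # image, which is what stalls on slow filesystems or NUMA-unbalanced hosts.
    if skip_cache_check:
        return True
    return os.path.exists(npz_path)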

@6DammK9 commented Mar 15, 2025

Included #1985 (which implies #1409 and #837, but not #1468). From the discussion in the GUI repo, there are more memory-efficient optimizers than AdamW8bit. However, from an academic perspective, AdamW is a good reference point for predicting the parameters of other optimizers: it is the most discussed and compared, so you don't have to search hyperparameters exhaustively. You also don't have to modify the secondary parameters (beta, eps, etc.).

accelerate launch sdxl_train_v2.py 
  --pretrained_model_name_or_path="/run/media/user/Intel P4510 3/astolfo_xl/x255c-AstolfoMix-25022801-1458190.safetensors" 
  --in_json "/run/media/user/Intel P4510 3/just_astolfo/meta_lat_v3.json" 
  --train_data_dir="/run/media/user/Intel P4510 3/just_astolfo/kohyas_finetune" 
  --output_dir="/run/media/user/Intel P4510 3/astolfo_xl/just_astolfo/model_out" 
  --log_with=tensorboard --logging_dir="/run/media/user/Intel P4510 3/astolfo_xl/just_astolfo/tensorboard" --log_prefix=just_astolfo_25030801_ 
  --seed=25030801 --save_model_as=safetensors --caption_extension=".txt" --enable_wildcard 
  --optimizer_type "pytorch_optimizer.CAME" --optimizer_args "weight_decay=1e-2" --learning_rate="1e-6" --train_text_encoder --learning_rate_te1="1e-5" --learning_rate_te2="1e-5" 
  --max_train_epochs=10 
  --xformers --gradient_checkpointing --gradient_accumulation_steps=4 --max_grad_norm=0 
  --max_data_loader_n_workers=32 --persistent_data_loader_workers  --pin_memory
  --train_batch_size=1 --full_bf16 --mixed_precision=bf16 --save_precision=fp16 
  --enable_bucket --cache_latents --skip_cache_check --save_every_n_epochs=1
#--deepspeed --mem_eff_attn --torch_compile --dynamo_backend=inductor 
#--skip_until_initial_step --initial_step=1 --initial_epoch=1
#numactl --cpunodebind=1 --membind=1 
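
For context, the --optimizer_type "pytorch_optimizer.CAME" form resolves a dotted class path and forwards --optimizer_args as keyword arguments. A simplified sketch of that resolution (the real logic is get_optimizer in library/train_util.py, which handles many more cases):

import importlib

def build_optimizer(optimizer_type: str, trainable_params, lr: float, **optimizer_kwargs):
    # "pytorch_optimizer.CAME" -> module "pytorch_optimizer", class "CAME"
    module_name, class_name = optimizer_type.rsplit(".", 1)
    optimizer_class = getattr(importlib.import_module(module_name), class_name)
    return optimizer_class(trainable_params, lr=lr, **optimizer_kwargs)

# e.g. build_optimizer("pytorch_optimizer.CAME", params, lr=1e-6, weight_decay=1e-2)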

@6DammK9 commented Apr 6, 2025

Added profiler support:

  • I was stuck with stalls in the all_reduce_training_model stage (HF Accelerate, PyTorch DDP, NCCL) on an EPYC platform (compared with Intel X299); it turned out to be caused by the CCD count.
  • Look for ProfileKwargs and with accelerator.profile() for details (I hardcoded the output to tmpfs, otherwise writing ~400MB of traces per training step would be slow); see the sketch after this list.
  • all_reduce_training_model is required for multi-GPU, multi-model training (SD = UNet + TE + VAE). Among the many generative models, this case is exclusive to SD.
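
A minimal sketch of the profiler hookup, assuming a recent accelerate where ProfileKwargs and accelerator.profile() are available (field names and the dummy model/step are illustrative, not the exact code in this PR):

import torch
from accelerate import Accelerator
from accelerate.utils import ProfileKwargs

# Write traces to tmpfs so the ~400MB-per-step dumps do not slow the run down.
profile_kwargs = ProfileKwargs(
    activities=["cpu", "cuda"],
    profile_memory=True,
    output_trace_dir="/dev/shm/sd-scripts-traces",  # assumed tmpfs mount
)
accelerator = Accelerator(kwargs_handlers=[profile_kwargs])

model = torch.nn.Linear(8, 8)  # stand-in for the U-Net / text encoders
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-6)
model, optimizer = accelerator.prepare(model, optimizer)

with accelerator.profile():
    x = torch.randn(4, 8, device=accelerator.device)
    loss = model(x).pow(2).mean()  # dummy training step
    accelerator.backward(loss)
    optimizer.step()
    optimizer.zero_grad()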

Meanwhile, fixed the initial_step calculation check and made a version of requirements.txt without any version caps. accelerate is now at 1.6.0 instead of 0.33.0; I need the major version bump for multi-GPU training.
