Fix DDP issues and Support DDP for all training scripts #448
Conversation
Isotr0py commented Apr 25, 2023 (edited)
- Fix [Bug] LORA train.py fails in DDP multi-gpu mode with --network_train_unet_only flag because of None #365
- Fix [Request] Fix multi-gpu training for train_db.py and fine_tune.py #359
- Support DDP for all training scripts
Well, some issues occurred using
nice work!
Thank you for this! It looks good! Unfortunately I am unable to test distributed training, but as soon as I have time, I will verify that it works on a single GPU as well.
I've merged. Thank you again!
Hello, thank you very much for developing in this direction. How do I run it on two 3090 Ti cards?
@nofacedeepfake You may need to change the accelerate config as below:

```yaml
compute_environment: LOCAL_MACHINE
distributed_type: MULTI_GPU
mixed_precision: no
use_cpu: False
num_processes: 2
machine_rank: 0
num_machines: 1
main_process_ip: None
main_process_port: None
main_training_function: main
deepspeed_config: {}
fsdp_config: {}
```

So you can do as follows:
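For reference, a sketch of where these keys live, assuming Accelerate's usual default config location (the exact path may differ per install):

```shell
# Regenerate the config interactively; the prompts map to the YAML keys above
accelerate config

# Or inspect/edit the saved file directly (typical default location)
cat ~/.cache/huggingface/accelerate/default_config.yaml
```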
Thank you very much. Can you describe in detail which file I replace it in? I don't quite understand where to change compute_environment: LOCAL_MACHINE
and add --multi_gpu on your training command line, e.g.:
@nofacedeepfake My answer may have been unclear. There is no need to replace any file manually. You just need to run
And if you don't want to change the accelerate config file, you can just change the train command line to run the training script like:
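A hedged sketch of such a command line, mirroring the launch flags that appear elsewhere in this thread (script name and values are illustrative, adjust for your setup):

```shell
# Pass the multi-GPU options directly to accelerate launch instead of
# editing the config file; flags override the saved config for this run
accelerate launch --multi_gpu --num_processes=2 --num_cpu_threads_per_process=8 train_network.py
```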
No method worked. C:\Kohya\kohya_ss\venv\lib\site-packages\accelerate\accelerator.py:249: FutureWarning:
There are three video cards in my system: two 3090 Ti (NVLink) and a 3060.
```shell
accelerate launch --gpu_ids=0,1 --multi_gpu --num_processes=2 --num_cpu_threads_per_process=8 "train_network.py"
```
@nofacedeepfake I think it's an error from the Windows environment. You can try setting the distributed backend to gloo by adding this to the training script:

```python
import os
os.environ["PL_TORCH_DISTRIBUTED_BACKEND"] = "gloo"
```

It's reported to work on Windows, but I don't have a Windows environment with multiple GPUs to test it :(
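The same workaround can also be applied without editing the training script: set the variable in the launching process so that child processes inherit it. A minimal sketch, assuming the variable name from the comment above (the launch command and script name are illustrative):

```python
import os
import subprocess

# Copy the parent environment and add the backend override; processes
# spawned with this env dict will inherit the variable.
env = os.environ.copy()
env["PL_TORCH_DISTRIBUTED_BACKEND"] = "gloo"

# Illustrative launch command; replace with your actual accelerate invocation,
# passing env=env to subprocess.run so the override reaches the workers:
# subprocess.run(cmd, env=env)
cmd = ["accelerate", "launch", "--multi_gpu", "--num_processes=2", "train_network.py"]
print("Would run:", " ".join(cmd), "with backend:", env["PL_TORCH_DISTRIBUTED_BACKEND"])
```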
I have set the accelerate launch arguments but get another error: NOTE: Redirects are currently not supported in Windows or MacOs. How can I fix this?
According to pytorch/pytorch#100185, maybe. I don't think this is an sd-scripts issue...