Fix DDP issues and Support DDP for all training scripts #448

Merged
13 commits merged into kohya-ss:dev from the DDP_fix branch on May 3, 2023

Conversation

Isotr0py
Contributor

@Isotr0py Isotr0py commented Apr 25, 2023

@Isotr0py Isotr0py marked this pull request as draft April 25, 2023 17:11
@Isotr0py
Contributor Author

Isotr0py commented Apr 25, 2023

Well, some issues occurred when using train_db.py with image sampling enabled, while train_network.py works normally.
I'm not sure whether the other scripts have this issue.
I need some time to track down the problem and fully test all the scripts.

@alleniver

nice work!

@Isotr0py Isotr0py marked this pull request as ready for review April 26, 2023 05:38
@kohya-ss
Owner

Thank you for this! It looks good!

Unfortunately I am unable to test distributed training myself, but as soon as I have time, I will verify that it still works on a single GPU as well.

@kohya-ss kohya-ss changed the base branch from main to dev May 3, 2023 01:23
@kohya-ss kohya-ss merged commit e1143ca into kohya-ss:dev May 3, 2023
@Isotr0py Isotr0py deleted the DDP_fix branch May 3, 2023 02:19
@kohya-ss
Owner

kohya-ss commented May 3, 2023

I've merged. Thank you again!

@bmaltais bmaltais mentioned this pull request May 5, 2023
@nofacedeepfake

Hello, thank you very much for developing this. How do I run training on two 3090 Ti cards?

@Isotr0py
Contributor Author

Isotr0py commented May 6, 2023

@nofacedeepfake You may need to change the accelerate config as follows:

- compute_environment: LOCAL_MACHINE
- distributed_type: MULTI_GPU
- mixed_precision: no
- use_cpu: False
- num_processes: 2
- machine_rank: 0
- num_machines: 1
- main_process_ip: None
- main_process_port: None
- main_training_function: main
- deepspeed_config: {}
- fsdp_config: {}

So you can do either of the following:

  • Method 1: run accelerate config and answer:
    Which type of machine are you using? -> multi-GPU
    How many GPU(s) should be used for distributed training? -> <Number of GPUs> (2 in this case)
    Other than that, the default settings are fine.

  • Method 2: add --multi_gpu to your training command line, e.g.:
    accelerate launch --multi_gpu --num_processes=2 train_network.py
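
In case it helps, here is a minimal sanity check (my own sketch, not part of sd-scripts; the file name check_dist.py is just an example) that you can launch the same way, to confirm that two processes start and that each one gets its own GPU before pointing the command at train_network.py:

# check_dist.py - hypothetical helper, not part of this repository.
# Launch with: accelerate launch --multi_gpu --num_processes=2 check_dist.py
from accelerate import Accelerator

accelerator = Accelerator()
# Each process prints its rank and the CUDA device it was assigned.
print(f"process {accelerator.process_index} / {accelerator.num_processes} on {accelerator.device}")

If one line prints for cuda:0 and another for cuda:1, the distributed launch itself is working.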

@nofacedeepfake

Thank you very much. Can you describe in detail which file I need to edit? I don't quite understand where to change these settings:

  • compute_environment: LOCAL_MACHINE
  • distributed_type: MULTI_GPU
  • mixed_precision: no
  • use_cpu: False
  • num_processes: 2
  • machine_rank: 0
  • num_machines: 1
  • main_process_ip: None
  • main_process_port: None
  • main_training_function: main
  • deepspeed_config: {}
  • fsdp_config: {}

or where to add --multi_gpu on the training command line, e.g.:
accelerate launch --multi_gpu --num_processes=2 train_network.py


@Isotr0py
Contributor Author

Isotr0py commented May 9, 2023

@nofacedeepfake Sorry, my answer may have been unclear. There is no need to replace any file manually.

You just need to run accelerate config again to regenerate the config file. The Accelerate Config Doc and Accelerate CMD Doc may be helpful.

And if you don't want to change the accelerate config file at all, you can simply change the training command line instead, e.g.:
accelerate launch --multi_gpu --num_processes=2 train_network.py
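
Also, before launching, it may be worth checking that PyTorch can actually see both cards. A quick check (just a sketch, run with plain python, not part of the training scripts) would be something like:

# Quick GPU visibility check.
import torch

print(torch.cuda.device_count())  # expect 2
for i in range(torch.cuda.device_count()):
    print(i, torch.cuda.get_device_name(i))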

@nofacedeepfake

Neither method worked:
[W ..\torch\csrc\distributed\c10d\socket.cpp:601] [c10d] The client socket has failed to connect to [DESKTOP-VTVG7AS]:29500 (system error: 10049 - The requested address is not valid in its context.)  (this warning repeats several times)

C:\Kohya\kohya_ss\venv\lib\site-packages\accelerate\accelerator.py:249: FutureWarning: logging_dir is deprecated and will be removed in version 0.18.0 of 🤗 Accelerate. Use project_dir instead.

Traceback (most recent call last):
  File "C:\Kohya\kohya_ss\train_network.py", line 773, in <module>
    train(args)
  File "C:\Kohya\kohya_ss\train_network.py", line 139, in train
    accelerator, unwrap_model = train_util.prepare_accelerator(args)
  File "C:\Kohya\kohya_ss\library\train_util.py", line 2875, in prepare_accelerator
    accelerator = Accelerator(
  File "C:\Kohya\kohya_ss\venv\lib\site-packages\accelerate\accelerator.py", line 346, in __init__
    self.state = AcceleratorState(
  File "C:\Kohya\kohya_ss\venv\lib\site-packages\accelerate\state.py", line 540, in __init__
    PartialState(cpu, **kwargs)
  File "C:\Kohya\kohya_ss\venv\lib\site-packages\accelerate\state.py", line 129, in __init__
    torch.distributed.init_process_group(backend="nccl", **kwargs)
  File "C:\Kohya\kohya_ss\venv\lib\site-packages\torch\distributed\distributed_c10d.py", line 895, in init_process_group
    default_pg = _new_process_group_helper(
  File "C:\Kohya\kohya_ss\venv\lib\site-packages\torch\distributed\distributed_c10d.py", line 998, in _new_process_group_helper
    raise RuntimeError("Distributed package doesn't have NCCL " "built in")
RuntimeError: Distributed package doesn't have NCCL built in

ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: 1) local_rank: 0 (pid: 45044) of binary: C:\Kohya\kohya_ss\venv\Scripts\python.exe
Traceback (most recent call last):
  File "C:\Users\79255\AppData\Local\Programs\Python\Python310\lib\runpy.py", line 196, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "C:\Users\79255\AppData\Local\Programs\Python\Python310\lib\runpy.py", line 86, in _run_code
    exec(code, run_globals)
  File "C:\Kohya\kohya_ss\venv\Scripts\accelerate.exe\__main__.py", line 7, in <module>
  File "C:\Kohya\kohya_ss\venv\lib\site-packages\accelerate\commands\accelerate_cli.py", line 45, in main
    args.func(args)
  File "C:\Kohya\kohya_ss\venv\lib\site-packages\accelerate\commands\launch.py", line 914, in launch_command
    multi_gpu_launcher(args)
  File "C:\Kohya\kohya_ss\venv\lib\site-packages\accelerate\commands\launch.py", line 603, in multi_gpu_launcher
    distrib_run.run(args)
  File "C:\Kohya\kohya_ss\venv\lib\site-packages\torch\distributed\run.py", line 785, in run
    elastic_launch(
  File "C:\Kohya\kohya_ss\venv\lib\site-packages\torch\distributed\launcher\api.py", line 134, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
  File "C:\Kohya\kohya_ss\venv\lib\site-packages\torch\distributed\launcher\api.py", line 250, in launch_agent

@nofacedeepfake

There are three video cards in my system

@nofacedeepfake

Two 3090 Ti cards with NVLink and one 3060.

@nofacedeepfake

accelerate launch --gpu_ids=0,1 --multi_gpu --num_processes=2 --num_cpu_threads_per_process=8 "train_network.py"
and it gives the same error.

@Isotr0py
Contributor Author

Isotr0py commented May 13, 2023

@nofacedeepfake I think this error comes from the Windows environment, which does not have NCCL.
According to #247 , you can add the code below to the top of train_network.py:

import os
os.environ["PL_TORCH_DISTRIBUTED_BACKEND"] = "gloo"

It's reported to work on Windows, but I don't have a Windows environment with multiple GPUs to test it :(
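
If you'd rather only apply the workaround where it is needed, a small variation (just a sketch, using the same environment variable as above) would set it only on Windows:

import os
import sys

# NCCL is not shipped with the Windows build of PyTorch, so fall back to the gloo backend there.
# On Linux this leaves the default backend untouched.
if sys.platform == "win32":
    os.environ["PL_TORCH_DISTRIBUTED_BACKEND"] = "gloo"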

@nofacedeepfake

https://ibb.co/48870VD

@CCJetWing

I have set the accelerate launch arguments but get another error:

NOTE: Redirects are currently not supported in Windows or MacOs.
Using RTX 3090 or 4000 series which doesn't support faster communication speedups. Ensuring P2P and IB communications are disabled.
Traceback (most recent call last):
File "C:\Users\User\AppData\Local\Programs\Python\Python310\lib\runpy.py", line 196, in _run_module_as_main
return _run_code(code, main_globals, None,
File "C:\Users\User\AppData\Local\Programs\Python\Python310\lib\runpy.py", line 86, in _run_code
exec(code, run_globals)
File "D:\NEW\lora-scripts\Venv\lib\site-packages\accelerate\commands\launch.py", line 1027, in
main()
File "D:\NEW\lora-scripts\Venv\lib\site-packages\accelerate\commands\launch.py", line 1023, in main
launch_command(args)
File "D:\NEW\lora-scripts\Venv\lib\site-packages\accelerate\commands\launch.py", line 1008, in launch_command
multi_gpu_launcher(args)
File "D:\NEW\lora-scripts\Venv\lib\site-packages\accelerate\commands\launch.py", line 666, in multi_gpu_launcher
distrib_run.run(args)
File "D:\NEW\lora-scripts\Venv\lib\site-packages\torch\distributed\run.py", line 785, in run
elastic_launch(
File "D:\NEW\lora-scripts\Venv\lib\site-packages\torch\distributed\launcher\api.py", line 134, in call
return launch_agent(self._config, self._entrypoint, list(args))
File "D:\NEW\lora-scripts\Venv\lib\site-packages\torch\distributed\launcher\api.py", line 241, in launch_agent
result = agent.run()
File "D:\NEW\lora-scripts\Venv\lib\site-packages\torch\distributed\elastic\metrics\api.py", line 129, in wrapper
result = f(*args, **kwargs)
File "D:\NEW\lora-scripts\Venv\lib\site-packages\torch\distributed\elastic\agent\server\api.py", line 723, in run
result = self._invoke_run(role)
File "D:\NEW\lora-scripts\Venv\lib\site-packages\torch\distributed\elastic\agent\server\api.py", line 858, in _invoke_run
self._initialize_workers(self._worker_group)
File "D:\NEW\lora-scripts\Venv\lib\site-packages\torch\distributed\elastic\metrics\api.py", line 129, in wrapper
result = f(*args, **kwargs)
File "D:\NEW\lora-scripts\Venv\lib\site-packages\torch\distributed\elastic\agent\server\api.py", line 692, in _initialize_workers
self._rendezvous(worker_group)
File "D:\NEW\lora-scripts\Venv\lib\site-packages\torch\distributed\elastic\metrics\api.py", line 129, in wrapper
result = f(*args, **kwargs)
File "D:\NEW\lora-scripts\Venv\lib\site-packages\torch\distributed\elastic\agent\server\api.py", line 546, in _rendezvous
store, group_rank, group_world_size = spec.rdzv_handler.next_rendezvous()
File "D:\NEW\lora-scripts\Venv\lib\site-packages\torch\distributed\elastic\rendezvous\static_tcp_rendezvous.py", line 55, in next_rendezvous
self._store = TCPStore( # type: ignore[call-arg]
RuntimeError: unmatched '}' in format string

How to fix this?

@kohya-ss
Owner

Judging from pytorch/pytorch#100185 , this probably isn't an sd-scripts issue...
