training gets frozen while using multiple-GPUs #29

Open

NathanYanJing opened this issue Feb 25, 2023 · 4 comments

@NathanYanJing

Super cool and amazing work!

I am writing to ask for your assistance with an issue I am encountering while training a model using A6000 GPUs. I am using the following command to run my code:

torchrun --nnodes=1 --nproc_per_node=4  train.py --model DiT-XL/2 --data-path training_data --global-batch-size 76 --num-workers 1

The problem is that training appears to freeze for a long time right after the experiment directory is created. Occasionally, it also throws the following error:

Traceback (most recent call last):
  File "/DiT/DiT/train.py", line 269, in <module>
[E ProcessGroupNCCL.cpp:456] Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data.
[E ProcessGroupNCCL.cpp:461] To avoid data inconsistency, we are taking the entire process down.
    main(args)
  File "/DiT/DiT/train.py", line 149, in main
    model = DDP(model.to(device), device_ids=[rank])
  File "/miniconda3/envs/DiT/lib/python3.10/site-packages/torch/nn/parallel/distributed.py", line 655, in __init__
    _verify_param_shape_across_processes(self.process_group, parameters)
  File "/miniconda3/envs/DiT/lib/python3.10/site-packages/torch/distributed/utils.py", line 112, in _verify_param_shape_across_processes
    return dist._verify_params_across_processes(process_group, tensors, logger)
RuntimeError: DDP expects same model across all ranks, but Rank 0 has 131 params, while rank 1 has inconsistent 0 params.
[E ProcessGroupNCCL.cpp:456] Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data.
[E ProcessGroupNCCL.cpp:461] To avoid data inconsistency, we are taking the entire process down.
Traceback (most recent call last):
  File "/DiT/DiT/train.py", line 269, in <module>
    main(args)
  File "/DiT/DiT/train.py", line 149, in main
    model = DDP(model.to(device), device_ids=[rank])
  File "/miniconda3/envs/DiT/lib/python3.10/site-packages/torch/nn/parallel/distributed.py", line 655, in __init__
    _verify_param_shape_across_processes(self.process_group, parameters)
  File "/miniconda3/envs/DiT/lib/python3.10/site-packages/torch/distributed/utils.py", line 112, in _verify_param_shape_across_processes
    return dist._verify_params_across_processes(process_group, tensors, logger)
RuntimeError: [2]: params[0] in this process with sizes [1152, 4, 1, 1] appears not to match sizes of the same param in process 0.
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 225631 closing signal SIGTERM
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 225633 closing signal SIGTERM
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 225634 closing signal SIGTERM
ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: -6) local_rank: 1 (pid: 225632) of binary: /miniconda3/envs/DiT/bin/python
Traceback (most recent call last):
  File "/miniconda3/envs/DiT/bin/torchrun", line 33, in <module>
    sys.exit(load_entry_point('torch==1.13.1', 'console_scripts', 'torchrun')())
  File "/miniconda3/envs/DiT/lib/python3.10/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 346, in wrapper
    return f(*args, **kwargs)
  File "/miniconda3/envs/DiT/lib/python3.10/site-packages/torch/distributed/run.py", line 762, in main
    run(args)
  File "/miniconda3/envs/DiT/lib/python3.10/site-packages/torch/distributed/run.py", line 753, in run
    elastic_launch(
  File "/miniconda3/envs/DiT/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 132, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
  File "/miniconda3/envs/DiT/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 246, in launch_agent
    raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:

I have not experienced this problem when training with 1, 2, or 3 GPUs.

I apologize for my lack of experience in this area, but could you please provide any insights or guidance to help me resolve this issue? Thank you for your assistance.
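In case it helps with debugging, more verbose NCCL / torch.distributed logging can be enabled before the run. These are standard PyTorch/NCCL environment variables, nothing specific to this repo; a sketch, assuming they are set before any distributed initialization:

# At the very top of train.py, before torch.distributed is initialized:
import os
os.environ.setdefault("NCCL_DEBUG", "INFO")                 # per-rank NCCL logs
os.environ.setdefault("TORCH_DISTRIBUTED_DEBUG", "DETAIL")  # extra DDP consistency checks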

NathanYanJing changed the title from "DDP stucks somewhere" to "training gets frozen while using multiple-GPUs" on Feb 25, 2023
@NathanYanJing (Author) commented Feb 26, 2023

Problem solved for now! In case anyone else runs into a similar issue:

If you are using a single node with multiple GPUs, a hacky workaround is to replace DDP with DataParallel:

from torch.nn import DataParallel as DDP

Alternatively, you can try

torch.multiprocessing.set_start_method('spawn', force=True)

but you may then need to rewrite the lambda functions to avoid pickling issues.
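A minimal sketch of the two workarounds above (pick one or the other; the caveats are the ones raised in this thread, and this is the hack described here, not the repo's intended multi-GPU setup):

# Option 1 (hack): alias DataParallel as DDP so the rest of train.py is unchanged.
# Caveat from the reply below: DataParallel is single-process, so code in
# train.py that calls torch.distributed collectives may no longer work.
from torch.nn import DataParallel as DDP

# Option 2: keep real DDP but force the 'spawn' start method for worker
# processes; lambda transforms may then need to become picklable functions.
import torch.multiprocessing as mp
mp.set_start_method('spawn', force=True)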

@wpeebles (Contributor) commented Feb 26, 2023

Hi @NathanYanJing. Your torchrun command runs fine for me without any modifications to the code (also on a single-node, multi-GPU setup), and I haven't run into the error you're getting before. Depending on how you're launching the script, you might want to be careful with the DDP --> DataParallel change, since it could alter the behavior of the parts of train.py that rely on distributed ops (in general, I'm not sure DataParallel plays nicely with torch.distributed).

@NathanYanJing (Author)

Hi @wpeebles, thanks for your reply! Yeah, I agree that using torch.distributed is the better choice.

Yes, it seems the problem has somehow come back; it now hangs in the DataLoader part. I am guessing this is probably an NCCL / NVIDIA version issue. Would you mind sharing your NCCL and CUDA versions?
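(For reference, the versions a given PyTorch build ships with can be printed directly; these are standard torch attributes, nothing specific to this repo:)

import torch
print("torch:", torch.__version__)
print("CUDA (build):", torch.version.cuda)
print("cuDNN:", torch.backends.cudnn.version())
print("NCCL:", torch.cuda.nccl.version())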

NathanYanJing reopened this on Feb 26, 2023
@sancunluomu

Hi, I encountered a similar issue. I hit this error at model = DDP(model, device_ids=[rank]) and at dist.all_reduce(avg_loss, op=dist.ReduceOp.SUM):

ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: -11) local_rank: 0 (pid: 1289668) of binary: /opt/conda/bin/python
Traceback (most recent call last):
  File "/opt/conda/bin/torchrun", line 33, in <module>
    sys.exit(load_entry_point('torch==2.0.1+cu118', 'console_scripts', 'torchrun')())
  File "/opt/conda/lib/python3.8/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 346, in wrapper
    return f(*args, **kwargs)
  File "/opt/conda/lib/python3.8/site-packages/torch/distributed/run.py", line 794, in main
    run(args)
  File "/opt/conda/lib/python3.8/site-packages/torch/distributed/run.py", line 785, in run
    elastic_launch(
  File "/opt/conda/lib/python3.8/site-packages/torch/distributed/launcher/api.py", line 134, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
  File "/opt/conda/lib/python3.8/site-packages/torch/distributed/launcher/api.py", line 250, in launch_agent
    raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:

After following your method and using from torch.nn import DataParallel as DDP, I was able to resolve the first error, but the second one (at dist.all_reduce) still persists. Do you have any suggestions for this situation? Thanks for your reply!
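(A hedged sketch, not the repo's code: with DataParallel there is only one process and usually no process group, so one way to keep the quoted all_reduce line from crashing is to guard it; avg_loss below is a placeholder standing in for the tensor in the quoted snippet.)

import torch
import torch.distributed as dist

avg_loss = torch.tensor(0.0)  # placeholder for the loss tensor in the quoted line

# Only run the collective when a process group actually exists; with a
# single DataParallel process, avg_loss already holds the value we want.
if dist.is_available() and dist.is_initialized():
    dist.all_reduce(avg_loss, op=dist.ReduceOp.SUM)
    avg_loss = avg_loss / dist.get_world_size()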
