training gets frozen while using multiple-GPUs #29

Open

NathanYanJing opened this issue Feb 25, 2023 · 4 comments

@NathanYanJing

Super cool and amazing work!

I am writing to ask for your assistance with an issue I am encountering while training a model using A6000 GPUs. I am using the following command to run my code:

torchrun --nnodes=1 --nproc_per_node=4  train.py --model DiT-XL/2 --data-path training_data --global-batch-size 76 --num-workers 1

The problem is that training appears to freeze for a long time right after the experiment directory is created. Occasionally, it also throws the following error:

Traceback (most recent call last):
  File "/DiT/DiT/train.py", line 269, in <module>
[E ProcessGroupNCCL.cpp:456] Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data.
[E ProcessGroupNCCL.cpp:461] To avoid data inconsistency, we are taking the entire process down.
    main(args)
  File "/DiT/DiT/train.py", line 149, in main
    model = DDP(model.to(device), device_ids=[rank])
  File "/miniconda3/envs/DiT/lib/python3.10/site-packages/torch/nn/parallel/distributed.py", line 655, in __init__
    _verify_param_shape_across_processes(self.process_group, parameters)
  File "/miniconda3/envs/DiT/lib/python3.10/site-packages/torch/distributed/utils.py", line 112, in _verify_param_shape_across_processes
    return dist._verify_params_across_processes(process_group, tensors, logger)
RuntimeError: DDP expects same model across all ranks, but Rank 0 has 131 params, while rank 1 has inconsistent 0 params.
[E ProcessGroupNCCL.cpp:456] Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data.
[E ProcessGroupNCCL.cpp:461] To avoid data inconsistency, we are taking the entire process down.
Traceback (most recent call last):
  File "/DiT/DiT/train.py", line 269, in <module>
    main(args)
  File "/DiT/DiT/train.py", line 149, in main
    model = DDP(model.to(device), device_ids=[rank])
  File "/miniconda3/envs/DiT/lib/python3.10/site-packages/torch/nn/parallel/distributed.py", line 655, in __init__
    _verify_param_shape_across_processes(self.process_group, parameters)
  File "/miniconda3/envs/DiT/lib/python3.10/site-packages/torch/distributed/utils.py", line 112, in _verify_param_shape_across_processes
    return dist._verify_params_across_processes(process_group, tensors, logger)
RuntimeError: [2]: params[0] in this process with sizes [1152, 4, 1, 1] appears not to match sizes of the same param in process 0.
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 225631 closing signal SIGTERM
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 225633 closing signal SIGTERM
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 225634 closing signal SIGTERM
ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: -6) local_rank: 1 (pid: 225632) of binary: /miniconda3/envs/DiT/bin/python
Traceback (most recent call last):
  File "/miniconda3/envs/DiT/bin/torchrun", line 33, in <module>
    sys.exit(load_entry_point('torch==1.13.1', 'console_scripts', 'torchrun')())
  File "/miniconda3/envs/DiT/lib/python3.10/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 346, in wrapper
    return f(*args, **kwargs)
  File "/miniconda3/envs/DiT/lib/python3.10/site-packages/torch/distributed/run.py", line 762, in main
    run(args)
  File "/miniconda3/envs/DiT/lib/python3.10/site-packages/torch/distributed/run.py", line 753, in run
    elastic_launch(
  File "/miniconda3/envs/DiT/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 132, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
  File "/miniconda3/envs/DiT/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 246, in launch_agent
    raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:

I have not experienced this problem when training with 1, 2, or 3 GPUs.

I apologize for my lack of experience in this area, but could you please provide any insights or guidance to help me resolve this issue? Thank you for your assistance.
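In case it helps with debugging, more verbose NCCL / torch.distributed logging can be enabled before the run. These are standard PyTorch/NCCL environment variables, nothing specific to this repo; a sketch, assuming they are set before any distributed initialization:

# At the very top of train.py, before torch.distributed is initialized:
import os
os.environ.setdefault("NCCL_DEBUG", "INFO")                 # per-rank NCCL logs
os.environ.setdefault("TORCH_DISTRIBUTED_DEBUG", "DETAIL")  # extra DDP consistency checks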

NathanYanJing changed the title from "DDP stucks somewhere" to "training gets frozen while using multiple-GPUs" on Feb 25, 2023
@NathanYanJing (Author) commented Feb 26, 2023

Problem solved for now! In case anyone else runs into a similar issue:

If you are using a single node with multiple GPUs, a hacky workaround is to replace DDP with DataParallel:

from torch.nn import DataParallel as DDP

Alternatively, you can try

torch.multiprocessing.set_start_method('spawn', force=True)

but you may then need to rewrite the lambda functions to avoid pickling issues.
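A minimal sketch of the two workarounds above (pick one or the other; the caveats are the ones raised in this thread, and this is the hack described here, not the repo's intended multi-GPU setup):

# Option 1 (hack): alias DataParallel as DDP so the rest of train.py is unchanged.
# Caveat from the reply below: DataParallel is single-process, so code in
# train.py that calls torch.distributed collectives may no longer work.
from torch.nn import DataParallel as DDP

# Option 2: keep real DDP but force the 'spawn' start method for worker
# processes; lambda transforms may then need to become picklable functions.
import torch.multiprocessing as mp
mp.set_start_method('spawn', force=True)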

@wpeebles (Contributor) commented Feb 26, 2023

Hi @NathanYanJing. Your torchrun command runs fine for me without any modifications to the code (also on a single-node, multi-GPU setup), and I haven't run into the error you're getting before. Depending on how you're launching the script, you might want to be careful with the DDP --> DataParallel change, since it could alter the behavior of the parts of train.py that rely on distributed ops (in general, I'm not sure DataParallel plays nicely with torch.distributed).

@NathanYanJing (Author)

Hi @wpeebles, thanks for your reply! Yeah, I agree that using torch.distributed is the better choice.

Yes, it seems the problem has somehow come back; it now hangs in the DataLoader part. I am guessing this is probably an NCCL / NVIDIA version issue. Would you mind sharing your NCCL and CUDA versions?
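(For reference, the versions a given PyTorch build ships with can be printed directly; these are standard torch attributes, nothing specific to this repo:)

import torch
print("torch:", torch.__version__)
print("CUDA (build):", torch.version.cuda)
print("cuDNN:", torch.backends.cudnn.version())
print("NCCL:", torch.cuda.nccl.version())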

NathanYanJing reopened this on Feb 26, 2023
@sancunluomu

Hi, I encountered a similar issue. I hit this error at model = DDP(model, device_ids=[rank]) and at dist.all_reduce(avg_loss, op=dist.ReduceOp.SUM):

ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: -11) local_rank: 0 (pid: 1289668) of binary: /opt/conda/bin/python
Traceback (most recent call last):
  File "/opt/conda/bin/torchrun", line 33, in <module>
    sys.exit(load_entry_point('torch==2.0.1+cu118', 'console_scripts', 'torchrun')())
  File "/opt/conda/lib/python3.8/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 346, in wrapper
    return f(*args, **kwargs)
  File "/opt/conda/lib/python3.8/site-packages/torch/distributed/run.py", line 794, in main
    run(args)
  File "/opt/conda/lib/python3.8/site-packages/torch/distributed/run.py", line 785, in run
    elastic_launch(
  File "/opt/conda/lib/python3.8/site-packages/torch/distributed/launcher/api.py", line 134, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
  File "/opt/conda/lib/python3.8/site-packages/torch/distributed/launcher/api.py", line 250, in launch_agent
    raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:

After following your method and using from torch.nn import DataParallel as DDP, I was able to resolve the first error, but the second one (at dist.all_reduce) still persists. Do you have any suggestions for this situation? Thanks for your reply!
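(A hedged sketch, not the repo's code: with DataParallel there is only one process and usually no process group, so one way to keep the quoted all_reduce line from crashing is to guard it; avg_loss below is a placeholder standing in for the tensor in the quoted snippet.)

import torch
import torch.distributed as dist

avg_loss = torch.tensor(0.0)  # placeholder for the loss tensor in the quoted line

# Only run the collective when a process group actually exists; with a
# single DataParallel process, avg_loss already holds the value we want.
if dist.is_available() and dist.is_initialized():
    dist.all_reduce(avg_loss, op=dist.ReduceOp.SUM)
    avg_loss = avg_loss / dist.get_world_size()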
