
Socket Timeout is raised when PPO is trained on multiple machines #3658


Description

@duyuwen-duen

A Socket Timeout error is raised when PPO is trained on multiple machines.
yaml:
compute_environment: LOCAL_MACHINE
debug: false
deepspeed_config:
  deepspeed_multinode_launcher: standard
  offload_optimizer_device: none
  offload_param_device: none
  zero3_init_flag: true
  zero3_save_16bit_model: true
  zero_stage: 3
distributed_type: DEEPSPEED
downcast_bf16: 'no'
machine_rank: 0
main_training_function: main
mixed_precision: bf16
num_machines: 2
num_processes: 8
rdzv_backend: static
same_network: true
tpu_env: []
tpu_use_cluster: false
tpu_use_sudo: false
use_cpu: false
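
Note that this config does not set the main process address or port. For a two-machine static rendezvous, both nodes usually also need the IP and an open port of the machine_rank 0 node so that the torch.distributed.elastic agents can reach its rendezvous store, and the second node uses machine_rank: 1. A sketch with placeholder values (not taken from the report):

yaml (placeholder values):
main_process_ip: 10.0.0.1   # hypothetical IP of the machine_rank 0 node
main_process_port: 29500    # hypothetical port reachable from the other node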

error:
File "/home/ma-user/.local/lib/python3.10/site-packages/accelerate/commands/accelerate_cli.py", line 48, in main
args.func(args)
File "/home/ma-user/.local/lib/python3.10/site-packages/accelerate/commands/launch.py", line 1159, in launch_command
deepspeed_launcher(args)
File "/home/ma-user/.local/lib/python3.10/site-packages/accelerate/commands/launch.py", line 852, in deepspeed_launcher
distrib_run.run(args)
File "/usr/local/lib/python3.10/dist-packages/torch/distributed/run.py", line 803, in run
elastic_launch(
File "/usr/local/lib/python3.10/dist-packages/torch/distributed/launcher/api.py", line 135, in __call__
return launch_agent(self._config, self._entrypoint, list(args))
File "/usr/local/lib/python3.10/dist-packages/torch/distributed/launcher/api.py", line 259, in launch_agent
result = agent.run()
File "/usr/local/lib/python3.10/dist-packages/torch/distributed/elastic/metrics/api.py", line 123, in wrapper
result = f(*args, **kwargs)
File "/usr/local/lib/python3.10/dist-packages/torch/distributed/elastic/agent/server/api.py", line 727, in run
result = self._invoke_run(role)
File "/usr/local/lib/python3.10/dist-packages/torch/distributed/elastic/agent/server/api.py", line 862, in _invoke_run
self._initialize_workers(self._worker_group)
File "/usr/local/lib/python3.10/dist-packages/torch/distributed/elastic/metrics/api.py", line 123, in wrapper
result = f(*args, **kwargs)
File "/usr/local/lib/python3.10/dist-packages/torch/distributed/elastic/agent/server/api.py", line 699, in _initialize_workers
self._rendezvous(worker_group)
File "/usr/local/lib/python3.10/dist-packages/torch/distributed/elastic/metrics/api.py", line 123, in wrapper
result = f(*args, **kwargs)
File "/usr/local/lib/python3.10/dist-packages/torch/distributed/elastic/agent/server/api.py", line 545, in _rendezvous
workers = self._assign_worker_ranks(store, group_rank, group_world_size, spec)
File "/usr/local/lib/python3.10/dist-packages/torch/distributed/elastic/metrics/api.py", line 123, in wrapper
result = f(*args, **kwargs)
File "/usr/local/lib/python3.10/dist-packages/torch/distributed/elastic/agent/server/api.py", line 632, in _assign_worker_ranks
role_infos = self._share_and_gather(store, group_rank, group_world_size, spec)
File "/usr/local/lib/python3.10/dist-packages/torch/distributed/elastic/agent/server/api.py", line 669, in _share_and_gather
role_infos_bytes = store_util.synchronize(
File "/usr/local/lib/python3.10/dist-packages/torch/distributed/elastic/utils/store.py", line 64, in synchronize
agent_data = get_all(store, rank, key_prefix, world_size)
File "/usr/local/lib/python3.10/dist-packages/torch/distributed/elastic/utils/store.py", line 34, in get_all
data = store.get(f"{prefix}{idx}")
torch.distributed.DistStoreError: Socket Timeout
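
The timeout occurs while the elastic agent reads the other node's entries from the shared TCP store during rendezvous (store.get in torch/distributed/elastic/utils/store.py). That usually means one machine cannot reach the store hosted by machine_rank 0, for example because main_process_ip/main_process_port are missing or wrong, a firewall blocks the port, or num_machines does not match the nodes actually started. A minimal connectivity sketch, assuming a hypothetical address 10.0.0.1 and port 29500 for the machine_rank 0 node; run it on both machines at roughly the same time:

python:
import datetime
from torch.distributed import TCPStore

MASTER_ADDR = "10.0.0.1"  # hypothetical IP of the machine_rank 0 node
MASTER_PORT = 29500       # hypothetical main_process_port

# Set is_master=True on the machine_rank 0 node (it hosts the store) and
# is_master=False on the other node (it connects to that store). A hang or
# timeout on the client side reproduces the same failure mode as the
# rendezvous in the traceback above.
is_master = False  # True on the machine_rank 0 node
store = TCPStore(MASTER_ADDR, MASTER_PORT, world_size=2, is_master=is_master,
                 timeout=datetime.timedelta(seconds=60))
store.set(f"ping_{int(is_master)}", "ok")
# Blocks until the other node has written its key, or times out.
print(store.get(f"ping_{1 - int(is_master)}"))

If this check also times out on the non-master node, the problem is network reachability between the two machines rather than the PPO training script itself.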
