Description
A `Socket Timeout` error is raised when training a PPO model across multiple machines.
The `accelerate` config (yaml):

```yaml
compute_environment: LOCAL_MACHINE
debug: false
deepspeed_config:
  deepspeed_multinode_launcher: standard
  offload_optimizer_device: none
  offload_param_device: none
  zero3_init_flag: true
  zero3_save_16bit_model: true
  zero_stage: 3
distributed_type: DEEPSPEED
downcast_bf16: 'no'
machine_rank: 0
main_training_function: main
mixed_precision: bf16
num_machines: 2
num_processes: 8
rdzv_backend: static
same_network: true
tpu_env: []
tpu_use_cluster: false
tpu_use_sudo: false
use_cpu: false
```
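For reference, a two-machine run with this config is launched once per node, with only `machine_rank` differing between the two invocations. This is a sketch; the config file name, training script name, and the main node's IP/port are placeholders, not taken from the report:

```shell
# On machine 0 (the main node; its IP/port must be reachable from machine 1):
accelerate launch --config_file default_config.yaml \
    --machine_rank 0 --main_process_ip 10.0.0.1 --main_process_port 29500 \
    train_ppo.py

# On machine 1, run the same command with --machine_rank 1.
```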
The error:

```
  File "/home/ma-user/.local/lib/python3.10/site-packages/accelerate/commands/accelerate_cli.py", line 48, in main
    args.func(args)
  File "/home/ma-user/.local/lib/python3.10/site-packages/accelerate/commands/launch.py", line 1159, in launch_command
    deepspeed_launcher(args)
  File "/home/ma-user/.local/lib/python3.10/site-packages/accelerate/commands/launch.py", line 852, in deepspeed_launcher
    distrib_run.run(args)
  File "/usr/local/lib/python3.10/dist-packages/torch/distributed/run.py", line 803, in run
    elastic_launch(
  File "/usr/local/lib/python3.10/dist-packages/torch/distributed/launcher/api.py", line 135, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
  File "/usr/local/lib/python3.10/dist-packages/torch/distributed/launcher/api.py", line 259, in launch_agent
    result = agent.run()
  File "/usr/local/lib/python3.10/dist-packages/torch/distributed/elastic/metrics/api.py", line 123, in wrapper
    result = f(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/torch/distributed/elastic/agent/server/api.py", line 727, in run
    result = self._invoke_run(role)
  File "/usr/local/lib/python3.10/dist-packages/torch/distributed/elastic/agent/server/api.py", line 862, in _invoke_run
    self._initialize_workers(self._worker_group)
  File "/usr/local/lib/python3.10/dist-packages/torch/distributed/elastic/metrics/api.py", line 123, in wrapper
    result = f(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/torch/distributed/elastic/agent/server/api.py", line 699, in _initialize_workers
    self._rendezvous(worker_group)
  File "/usr/local/lib/python3.10/dist-packages/torch/distributed/elastic/metrics/api.py", line 123, in wrapper
    result = f(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/torch/distributed/elastic/agent/server/api.py", line 545, in _rendezvous
    workers = self._assign_worker_ranks(store, group_rank, group_world_size, spec)
  File "/usr/local/lib/python3.10/dist-packages/torch/distributed/elastic/metrics/api.py", line 123, in wrapper
    result = f(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/torch/distributed/elastic/agent/server/api.py", line 632, in _assign_worker_ranks
    role_infos = self._share_and_gather(store, group_rank, group_world_size, spec)
  File "/usr/local/lib/python3.10/dist-packages/torch/distributed/elastic/agent/server/api.py", line 669, in _share_and_gather
    role_infos_bytes = store_util.synchronize(
  File "/usr/local/lib/python3.10/dist-packages/torch/distributed/elastic/utils/store.py", line 64, in synchronize
    agent_data = get_all(store, rank, key_prefix, world_size)
  File "/usr/local/lib/python3.10/dist-packages/torch/distributed/elastic/utils/store.py", line 34, in get_all
    data = store.get(f"{prefix}{idx}")
torch.distributed.DistStoreError: Socket Timeout
```
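The traceback ends in `store.get(...)`: during rendezvous, each elastic agent writes its role info into a shared `TCPStore` on the main node and then blocks reading the other agents' keys, raising `Socket Timeout` if a peer's key never arrives (e.g. the second machine cannot reach the store). A minimal single-process sketch of that key/value exchange, with a placeholder host, port, and key name:

```python
from datetime import timedelta

import torch.distributed as dist

# Stand up the key/value store that torchelastic rendezvous uses.
# In the failing two-machine setup, this lives on the main node and
# machine_rank 1 must be able to connect to it over the network.
store = dist.TCPStore(
    "127.0.0.1", 29613, world_size=1, is_master=True,
    timeout=timedelta(seconds=30),  # get() raises if a key never appears
)

# Each agent set()s its own key, then get()s every peer's key;
# get() blocks until the key exists or the timeout fires.
store.set("torchelastic/role_info/0", "agent-0")
print(store.get("torchelastic/role_info/0"))
```

If a peer never calls `set()` within the timeout, `get()` fails with the same `Socket Timeout` / store error family seen above, which is why this usually points at connectivity between the two machines rather than at the training code itself.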