This repository was archived by the owner on Sep 19, 2022. It is now read-only.

Pytorch workers keep crashing if master is not up yet. #125

@TimZaman

I've been observing this for a while, and I'm now confident it happens consistently (for me):

  • Create a PyTorch job:
    A) If the master is up first, things go well and no pods crash.
    B) If the master is not up first, the worker pods keep crashing with the error below until the master pod is up, after which things run fine:
 kubectl logs optimizer-worker-0
2019-01-11 05:43:30,310 INFO     main(rmq_host=rmq.default.svc.cluster.local, rmq_port=5672, batch_size=12)
2019-01-11 05:43:30,310 INFO     init_distribution
Traceback (most recent call last):
  File "optimizer.py", line 459, in <module>
    pretrained_model=args.pretrained_model,
  File "optimizer.py", line 422, in main
    init_distribution()
  File "optimizer.py", line 413, in init_distribution
    torch.distributed.init_process_group(backend=backend)
  File "/root/.local/lib/python3.7/site-packages/torch/distributed/distributed_c10d.py", line 354, in init_process_group
    store, rank, world_size = next(rendezvous(url))
  File "/root/.local/lib/python3.7/site-packages/torch/distributed/rendezvous.py", line 143, in _env_rendezvous_handler
    store = TCPStore(master_addr, master_port, start_daemon)
ValueError: host not found: Name or service not known

This occurs because PyTorch workers need the master to be up so they can connect to it. If they cannot reach the master (here its hostname does not resolve yet, hence "host not found"), they die. This is intended behaviour and has nothing to do with the pytorch operator or K8s. However, I would expect the pytorch operator to handle this correctly and bring the master up before the workers.
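
Until the operator orders pod startup itself, one possible worker-side workaround is to wait for the master's hostname to resolve before calling init_process_group. The following is only a minimal sketch under assumptions: wait_for_master and its timeout/interval defaults are hypothetical names and values, not part of PyTorch or the operator, and the "gloo" default backend is a placeholder for whatever optimizer.py actually passes. It relies on the env:// rendezvous seen in the traceback, which reads MASTER_ADDR and MASTER_PORT from the environment. It only checks DNS resolution, not that the master is already listening.

```python
import os
import socket
import time


def wait_for_master(host, port, timeout=300.0, interval=2.0,
                    resolve=socket.getaddrinfo, sleep=time.sleep):
    """Hypothetical helper: block until `host` resolves, else raise TimeoutError.

    `resolve` and `sleep` are injectable so the retry loop can be tested
    without real DNS lookups or real waiting.
    """
    deadline = time.monotonic() + timeout
    while True:
        try:
            resolve(host, port)
            return
        except socket.gaierror as exc:
            # Name not known yet, e.g. the master pod/service does not exist.
            if time.monotonic() >= deadline:
                raise TimeoutError(
                    f"master {host}:{port} not resolvable after {timeout}s"
                ) from exc
            sleep(interval)


def init_distribution(backend="gloo"):  # backend default is an assumption
    # Imported here so the helper above stays usable without torch installed.
    import torch.distributed as dist

    # The env:// rendezvous handler reads these two variables.
    wait_for_master(os.environ["MASTER_ADDR"], int(os.environ["MASTER_PORT"]))
    dist.init_process_group(backend=backend)
```

With this in place, a worker started before the master would sleep and retry instead of exiting, so the pod would not enter a crash loop. It does not replace the ordering fix requested from the operator.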
