
Commit e2cb357

cbalioglu and Kiuk Chung authored
(torch.distributed) Add torch.distributed.is_torchelastic_launched() util method + make init_method=tcp:// compatible with torchelastic (pytorch#63910) (pytorch#64826)
Summary:
Pull Request resolved: pytorch#63910

Addresses the current issue that `init_method=tcp://` is not compatible with `torch.distributed.run` and `torch.distributed.launch`. When running a training script that initializes the process group with `init_method=tcp://localhost:$port`, as such:

```
$ python -u -m torch.distributed.run --max_restarts 0 --nproc_per_node 1 --nnodes 1 --master_addr $(hostname) --master_port 6000 ~/tmp/test.py
```

an `Address in use` error is raised, since the training script tries to create a TCPStore on port 6000, which is already taken because the elastic agent is already running a TCPStore on that port. For details see: pytorch#63874.

This change does a couple of things:

1. Adds an `is_torchelastic_launched()` check function that users can call in their training scripts to see whether the script was launched via torchelastic.
2. Updates the `torch.distributed` docs page to include the new `is_torchelastic_launched()` function.
3. Makes `init_method=tcp://` torchelastic-compatible by modifying `_tcp_rendezvous_handler` in `torch.distributed.rendezvous` (this is NOT the elastic rendezvous; it is the old rendezvous module, which is slated for deprecation in future releases) to check `is_torchelastic_launched()` AND `torchelastic_use_agent_store()`, and if both are true, only create TCPStore clients (no daemons, not even for rank 0).
4. Adds a set of unit tests to cover the different code paths.

NOTE: the issue mentions that we should fail fast with an assertion on `init_method != env://` when `is_torchelastic_launched()` is `True`. There are three registered init methods in pytorch: `env://`, `tcp://`, and `file://`. Since this diff makes `tcp://` compatible with torchelastic, and I've validated that `file://` is compatible with torchelastic, there is no need to add assertions. I did update the docs to point out that `env://` is the RECOMMENDED init method. We should probably deprecate the other init methods in the future, but that is out of scope for this issue.

Test Plan: Unit tests.

Reviewed By: cbalioglu

Differential Revision: D30529984

fbshipit-source-id: 267aea6d4dad73eb14a2680ac921f210ff547cc5

Co-authored-by: Kiuk Chung <kiuk@fb.com>
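For illustration only (not part of this commit): a minimal sketch of how a training script might use the new check to pick an init method that works both under torchelastic and standalone. The fallback endpoint, the `gloo` backend, and the env-var defaults are placeholder choices for this sketch.

```python
import os

import torch.distributed as dist


def init_distributed():
    if dist.is_torchelastic_launched():
        # Launched via torch.distributed.run / torch.distributed.launch:
        # the elastic agent already exports MASTER_ADDR, MASTER_PORT,
        # RANK and WORLD_SIZE, so env:// (the recommended init method) just works.
        dist.init_process_group(backend="gloo", init_method="env://")
    else:
        # Standalone run: fall back to an explicit tcp:// endpoint
        # (host/port below are placeholders, not values from this commit).
        dist.init_process_group(
            backend="gloo",
            init_method="tcp://localhost:29500",
            rank=int(os.getenv("RANK", "0")),
            world_size=int(os.getenv("WORLD_SIZE", "1")),
        )
```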
1 parent eccdfff commit e2cb357

File tree: 8 files changed, +320 −35 lines changed


docs/source/distributed.rst

Lines changed: 2 additions & 0 deletions
```diff
@@ -180,6 +180,8 @@ joined.
 
 .. autofunction:: is_nccl_available
 
+.. autofunction:: is_torchelastic_launched
+
 --------------------------------------------------------------------------------
 
 Currently three initialization methods are supported:
```
test/distributed/launcher/bin/test_script_init_method.py

Lines changed: 76 additions & 0 deletions

```python
#!/usr/bin/env python3

# Copyright (c) Facebook, Inc. and its affiliates.
# All rights reserved.
#
# This source code is licensed under the BSD-style license found in the
# LICENSE file in the root directory of this source tree.

import argparse
import os

import torch
import torch.distributed as dist
import torch.nn.functional as F


def parse_args():
    parser = argparse.ArgumentParser(description="test script")

    parser.add_argument(
        "--init_method",
        type=str,
        required=True,
        help="init_method to pass to `dist.init_process_group()` (e.g. env://)",
    )
    parser.add_argument(
        "--world_size",
        type=int,
        default=os.getenv("WORLD_SIZE", -1),
        help="world_size to pass to `dist.init_process_group()`",
    )
    parser.add_argument(
        "--rank",
        type=int,
        default=os.getenv("RANK", -1),
        help="rank to pass to `dist.init_process_group()`",
    )

    return parser.parse_args()


def main():
    args = parse_args()

    dist.init_process_group(
        backend="gloo",
        init_method=args.init_method,
        world_size=args.world_size,
        rank=args.rank,
    )

    rank = dist.get_rank()
    world_size = dist.get_world_size()

    # one hot (by rank) tensor of size world_size
    # example:
    # rank 0, world_size 4 => [1, 0, 0, 0]
    # rank 1, world_size 4 => [0, 1, 0, 0]
    # ...
    t = F.one_hot(torch.tensor(rank), num_classes=world_size)

    # after all_reduce t = tensor.ones(size=world_size)
    dist.all_reduce(t)

    # adding all elements in t should equal world_size
    derived_world_size = torch.sum(t).item()
    if derived_world_size != world_size:
        raise RuntimeError(
            f"Wrong world size derived. Expected: {world_size}, Got: {derived_world_size}"
        )

    print("Done")


if __name__ == "__main__":
    main()
```
test/distributed/launcher/bin/test_script_is_torchelastic_launched.py

Lines changed: 42 additions & 0 deletions

```python
#!/usr/bin/env python3

# Copyright (c) Facebook, Inc. and its affiliates.
# All rights reserved.
#
# This source code is licensed under the BSD-style license found in the
# LICENSE file in the root directory of this source tree.

"""
This is a test script that launches as part of the test cases in
run_test.py, to validate the correctness of
the method ``torch.distributed.is_torchelastic_launched()``. To do so,
we run this script with and without torchelastic and validate that the
boolean value written to the out_file is indeed what we expect (e.g.
should be False when not launched with torchelastic, True when launched with)
The script itself is not a test case hence no assertions are made in this script.

see: - test/distributed/launcher/run_test.py#test_is_torchelastic_launched()
     - test/distributed/launcher/run_test.py#test_is_not_torchelastic_launched()
"""
import argparse

import torch.distributed as dist


def parse_args():
    parser = argparse.ArgumentParser(description="test script")
    parser.add_argument(
        "--out_file",
        help="file to write indicating whether this script was launched with torchelastic",
    )
    return parser.parse_args()


def main():
    args = parse_args()
    with open(args.out_file, "w") as out:
        out.write(f"{dist.is_torchelastic_launched()}")


if __name__ == "__main__":
    main()
```

test/distributed/launcher/run_test.py

Lines changed: 117 additions & 0 deletions
```diff
@@ -7,8 +7,10 @@
 # LICENSE file in the root directory of this source tree.
 import multiprocessing as mp
 import os
+import runpy
 import shutil
 import subprocess
+import sys
 import tempfile
 import unittest
 import uuid
@@ -21,6 +23,7 @@
 from torch.distributed.elastic.multiprocessing.errors import ChildFailedError
 from torch.distributed.elastic.rendezvous.etcd_server import EtcdServer
 from torch.distributed.elastic.utils import get_socket_with_port
+from torch.distributed.elastic.utils.distributed import get_free_port
 from torch.testing._internal.common_utils import (
     TEST_WITH_ASAN,
     TEST_WITH_TSAN,
@@ -476,3 +479,117 @@ def test_launch_shutdown(self, agent_mock_cls):
         param_mock.return_value = rdzv_handler_mock
         launch.main(args)
         rdzv_handler_mock.shutdown.assert_called_once()
+
+    @sandcastle_skip_if(TEST_WITH_DEV_DBG_ASAN, "test incompatible with dev/dbg asan")
+    def test_is_torchelastic_launched(self):
+        # launch test script with torchelastic and validate that
+        # torch.distributed.is_torchelastic_launched() returns True
+
+        out_file = f"{os.path.join(self.test_dir, 'out')}"
+
+        launch.main(
+            [
+                "--run_path",
+                "--nnodes=1",
+                "--nproc_per_node=1",
+                "--monitor_interval=1",
+                path("bin/test_script_is_torchelastic_launched.py"),
+                f"--out_file={out_file}",
+            ]
+        )
+
+        with open(out_file, "r") as fp:
+            is_torchelastic_launched = fp.readline()
+            self.assertEqual("True", is_torchelastic_launched)
+
+    def test_is_not_torchelastic_launched(self):
+        # launch test script without torchelastic and validate that
+        # torch.distributed.is_torchelastic_launched() returns False
+
+        out_file = f"{os.path.join(self.test_dir, 'out')}"
+
+        # need to run the script with runpy in the same interpreter
+        # as the test because otherwise (depending on the environment)
+        # it will not find torch as a dependency
+        with patch.object(
+            sys,
+            "argv",
+            [
+                path("bin/test_script_is_torchelastic_launched.py"),
+                f"--out_file={out_file}",
+            ],
+        ):
+            runpy.run_path(sys.argv[0], run_name="__main__")
+            with open(out_file, "r") as fp:
+                is_torchelastic_launched = fp.readline()
+                self.assertEqual("False", is_torchelastic_launched)
+
+    def test_init_method_tcp(self):
+        port = get_free_port()
+        with patch.object(
+            sys,
+            "argv",
+            [
+                path("bin/test_script_init_method.py"),
+                f"--init_method=tcp://localhost:{port}",
+                "--rank=0",
+                "--world_size=1",
+            ],
+        ):
+            runpy.run_path(sys.argv[0], run_name="__main__")
+        # nothing to validate, just make sure it runs
+
+    @sandcastle_skip_if(TEST_WITH_DEV_DBG_ASAN, "test incompatible with dev/dbg asan")
+    def test_init_method_tcp_with_torchelastic(self):
+        port = get_free_port()
+        launch.main(
+            [
+                "--run_path",
+                "--nnodes=1",
+                "--nproc_per_node=4",
+                "--master_addr=localhost",
+                f"--master_port={port}",
+                "--monitor_interval=1",
+                path("bin/test_script_init_method.py"),
+                f"--init_method=tcp://localhost:{port}",
+            ]
+        )
+        # nothing to validate, just make sure it runs
+
+    def test_init_method_env(self):
+        port = get_free_port()
+        with patch.dict(
+            os.environ,
+            {
+                "RANK": "0",
+                "WORLD_SIZE": "1",
+                "MASTER_ADDR": "localhost",
+                "MASTER_PORT": str(port),
+            },
+        ), patch.object(
+            sys,
+            "argv",
+            [
+                path("bin/test_script_init_method.py"),
+                "--init_method=env://",
+            ],
+        ):
+            runpy.run_path(sys.argv[0], run_name="__main__")
+        # nothing to validate, just make sure it runs
+
+    @sandcastle_skip_if(TEST_WITH_DEV_DBG_ASAN, "test incompatible with dev/dbg asan")
+    def test_init_method_env_with_torchelastic(self):
+        port = get_free_port()
+        launch.main(
+            [
+                "--run_path",
+                "--nnodes=1",
+                "--nproc_per_node=4",
+                "--master_addr=localhost",
+                f"--master_port={port}",
+                "--monitor_interval=1",
+                path("bin/test_script_init_method.py"),
+                "--init_method=env://",
+            ]
+        )
+        # nothing to validate, just make sure it runs
```

torch/distributed/distributed_c10d.py

Lines changed: 16 additions & 2 deletions
```diff
@@ -1,5 +1,6 @@
 import contextlib
 import logging
+import os
 import pickle
 import io
 import torch
@@ -15,21 +16,22 @@
 from .constants import default_pg_timeout
 from .rendezvous import rendezvous, register_rendezvous_handler  # noqa: F401
 from torch._C._distributed_c10d import (
-    AllreduceOptions,
     AllreduceCoalescedOptions,
+    AllreduceOptions,
     AllToAllOptions,
     BarrierOptions,
     BroadcastOptions,
     GatherOptions,
     PrefixStore,
     ProcessGroup,
-    ReduceOptions,
     ReduceOp,
+    ReduceOptions,
     ReduceScatterOptions,
     ScatterOptions,
     Store,
 )
 
+
 _MPI_AVAILABLE = True
 _NCCL_AVAILABLE = True
 _GLOO_AVAILABLE = True
@@ -350,6 +352,18 @@ def is_initialized():
     return GroupMember.WORLD is not None
 
 
+def is_torchelastic_launched():
+    """
+    Checks whether this process was launched with ``torch.distributed.elastic``
+    (aka torchelastic). The existence of ``TORCHELASTIC_RUN_ID`` environment
+    variable is used as a proxy to determine whether the current process
+    was launched with torchelastic. This is a reasonable proxy since
+    ``TORCHELASTIC_RUN_ID`` maps to the rendezvous id which is always a
+    non-null value indicating the job id for peer discovery purposes.
+    """
+    return os.getenv("TORCHELASTIC_RUN_ID") is not None
+
+
 def _get_default_group():
     """
     Getting the default process group created by init_process_group
```
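Since the new check is just an environment-variable lookup, its behavior can be sketched directly; the run id value below is an arbitrary placeholder, not something torchelastic would actually set:

```python
import os

import torch.distributed as dist

# Without the elastic agent the variable is unset, so the check returns False.
os.environ.pop("TORCHELASTIC_RUN_ID", None)
assert dist.is_torchelastic_launched() is False

# torch.distributed.run exports TORCHELASTIC_RUN_ID (the rendezvous/job id);
# simulating that here with a placeholder value flips the check to True.
os.environ["TORCHELASTIC_RUN_ID"] = "placeholder_run_id"
assert dist.is_torchelastic_launched() is True
```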

torch/distributed/launch.py

Lines changed: 6 additions & 4 deletions
```diff
@@ -97,9 +97,9 @@
     >>> # your code to run
 
 3. In your training program, you are supposed to call the following function
-   at the beginning to start the distributed backend. You need to make sure that
-   the init_method uses ``env://``, which is the only supported ``init_method``
-   by this module.
+   at the beginning to start the distributed backend. It is strongly recommended
+   that ``init_method=env://``. Other init methods (e.g. ``tcp://``) may work,
+   but ``env://`` is the one that is officially supported by this module.
 
 ::
 
@@ -147,6 +147,7 @@
 
 from torch.distributed.run import get_args_parser, run
 
+
 logger = logging.getLogger(__name__)
 
 
@@ -181,7 +182,8 @@ def main(args=None):
         "If your script expects `--local_rank` argument to be set, please\n"
         "change it to read from `os.environ['LOCAL_RANK']` instead. See \n"
         "https://pytorch.org/docs/stable/distributed.html#launch-utility for \n"
-        "further instructions\n", FutureWarning
+        "further instructions\n",
+        FutureWarning,
     )
     args = parse_args(args)
     launch(args)
```
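To make the updated guidance concrete, here is a minimal sketch (script name, backend, and the toy all-reduce are illustrative, not part of this commit) of a training program that relies on `env://` and reads `LOCAL_RANK` from the environment rather than a `--local_rank` argument:

```python
# my_trainer.py (illustrative name); launched with, for example:
#   python -m torch.distributed.run --nproc_per_node=2 my_trainer.py
import os

import torch
import torch.distributed as dist


def main():
    # env:// picks up MASTER_ADDR, MASTER_PORT, RANK and WORLD_SIZE
    # from the environment variables exported by the launcher.
    dist.init_process_group(backend="gloo", init_method="env://")

    # Read the local rank from the environment instead of --local_rank.
    local_rank = int(os.environ["LOCAL_RANK"])

    # Toy collective: each rank contributes its rank; the all-reduce sums them.
    t = torch.zeros(1) + dist.get_rank()
    dist.all_reduce(t)
    print(f"local_rank={local_rank} rank={dist.get_rank()} sum_of_ranks={t.item()}")


if __name__ == "__main__":
    main()
```

Each worker gets its rendezvous information from the launcher's environment, which is exactly the setup the module docstring above recommends.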
