[ci] Flaky Hexagon RPC server in tests #13205

Closed
driazati opened this issue Oct 26, 2022 · 13 comments
Labels
test: flaky type:ci Relates to TVM CI infrastructure

Comments

@driazati
Member

driazati commented Oct 26, 2022

Seen on main in https://ci.tlcpack.ai/blue/organizations/jenkins/tvm/detail/main/4580/tests/. The RPC server name is the same for every failing test, and the tests all failed on the same shard.

failed on setup with "RuntimeError: Cannot request hexagon-dev.5788 after 5 retry, last_error:Traceback (most recent call last):
  5: TVMFuncCall
        at /workspace/src/runtime/c_runtime_api.cc:477
  4: tvm::runtime::PackedFuncObj::CallPacked(tvm::runtime::TVMArgs, tvm::runtime::TVMRetValue*) const
        at /workspace/include/tvm/runtime/packed_func.h:1217
  3: tvm::runtime::PackedFuncObj::Extractor<tvm::runtime::PackedFuncSubObj<tvm::runtime::$_0> >::Call(tvm::runtime::PackedFuncObj const*, tvm::runtime::TVMArgs, tvm::runtime::TVMRetValue*)
        at /workspace/include/tvm/runtime/packed_func.h:1213
  2: tvm::runtime::$_0::operator()(tvm::runtime::TVMArgs, tvm::runtime::TVMRetValue*) const
        at /workspace/src/runtime/rpc/rpc_socket_impl.cc:132
  1: tvm::runtime::RPCClientConnect(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, int, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, bool, tvm::runtime::TVMArgs)
        at /workspace/src/runtime/rpc/rpc_socket_impl.cc:112
  0: tvm::runtime::RPCConnect(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, int, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, bool, tvm::runtime::TVMArgs)
        at /workspace/src/runtime/rpc/rpc_socket_impl.cc:72
  File "/workspace/src/runtime/rpc/rpc_socket_impl.cc", line 72
TVMError: 
---------------------------------------------------------------
An error occurred during the execution of TVM.
For more information, please see: https://tvm.apache.org/docs/errors.html
---------------------------------------------------------------
  Check failed: (sock.Connect(addr)) is false: Connect to 127.0.0.1:65535 failed"
Stacktrace
request = <FixtureRequest for <Function test_reduce_map[in_shape0-0-False-argmax-float32]>>
    def fill(request):
        item = request._pyfuncitem
        fixturenames = getattr(item, "fixturenames", None)
        if fixturenames is None:
            fixturenames = request.fixturenames
    
        if hasattr(item, 'callspec'):
            for param, val in sorted_by_dependency(item.callspec.params, fixturenames):
                if val is not None and is_lazy_fixture(val):
                    item.callspec.params[param] = request.getfixturevalue(val.name)
                elif param not in item.funcargs:
>                   item.funcargs[param] = request.getfixturevalue(param)
/venv/apache-tvm-py3.8/lib/python3.8/site-packages/pytest_lazyfixture.py:37: 
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 
python/tvm/contrib/hexagon/pytest_plugin.py:278: in hexagon_session
    with hexagon_launcher.create_session() as session:
python/tvm/contrib/hexagon/session.py:109: in __enter__
    raise exception
python/tvm/contrib/hexagon/session.py:92: in __enter__
    self._rpc = tracker.request(
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 
self = <tvm.rpc.client.TrackerSession object at 0x7f14670bc880>
key = 'hexagon-dev.5788', priority = 0, session_timeout = 0, max_retry = 5
session_constructor_args = ['tvm.contrib.hexagon.create_hexagon_session', 'hexagon-rpc', 262144, '', 268435456]
    def request(
        self, key, priority=1, session_timeout=0, max_retry=5, session_constructor_args=None
    ):
        """Request a new connection from the tracker.
    
        Parameters
        ----------
        key : str
            The type key of the device.
    
        priority : int, optional
            The priority of the request.
    
        session_timeout : float, optional
            The duration of the session, allows server to kill
            the connection when duration is longer than this value.
            When duration is zero, it means the request must always be kept alive.
    
        max_retry : int, optional
            Maximum number of times to retry before give up.
    
        session_constructor_args : list, optional
            List of additional arguments to passed as the remote session constructor.
            The first element of the list is always a string specifying the name of
            the session constructor, the following args are the positional args to that function.
        """
        last_err = None
        for _ in range(max_retry):
            try:
                if self._sock is None:
                    self._connect()
                base.sendjson(self._sock, [base.TrackerCode.REQUEST, key, "", priority])
                value = base.recvjson(self._sock)
                if value[0] != base.TrackerCode.SUCCESS:
                    raise RuntimeError("Invalid return value %s" % str(value))
                url, port, matchkey = value[1]
                return connect(
                    url,
                    port,
                    matchkey,
                    session_timeout,
                    session_constructor_args=session_constructor_args,
                )
            except socket.error as err:
                self.close()
                last_err = err
            except TVMError as err:
                last_err = err
>       raise RuntimeError(
            "Cannot request %s after %d retry, last_error:%s" % (key, max_retry, str(last_err))
        )
E       RuntimeError: Cannot request hexagon-dev.5788 after 5 retry, last_error:Traceback (most recent call last):
E         5: TVMFuncCall
E               at /workspace/src/runtime/c_runtime_api.cc:477
E         4: tvm::runtime::PackedFuncObj::CallPacked(tvm::runtime::TVMArgs, tvm::runtime::TVMRetValue*) const
E               at /workspace/include/tvm/runtime/packed_func.h:1217
E         3: tvm::runtime::PackedFuncObj::Extractor<tvm::runtime::PackedFuncSubObj<tvm::runtime::$_0> >::Call(tvm::runtime::PackedFuncObj const*, tvm::runtime::TVMArgs, tvm::runtime::TVMRetValue*)
E               at /workspace/include/tvm/runtime/packed_func.h:1213
E         2: tvm::runtime::$_0::operator()(tvm::runtime::TVMArgs, tvm::runtime::TVMRetValue*) const
E               at /workspace/src/runtime/rpc/rpc_socket_impl.cc:132
E         1: tvm::runtime::RPCClientConnect(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, int, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, bool, tvm::runtime::TVMArgs)
E               at /workspace/src/runtime/rpc/rpc_socket_impl.cc:112
E         0: tvm::runtime::RPCConnect(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, int, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, bool, tvm::runtime::TVMArgs)
E               at /workspace/src/runtime/rpc/rpc_socket_impl.cc:72
E         File "/workspace/src/runtime/rpc/rpc_socket_impl.cc", line 72
E       TVMError: 
E       ---------------------------------------------------------------
E       An error occurred during the execution of TVM.
E       For more information, please see: https://tvm.apache.org/docs/errors.html
E       ---------------------------------------------------------------
E         Check failed: (sock.Connect(addr)) is false: Connect to 127.0.0.1:65535 failed
python/tvm/rpc/client.py:416: RuntimeError

cc @Mousius @areusch @gigiblender @mehrdadh

@driazati driazati added test: flaky type:ci Relates to TVM CI infrastructure labels Oct 26, 2022
@mehrdadh
Member

cc @kparzysz-quic

@mehrdadh
Member

@driazati Looks like it is trying to connect to 127.0.0.1:65535.
Is this port accessible on our CI?

@driazati
Member Author

If it's all within the same container it should be, since it's just on the loopback address. Is there any kind of random port selection or generation in the setup that's missing a random.seed(0), so these choices could be stabilized?

@mehrdadh
Member

@mehrdadh
Member

@driazati but this function shouldn't allow 65535 to be generated, since the port is bounded between the MIN and MAX port
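
For illustration, a bounded random port picker of the kind discussed here could look roughly like the sketch below. This is not the TVM function; `PORT_MIN`, `PORT_MAX`, `pick_random_port`, and the bounds themselves are hypothetical names and values. The point is only that a draw from `randint(PORT_MIN, PORT_MAX)` cannot produce 65535 when `PORT_MAX` is lower than that.

```python
import random
import socket

# Hypothetical bounds, chosen for illustration only.
PORT_MIN = 2000
PORT_MAX = 9000


def pick_random_port(rng: random.Random, attempts: int = 50) -> int:
    """Return a port in [PORT_MIN, PORT_MAX] that could be bound on loopback."""
    for _ in range(attempts):
        # randint is inclusive on both ends, so the result stays within the bounds.
        port = rng.randint(PORT_MIN, PORT_MAX)
        with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as sock:
            try:
                sock.bind(("127.0.0.1", port))
            except OSError:
                continue  # port is taken, draw again
        return port
    raise RuntimeError(
        f"no free port in [{PORT_MIN}, {PORT_MAX}] after {attempts} attempts"
    )
```

Note that the probe socket is closed before the port is returned, so another process could still grab the port before the caller binds it.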

@mehrdadh
Member

@driazati I think this function has a bug. I'll send a PR to fix it

@mehrdadh
Member

#13207 might fix this issue

@supersat
Contributor

Disabling random port numbers is a big productivity killer since I can no longer re-run pytest while the port is still in TIME_WAIT

@driazati
Member Author

How about we seed only if tvm.testing.IS_IN_CI is set, to cover both cases?

@mehrdadh
Member

@supersat I think it's not disabled; it will just start from the same port and increase the port number until it finds a valid port. wdyt?
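
A rough sketch of that probing behavior, assuming a plain bind test on the loopback address. This is not the actual TVM code, and `first_free_port` is a made-up name.

```python
import socket


def first_free_port(start: int, stop: int = 65535) -> int:
    """Probe start, start + 1, ... and return the first port that binds."""
    for port in range(start, stop + 1):
        with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as sock:
            try:
                sock.bind(("127.0.0.1", port))
            except OSError:
                continue  # in use (or otherwise unavailable), try the next one
        return port
    raise RuntimeError(f"no free port in [{start}, {stop}]")
```

Because the starting port is fixed, two runs try the same ports in the same order.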

@supersat
Contributor

@mehrdadh I would say it's effectively disabled because it always tries the same sequence of port numbers. The tracker does come up on the next available port if the one it tries is taken, but then the launcher tries to get adb to forward a bunch of different ports, which are all in use, and that causes tests to fail.

Seeding the RNG with a fixed value only in CI (with tvm.testing.IS_IN_CI) is probably the right approach.
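
Something along these lines, as a minimal sketch only. `make_port_rng` is a hypothetical helper, the `CI` environment variable check is an assumed stand-in for `tvm.testing.IS_IN_CI`, and the port bounds are illustrative.

```python
import os
import random


def make_port_rng(in_ci: bool) -> random.Random:
    """Fixed seed in CI for reproducible choices; OS entropy otherwise."""
    return random.Random(0) if in_ci else random.Random()


# Stand-in for tvm.testing.IS_IN_CI; the env var name and value are assumptions.
IN_CI = os.environ.get("CI", "") == "true"
rng = make_port_rng(IN_CI)
port = rng.randint(2000, 9000)  # illustrative bounds, not the real MIN/MAX
```

Using a dedicated `random.Random` instance keeps the seed from affecting any other code that relies on the global `random` state.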

@mehrdadh
Member

mehrdadh commented Nov 4, 2022

@driazati can we close this?

@driazati
Member Author

driazati commented Nov 4, 2022

the immediate issue is fixed so sure

@driazati driazati closed this as completed Nov 4, 2022