InvalidTransition: Impossible transition from memory to missing #6125

Closed
mrocklin opened this issue Apr 14, 2022 · 4 comments

Comments

@mrocklin
Member

import coiled
coiled.create_software_environment(
    name="coiled-runtime-chaos",
    conda={"channels": ["coiled", "conda-forge"], "dependencies": ["coiled-runtime", "coiled=0.0.73"]},
    pip=["git+https://github.com/mrocklin/distributed@chaos"],
)

from coiled._beta import ClusterBeta as Cluster
import dask
from dask.distributed import Client

cluster = Cluster(
    software="coiled-runtime-chaos",
    n_workers=10,
    worker_vm_types=["m5.large"],
    scheduler_vm_types=["m5.large"],
    shutdown_on_close=False,
    name="play",
)
client = Client(cluster)

from distributed.chaos import KillWorker
plugin = KillWorker(delay="10 s", mode="sys.exit")
client.register_worker_plugin(plugin, name="kill")

import dask.array as da
x = da.random.random((50000, 50000))
x.rechunk((50000, 20)).rechunk((20, 50000)).sum().compute()
Apr 14 01:42:23 ip-10-4-8-2 cloud-init[989]: Traceback (most recent call last):
Apr 14 01:42:23 ip-10-4-8-2 cloud-init[989]:   File "/opt/conda/envs/coiled/lib/python3.9/site-packages/distributed/utils.py", line 693, in log_errors
Apr 14 01:42:23 ip-10-4-8-2 cloud-init[989]:     yield
Apr 14 01:42:23 ip-10-4-8-2 cloud-init[989]:   File "/opt/conda/envs/coiled/lib/python3.9/site-packages/distributed/worker.py", line 3094, in gather_dep
Apr 14 01:42:23 ip-10-4-8-2 cloud-init[989]:     self.transitions(recommendations, stimulus_id=stimulus_id)
Apr 14 01:42:23 ip-10-4-8-2 cloud-init[989]:   File "/opt/conda/envs/coiled/lib/python3.9/site-packages/distributed/worker.py", line 2607, in transitions
Apr 14 01:42:23 ip-10-4-8-2 cloud-init[989]:     a_recs, a_instructions = self._transition(
Apr 14 01:42:23 ip-10-4-8-2 cloud-init[989]:   File "/opt/conda/envs/coiled/lib/python3.9/site-packages/distributed/worker.py", line 2543, in _transition
Apr 14 01:42:23 ip-10-4-8-2 cloud-init[989]:     raise InvalidTransition(
Apr 14 01:42:23 ip-10-4-8-2 cloud-init[989]: distributed.worker_state_machine.InvalidTransition: Impossible transition from memory to missing for ('rechunk-split-a73e77c2dac2f625e22767d7c04cbe17', 1754)

cc @fjetter @gjoseph92
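
For context on the error itself: the traceback shows `_transition` raising `InvalidTransition` when asked to move a task from `memory` to `missing`. The sketch below is a toy model, not the `distributed` implementation. `ToyWorkerState`, its `ALLOWED` table, and the shortened key are assumptions made up for illustration. It only shows how a table-driven state machine rejects a (start, finish) pair it has no entry for, which is the shape of the failure reported here.

```python
class InvalidTransition(Exception):
    """Raised when no transition is defined for a (start, finish) pair."""


class ToyWorkerState:
    # Hypothetical table of allowed transitions; the real worker's
    # table is larger and is not reproduced here.
    ALLOWED = {
        ("released", "fetch"),
        ("fetch", "flight"),
        ("flight", "memory"),
        ("flight", "missing"),
        ("missing", "fetch"),
    }

    def __init__(self):
        self.states = {}  # task key -> current state name

    def transition(self, key, finish):
        start = self.states.get(key, "released")
        if (start, finish) not in self.ALLOWED:
            raise InvalidTransition(
                f"Impossible transition from {start} to {finish} for {key}"
            )
        self.states[key] = finish


ws = ToyWorkerState()
key = ("rechunk-split-example", 1754)  # shortened, made-up key
for state in ("fetch", "flight", "memory"):
    ws.transition(key, state)

try:
    # A late "missing" recommendation for a key that is already in memory.
    ws.transition(key, "missing")
except InvalidTransition as exc:
    message = str(exc)
    print(message)
```

Whether the real failure follows this exact sequence is not established in this thread; the sketch only illustrates the class of error.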

@gjoseph92
Collaborator

FWIW I've now seen this in the wild with #6110

2022-04-14 21:17:20,415 - distributed.worker - ERROR - Worker stream died during communication: tls://10.6.5.108:37353
Traceback (most recent call last):
  File "/opt/conda/envs/coiled/lib/python3.9/site-packages/tornado/iostream.py", line 867, in _read_to_buffer
    bytes_read = self.read_from_fd(buf)
  File "/opt/conda/envs/coiled/lib/python3.9/site-packages/tornado/iostream.py", line 1592, in read_from_fd
    return self.socket.recv_into(buf, len(buf))
  File "/opt/conda/envs/coiled/lib/python3.9/ssl.py", line 1241, in recv_into
    return self.read(nbytes, buffer)
  File "/opt/conda/envs/coiled/lib/python3.9/ssl.py", line 1099, in read
    return self._sslobj.read(len, buffer)
ConnectionResetError: [Errno 104] Connection reset by peer
The above exception was the direct cause of the following exception:
Traceback (most recent call last):
  File "/opt/conda/envs/coiled/lib/python3.9/site-packages/distributed/worker.py", line 3019, in gather_dep
    response = await get_data_from_worker(
  File "/opt/conda/envs/coiled/lib/python3.9/site-packages/distributed/worker.py", line 4320, in get_data_from_worker
    return await retry_operation(_get_data, operation="get_data_from_worker")
  File "/opt/conda/envs/coiled/lib/python3.9/site-packages/distributed/utils_comm.py", line 381, in retry_operation
    return await retry(
  File "/opt/conda/envs/coiled/lib/python3.9/site-packages/distributed/utils_comm.py", line 366, in retry
    return await coro()
  File "/opt/conda/envs/coiled/lib/python3.9/site-packages/distributed/worker.py", line 4300, in _get_data
    response = await send_recv(
  File "/opt/conda/envs/coiled/lib/python3.9/site-packages/distributed/core.py", line 709, in send_recv
    response = await comm.read(deserializers=deserializers)
  File "/opt/conda/envs/coiled/lib/python3.9/site-packages/distributed/comm/tcp.py", line 242, in read
    convert_stream_closed_error(self, e)
  File "/opt/conda/envs/coiled/lib/python3.9/site-packages/distributed/comm/tcp.py", line 148, in convert_stream_closed_error
    raise CommClosedError(f"in {obj}: {exc.__class__.__name__}: {exc}") from exc
distributed.comm.core.CommClosedError: in <TLS (closed) Ephemeral Worker->Worker for gather local=tls://10.6.6.70:44414 remote=tls://10.6.5.108:37353>: ConnectionResetError: [Errno 104] Connection reset by peer
2022-04-14 21:17:20,442 - distributed.utils - ERROR - Impossible transition from memory to missing for ('split-shuffle-1-b4961b03aa9e8bec7c581d2dc337f717', 10, (3, 9))
Traceback (most recent call last):
  File "/opt/conda/envs/coiled/lib/python3.9/site-packages/distributed/utils.py", line 693, in log_errors
    yield
  File "/opt/conda/envs/coiled/lib/python3.9/site-packages/distributed/worker.py", line 3094, in gather_dep
    self.transitions(recommendations, stimulus_id=stimulus_id)
  File "/opt/conda/envs/coiled/lib/python3.9/site-packages/distributed/worker.py", line 2607, in transitions
    a_recs, a_instructions = self._transition(
  File "/opt/conda/envs/coiled/lib/python3.9/site-packages/distributed/worker.py", line 2543, in _transition
    raise InvalidTransition(
distributed.worker_state_machine.InvalidTransition: Impossible transition from memory to missing for ('split-shuffle-1-b4961b03aa9e8bec7c581d2dc337f717', 10, (3, 9))
2022-04-14 21:17:20,646 - tornado.application - ERROR - Exception in callback functools.partial(<bound method IOLoop._discard_future_result of <tornado.platform.asyncio.AsyncIOLoop object at 0x7fc44deef3d0>>, <Task finished name='Task-1771' coro=<Worker.gather_dep() done, defined at /opt/conda/envs/coiled/lib/python3.9/site-packages/distributed/worker.py:2963> exception=InvalidTransition("Impossible transition from memory to missing for ('split-shuffle-1-b4961b03aa9e8bec7c581d2dc337f717', 10, (3, 9))")>)
Traceback (most recent call last):
  File "/opt/conda/envs/coiled/lib/python3.9/site-packages/tornado/ioloop.py", line 741, in _run_callback
    ret = callback()
  File "/opt/conda/envs/coiled/lib/python3.9/site-packages/tornado/ioloop.py", line 765, in _discard_future_result
    future.result()
  File "/opt/conda/envs/coiled/lib/python3.9/site-packages/distributed/worker.py", line 3094, in gather_dep
    self.transitions(recommendations, stimulus_id=stimulus_id)
  File "/opt/conda/envs/coiled/lib/python3.9/site-packages/distributed/worker.py", line 2607, in transitions
    a_recs, a_instructions = self._transition(
  File "/opt/conda/envs/coiled/lib/python3.9/site-packages/distributed/worker.py", line 2543, in _transition
    raise InvalidTransition(
distributed.worker_state_machine.InvalidTransition: Impossible transition from memory to missing for ('split-shuffle-1-b4961b03aa9e8bec7c581d2dc337f717', 10, (3, 9))
2022-04-14 21:17:39,338 - distributed.worker - ERROR - Exception during execution of task ('shuffle-1-b4961b03aa9e8bec7c581d2dc337f717', 123).
Traceback (most recent call last):
  File "/opt/conda/envs/coiled/lib/python3.9/site-packages/distributed/worker.py", line 3693, in _prepare_args_for_execution
    data[k] = self.data[k]
  File "/opt/conda/envs/coiled/lib/python3.9/site-packages/zict/buffer.py", line 87, in __getitem__
    raise KeyError(key)
KeyError: "('split-shuffle-1-b4961b03aa9e8bec7c581d2dc337f717', 10, (3, 9))"
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
  File "/opt/conda/envs/coiled/lib/python3.9/site-packages/distributed/worker.py", line 3497, in execute
    args2, kwargs2 = self._prepare_args_for_execution(ts, args, kwargs)
  File "/opt/conda/envs/coiled/lib/python3.9/site-packages/distributed/worker.py", line 3697, in _prepare_args_for_execution
    data[k] = Actor(type(self.actors[k]), self.address, k, self)
KeyError: "('split-shuffle-1-b4961b03aa9e8bec7c581d2dc337f717', 10, (3, 9))"
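
The last traceback above shows two `KeyError`s for the same key: one from `self.data[k]`, then a second from `self.actors[k]` raised while the first was being handled. A minimal sketch of that lookup-with-fallback shape follows. It assumes plain dicts and a hypothetical `prepare_args` helper; it is modeled only on the two source lines visible in the traceback and is not the actual `_prepare_args_for_execution` code.

```python
def prepare_args(keys, data, actors):
    """Collect task inputs, falling back to `actors` when `data` lacks a key."""
    out = {}
    for k in keys:
        try:
            out[k] = data[k]
        except KeyError:
            # If the key is absent here too, this raises a second KeyError
            # while the first is still being handled, so Python reports
            # "During handling of the above exception, another exception occurred".
            out[k] = actors[k]
    return out


key = "('split-shuffle-example', 10, (3, 9))"  # shortened, made-up key
err = None
try:
    prepare_args([key], data={}, actors={})
except KeyError as exc:
    err = exc

# The second KeyError carries the first one as its implicit context.
print(type(err).__name__, type(err.__context__).__name__)
```

In other words, the double `KeyError` is consistent with the dependency being absent from the worker's data when the task ran, which fits the earlier `InvalidTransition` for the same key.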

@mrocklin
Member Author

mrocklin commented Apr 15, 2022 via email

@mrocklin
Member Author

mrocklin commented Apr 15, 2022 via email

@mrocklin
Member Author

I believe that this is resolved by #6123
