
Conversation

@TomAugspurger
Member

This changes exception handling in `BatchedSend`.

The problem: `BatchedSend.send` is a regular function. What happens if we
add a message to be sent, but that send fails at a later time? In the
current structure, that exception is logged and then lost to the void.

This PR makes things a bit more robust in two ways:

1. For unexpected errors, we retry a few times. In
   pangeo-data/pangeo#788 ("Dask tasks not being shared among workers"),
   manual retries did fix the issue, so we try that here automatically.
2. For CommClosedErrors or repeated unexpected errors, we close the
   BatchedSend object. By closing the BatchedSend after an exception,
   *subsequent* `BatchedSend.send` calls will fail (see the sketch below).

We still face the issue of a potential exception happening in the
background that *isn't* followed by another `BatchedSend.send` failing.
But I think that's unavoidable given the current design.

I think this better addresses the root problem than #4128. #4128 may still be useful, but I'll revisit it after testing this out.
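A minimal sketch of the retry-then-close pattern described above (hypothetical names and retry budget; not the actual implementation in `distributed.batched`):

```python
import asyncio
import logging

logger = logging.getLogger(__name__)


class CommClosedError(Exception):
    """Stand-in for distributed.comm.core.CommClosedError."""


class SketchBatchedSend:
    """Illustrative only: buffer messages and flush them in the background."""

    max_retries = 3  # hypothetical retry budget for unexpected errors

    def __init__(self, comm, interval=0.01):
        self.comm = comm
        self.interval = interval
        self.buffer = []
        self.closed = False

    def send(self, msg):
        # A regular (non-async) method: failures surface later, in the
        # background loop, unless the object has already been closed.
        if self.closed:
            raise CommClosedError("BatchedSend already closed")
        self.buffer.append(msg)

    async def _background_send(self):
        retries = 0
        while not self.closed:
            await asyncio.sleep(self.interval)
            if not self.buffer:
                continue
            payload, self.buffer = self.buffer, []
            try:
                await self.comm.write(payload)
                retries = 0
            except CommClosedError:
                self.closed = True  # make subsequent send() calls fail
            except Exception:
                retries += 1
                logger.warning("Error in batched write, retrying")
                self.buffer = payload + self.buffer  # retry these messages
                if retries >= self.max_retries:
                    self.closed = True
```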

The review comment below refers to this excerpt from the new test:

    # The reported issues (https://github.com/tornadoweb/tornado/pull/2008)
    # claim that the BufferError *should* only happen when the application
    # is incorrectly using threads. I haven't been able to construct an
@TomAugspurger
Member Author

TomAugspurger commented Sep 28, 2020


tornadoweb/tornado#2008 (comment) may actually be relevant here: StackImpact was calling sys._current_frames(). I wonder if our profiler might be causing similar issues? Let me try playing with that. That was apparently from a signal handler, though.
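For context, a statistical profiler along these lines reads frames from other threads roughly like this (a simplified sketch, not distributed's actual `profile.watch` implementation):

```python
import sys
import threading
import time
from collections import Counter


def sample_frames(interval=0.01, duration=1.0, counts=None):
    """Periodically sample every thread's current frame from a helper thread.

    The point is that sys._current_frames() hands out references to live
    frame objects belonging to other threads, which is the kind of
    cross-thread access the linked tornado issue discusses.
    """
    counts = Counter() if counts is None else counts
    deadline = time.monotonic() + duration
    while time.monotonic() < deadline:
        for thread_id, frame in sys._current_frames().items():
            counts[(frame.f_code.co_filename, frame.f_lineno)] += 1
        time.sleep(interval)
    return counts


sampler = threading.Thread(target=sample_frames, daemon=True)
sampler.start()
```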

@TomAugspurger
Member Author

Test failure on Windows, Python 3.6, with tornado 5.x:

================================== FAILURES ===================================
___________________________ test_handles_exceptions ___________________________

    @pytest.mark.asyncio
    async def test_handles_exceptions():
        # Ensure that we properly handle exceptions in BatchedSend.
        # https://github.com/pangeo-data/pangeo/issues/788
        # mentioned in https://github.com/dask/distributed/issues/4080, but
        # possibly distinct.
        #
        # The reported issues (https://github.com/tornadoweb/tornado/pull/2008)
        # claim that the BufferError *should* only happen when the application
        # is incorrectly using threads. I haven't been able to construct an
        # actual example, so we mock IOStream.write to raise and ensure that
        # BufferedSend handles things correctly. We don't (yet) test that
        # any *users* of BatchedSend correctly handle BatchedSend dropping
        # messages.
        async with EchoServer() as e:
            comm = await connect(e.address)
            b = BatchedSend(interval=10)
            b.start(comm)
            await asyncio.sleep(0.020)
            orig = comm.stream.write
    
            n = 0
    
            def raise_buffererror(*args, **kwargs):
                nonlocal n
                n += 1
    
                if n == 1:
                    raise BufferError("bad!")
                elif n == 2:
                    orig(*args, **kwargs)
                else:
                    raise CommClosedError
    
            with mock.patch.object(comm.stream, "write", wraps=raise_buffererror):
                b.send("hello")
                b.send("hello")
                b.send("world")
                await asyncio.sleep(0.020)
                result = await comm.read()
                assert result == ("hello", "hello", "world")
    
                b.send("raises when flushed")
                await asyncio.sleep(0.020)  # CommClosedError hit in callback
    
                with pytest.raises(CommClosedError):
>                   b.send("raises when sent")
E                   Failed: DID NOT RAISE <class 'distributed.comm.core.CommClosedError'>

I believe we test Python 3.6 and tornado 5 on Linux in https://travis-ci.org/github/dask/distributed/jobs/731248712, so it's likely just a Windows + tornado 5 / Python 3.6 issue. Going to re-run, but I'll likely just skip the test there.
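The skip would presumably look something like this (a sketch; the exact condition and marker placement are guesses):

```python
import sys

import pytest
import tornado

WINDOWS = sys.platform.startswith("win")


# Hypothetical marker for the flaky combination described above.
@pytest.mark.skipif(
    WINDOWS and tornado.version_info[0] < 6 and sys.version_info < (3, 7),
    reason="flaky on Windows with Python 3.6 / tornado 5.x",
)
@pytest.mark.asyncio
async def test_handles_exceptions():
    ...
```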

@mrocklin
Member

mrocklin commented Sep 29, 2020 via email

@TomAugspurger
Member Author

Yes, it took a bunch of tries but I think I have confirmation that it did!

2020-09-29T20:21:59.022379955Z dask_gateway.dask_cli - INFO - Requesting scale to 30 workers from 1
2020-09-29T20:22:08.019871943Z distributed.batched - WARNING - Error in batched write, retrying
2020-09-29T20:24:20.094309559Z dask_gateway.dask_cli - INFO - Requesting scale to 1 workers from 30

The warning only showed up once, so retrying soon after must have been successful, and we didn't have to retire the worker.

That was using https://binder.pangeo.io/v2/gh/TomAugspurger/pangeo-binder-test/bf5ec5e89fcf46737ec9fb2b6715989815fdf591.

I haven't made much progress on a reproducer. In an attempt to identify profile.watch as the culprit, I reduced the profiling frequency (increased the tick/profile interval) and ran the workload repeatedly. It took 3-ish tries to reproduce the issue with the default settings, and I didn't reproduce it after 10 tries with the reduced frequency. So I'd say that's weak evidence for the profiler being the issue, but it'd be nice to actually verify that.

@mrocklin
Member

mrocklin commented Sep 29, 2020 via email

@TomAugspurger
Member Author

> increase the frequency watch, or introduce a small sleep at an awkward time there if we wanted to trigger this issue.

Tried both of those already and (surprisingly?) I still couldn't reproduce locally. I really thought that introducing a sleep somewhere in `process` (maybe only if `"tornado" in ident`) would trigger the condition.

I will test if increasing the profile frequency increases the likelihood that it raises on the pangeo cluster.

@TomAugspurger
Member Author

TomAugspurger commented Sep 30, 2020

With higher tick frequencies

{
    "DASK_DISTRIBUTED__WORKER__PROFILE__INTERVAL": "1ms",
    "DASK_DISTRIBUTED__WORKER__PROFILE__CYCLE": "10ms"
}
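Those environment variables map to dask config values, so the same settings could also be applied in code (a sketch, assuming the standard `distributed.worker.profile.*` keys):

```python
import dask

# Equivalent to the environment variables above: sample the profiler
# every 1 ms and aggregate the samples every 10 ms.
dask.config.set(
    {
        "distributed.worker.profile.interval": "1ms",
        "distributed.worker.profile.cycle": "10ms",
    }
)
```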

I did see the failure again with dask 2.28.0, but it's still not occurring reliably. I don't know if that makes the profiler more or less likely to be the culprit.

Short-term, I think merging this would be OK, even without a real reproducer. I'm still uncertain how fruitful additional debugging will be here. I don't know if it matters, but one thing I noticed while looking at the output of profile is that on dask-gateway we're using SSL. If a socket write over SSL is slower (is that actually the case?), then we might be more likely to hit this (under the assumption that if `bytearray += data` takes longer, then we have more time to make an additional reference to it).
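To illustrate the `bytearray += data` hypothesis: resizing a `bytearray` while another object holds a buffer export over it is exactly what raises `BufferError` (a standalone illustration of the mechanism, not tornado's write-buffer code):

```python
buf = bytearray(b"queued message")
view = memoryview(buf)  # something else holds a reference to the buffer

try:
    buf += b" more data"  # resizing while exported is not allowed
except BufferError as exc:
    print(exc)  # e.g. "Existing exports of data: object cannot be re-sized"

view.release()
buf += b" more data"  # fine once the export is released
```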

@TomAugspurger
Member Author

@mrocklin coming back to this, are you comfortable merging this despite not having a concrete reproducer? #4135 (comment) has decent evidence that we've fixed the issue for pangeo's use case.

@mrocklin
Member

mrocklin commented Oct 6, 2020

Yes, that's fine with me. Are we good to merge now? If so, then please go ahead.

@TomAugspurger merged commit b9dd003 into dask:master Oct 6, 2020
@TomAugspurger
Member Author

Thanks!

@TomAugspurger deleted the pangeo-788-robust-comm branch October 6, 2020 14:56
lr4d pushed a commit to lr4d/distributed that referenced this pull request Oct 9, 2020
This changes exception handling in BatchedSend to be more robust to failures.